Fault tolerance is the ability of an IT system or service to continue functioning correctly even when individual components fail. It is critically important for modern IT services: downtime can lead to serious financial losses (up to millions of dollars per hour for large online businesses), and service unavailability damages the reputation of the business.
Fault tolerance is usually quantified through system availability, which is expressed as a percentage of uptime or as a number of so-called “nines”:
- Three nines (99.9%): allowable downtime is about 8.8 hours per year;
- Four nines (99.99%): allowable downtime is about 52 minutes per year;
- Five nines (99.999%): allowable downtime is about 5 minutes per year;
- Six nines (99.9999%): allowable downtime is about 31 seconds per year.
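The downtime figures above follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year and using an illustrative helper name (`allowed_downtime_seconds` is not from any library), the yearly downtime budget can be computed like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap, 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    # The unavailable fraction of the year is (100% - availability).
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    targets = [
        ("Three nines", 99.9),
        ("Four nines", 99.99),
        ("Five nines", 99.999),
        ("Six nines", 99.9999),
    ]
    for label, percent in targets:
        seconds = allowed_downtime_seconds(percent)
        # Show the same budget in hours, minutes, and seconds for comparison.
        print(
            f"{label} ({percent}%): "
            f"{seconds / 3600:.2f} h, {seconds / 60:.1f} min, {seconds:.1f} s per year"
        )
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every step up the scale is noticeably harder to achieve.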
The more “nines” in the availability target, the stricter the requirements for system reliability and fault tolerance, and the more expensive it is to build and maintain the fault-tolerant infrastructure.
The main principle of building fault-tolerant systems is redundancy. Redundancy is a strategy in which additional (standby) resources are added to the system beyond the minimum required for operation, so that the system can continue operating if one or more components fail.
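To illustrate why redundancy raises availability, here is a rough sketch under a simplifying assumption: the redundant components are identical, fail independently of each other, and a single surviving component is enough to keep the service running. Under that assumption the system is down only when all replicas are down at once. The function name `parallel_availability` is illustrative.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Estimate availability of `replicas` identical redundant components.

    Assumptions (idealized): failures are independent, and the system works
    as long as at least one replica works.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Probability that every replica is down at the same time.
    all_down = (1 - component_availability) ** replicas
    return 1 - all_down


if __name__ == "__main__":
    # Example: a component that is available 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {parallel_availability(0.99, n):.6f}")
```

In practice failures are rarely fully independent (replicas may share power, network, or the same software defect), so the real gain is typically smaller than this estimate suggests.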
Building a fault-tolerant infrastructure requires significant investment, so it is important to find a balance between the cost of the solution and the required level of reliability. A professionally designed fault-tolerant infrastructure provides not only business continuity but also a competitive advantage, increasing the trust of clients and partners in the reliability of the services provided.