High Availability and Fault Tolerance

Most users I’ve talked with have a difficult time differentiating these two kinds of systems. Hopefully, this article will shed some light on how these two IT infrastructures differ from each other, and on how to match each of them with its respective Service Level Agreement.

High Availability means that the infrastructure has been set up so that it CAN STILL HAVE MINOR INTERRUPTIONS, due to the following factors:

  • Components are not fully fault tolerant.
  • Components are designed and placed so that each one has a redundant counterpart. An example is two database servers mirrored with each other: if one fails, the other handles the processing. The servers are not exactly fault tolerant or zero tolerant, but downtime is minimized because the second server takes over, either automatically or by a manual switch, when the first server goes down.

In short, high availability is a mixture of several components that take over the processing load if a similar component goes down. There could be a small downtime whether the components are configured to switch over automatically or are switched manually when a similar component goes down.
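The switch-over described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product does it: the `Server` and `FailoverPair` classes and the server names are hypothetical, and a real setup would also need health checks and data mirroring between the two servers.

```python
class Server:
    """Hypothetical stand-in for one of two mirrored servers."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} processed {request}"


class FailoverPair:
    """Sends requests to the active server and switches to the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.failovers = 0

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # The time spent detecting the failure and switching over is the
            # small downtime that high availability still allows.
            self.active, self.standby = self.standby, self.active
            self.failovers += 1
            # If the standby is down too, this raises and the outage is real.
            return self.active.handle(request)


pair = FailoverPair(Server("db1"), Server("db2"))
print(pair.handle("query 1"))   # served by db1
pair.active.healthy = False     # simulate the first server going down
print(pair.handle("query 2"))   # served by db2 after the switch
```

Note that the request that hits the failed server is delayed until the switch completes; it is not lost in this sketch only because the caller retries on the standby.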

Normally, creating a highly available infrastructure means always having two copies of everything, running on two different pieces of equipment, so that if the first goes down, the second takes over until the first is diagnosed and repaired.

Fault Tolerance, on the other hand, is the same as Zero Tolerance, which, if we dig deeper, means ZERO TOLERANCE for downtime. It simply means that the infrastructure cannot go down or be unavailable. Normally, such an infrastructure contains fault-tolerant equipment with two motherboards, two hard disks, two power supplies, and two memory modules integrated within a central resource unit (what we call a chassis), in active-active mode, meaning both sets are working and replicating in real time. This ensures that if one component goes down, the other keeps working, thereby eliminating the unavailability factor.
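To contrast this with a switch-over, here is a rough software analogy of active-active mode. It is only a sketch under stated assumptions: real fault-tolerant servers do this in hardware, and the `Module` class and `run_active_active` function are hypothetical names made up for the illustration.

```python
class Module:
    """Hypothetical stand-in for one of two duplicated components in a chassis."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def compute(self, value):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return value * 2


def run_active_active(modules, value):
    """Run the same work on every module and use any result that comes back."""
    results = []
    for module in modules:
        try:
            results.append(module.compute(value))
        except RuntimeError:
            # A failed module is simply skipped. No switch-over step is needed,
            # because the other module was already doing the same work.
            continue
    if not results:
        raise RuntimeError("all redundant modules have failed")
    return results[0]


a, b = Module("board A"), Module("board B")
print(run_active_active([a, b], 21))
a.healthy = False                      # one board fails
print(run_active_active([a, b], 21))   # same answer, no interruption
```

The point of the sketch is that the caller gets the same answer before and after the failure, without any step where a standby has to be brought in.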

In reality, high availability components are cheaper per unit, but would be more expensive to implement because you need:

  • Two units of each piece of equipment.
  • Two licenses, one for each unit.

On a per-unit basis, fault-tolerant equipment is more expensive, but implementation-wise it would be cheaper because:

  • You only buy one fault-tolerant machine.
  • You only pay for one license per piece of equipment, instead of two.
  • It is easier to manage and maintain.

Whichever way it goes, it is up to each user and business to decide whether high availability or fault tolerance works best for their scenario. But consider these:

  • If the business requires second-by-second updates on transactions, fault tolerance is advised. Examples are banks and stock exchanges.
  • If the business does not require this, then high availability is probably the better fit.

But then consider the costs:

  • For high availability, the costs involved are two units of each piece of equipment or component of the infrastructure (for example: two application servers, two database servers, two routers, two firewalls), all configured as redundant.
  • For high availability, you would need to license two application servers, two database servers, and so on.
  • Also for high availability, you would need skilled staff to manage the redundancy of the servers. This would be very costly if the business has to manage, let’s say, redundant Oracle RAC servers.
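The cost reasoning above can be put into a small calculation. This is a minimal sketch: the function names are made up for the illustration, the prices are placeholder figures rather than real quotes, and staffing costs are left out to keep it short.

```python
def ha_cost(unit_price, license_price):
    """High availability: two units of the equipment and two licenses."""
    return 2 * (unit_price + license_price)


def ft_cost(unit_price, license_price):
    """Fault tolerance: one (more expensive) unit and one license."""
    return unit_price + license_price


# Placeholder figures only, chosen to illustrate the comparison.
ha_total = ha_cost(unit_price=10_000, license_price=5_000)
ft_total = ft_cost(unit_price=18_000, license_price=5_000)

print("High availability:", ha_total)
print("Fault tolerance:  ", ft_total)
print("Fault tolerance is cheaper:", ft_total < ha_total)
```

Under these assumptions, and with the same license price for both options, the fault-tolerant option comes out cheaper as long as its unit price is below twice the high availability unit price plus one license. Plug in actual quotes before drawing any conclusion.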

For the cases I’ve enumerated above, the cost of fault-tolerant equipment apparently pays for itself in the long run. Although the market has recognized this and has created solutions for the fault-tolerance requirements of businesses, we still see only a handful of these products on the market today, from a few server providers that offer 100% fault-tolerant solutions. Other manufacturers have also started introducing semi-fault-tolerant solutions, but these may not deliver on a promise of 100% availability.

Let’s wait around two more years for the market to mature on these.