Sunday 2 December 2007

Understanding High Availability

This is a short article that introduces the concepts of High Availability to someone who is familiar with application software but not with infrastructure.

High Availability is an approach to building fault-tolerant solutions that provide uninterrupted service to users. Highly available solutions are built on redundancy: a backup takes over the role of the failed component so that service is not disrupted (failover). Failover can be stateless or stateful. In a stateless failover, the solution providing the services is simply restarted without any restoration of data, whereas in a stateful failover the data is checkpointed to a standby hosting the same services. High Availability needs to cater to both planned and unplanned service disruptions.
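To make the distinction concrete, here is a minimal Python sketch of stateful failover via checkpointing. The path /shared/service.ckpt is a hypothetical shared-storage location that both the primary and the standby can reach:

    import json
    import os
    import tempfile

    CHECKPOINT_PATH = "/shared/service.ckpt"  # hypothetical shared-storage location

    def checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
        """Stateful failover: the primary periodically persists its state so a
        standby can resume from the last checkpoint instead of restarting cold."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic swap so the standby never reads a partial file

    def restore(path: str = CHECKPOINT_PATH) -> dict:
        """On takeover the standby loads the checkpoint; if none exists it
        starts from an empty state, which is effectively the stateless case."""
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}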


Availability is measured by the formula

Availability = MTBF / (MTBF + MTTR)

where MTBF is Mean Time Between Failures, calculated from the expected life of the various components of a system,

and MTTR is Mean Time To Repair, a measure of the time required to bring a system back online after a failure.
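The formula translates directly into code. A minimal Python sketch (the MTBF and MTTR figures are illustrative assumptions, not measurements):

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Example: a server that fails on average every 10,000 hours and takes
    # 1 hour to repair is ~99.99% available ("four nines").
    print(availability(10_000, 1))  # 0.9999000...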

High Availability is defined in terms of the number of 9's, as shown in the following table:

Number of 9's    Downtime / Year    Typical Application
99.9%            ~ 9 hours          Desktop
99.99%           ~ 1 hour           Enterprise server
99.999%          ~ 5 minutes        Carrier-class switch
99.9999%         ~ 30 seconds       Carrier switch equipment
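The downtime column follows directly from the availability figure: expected downtime per year is (1 - availability) x 8,760 hours. A short Python check of the table values:

    HOURS_PER_YEAR = 365 * 24  # 8,760

    def downtime_per_year(availability: float) -> float:
        """Expected downtime in hours per year for a given availability."""
        return (1 - availability) * HOURS_PER_YEAR

    for a in (0.999, 0.9999, 0.99999, 0.999999):
        print(f"{a:.4%}: {downtime_per_year(a) * 60:.1f} minutes/year")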

Failover Design Principles

The basic design principles to consider while evaluating a failover design are:

- The failover design needs to be simple and transparent to the clients who access the server's services.
- The failover needs to be quick, and the standby needs to be kept booted up at all times.
- Manual intervention should be minimal, and the tasks involved in failover should be automated.
- The failover should guarantee data access: the receiving host should see exactly the same data as the original host.
- The failover design should not target only the current configuration but needs to allow for future growth.
- The Operating System needs to be compatible with the failover design.

Failover designs are based on the concepts of Clustering, Load Balancing and Shared Storage.

Clustering

Clustering, by definition, means configuring a secondary system as a standby for the primary system. If the primary system fails, the standby automatically steps in with minimal disruption in service and takes over the functions of the original primary. The downtime due to a failure in the primary is limited to the takeover time.


A cluster requires several components to turn two systems into a failover pair: network connections (a heartbeat network, a service network, and an administrative network), disks (shared or unshared), and application portability. The basic design principle is that there should be no single point of failure.
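As an illustration, here is a minimal Python sketch of the heartbeat mechanism, assuming a hypothetical UDP address on the dedicated heartbeat network; the take_over() routine is a placeholder for claiming the service IP, mounting the shared disks, and starting the application:

    import socket
    import time

    HEARTBEAT_ADDR = ("10.0.0.2", 9999)  # hypothetical address on the heartbeat network
    INTERVAL = 1.0                       # primary sends a beat every second
    TIMEOUT = 3 * INTERVAL               # standby declares failure after three missed beats

    def primary_heartbeat() -> None:
        """Runs on the primary: broadcast liveness over the heartbeat network."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(b"alive", HEARTBEAT_ADDR)
            time.sleep(INTERVAL)

    def standby_monitor() -> None:
        """Runs on the standby: if no beat arrives within TIMEOUT, take over."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(HEARTBEAT_ADDR)
        sock.settimeout(TIMEOUT)
        while True:
            try:
                sock.recv(16)  # beat received; the primary is healthy
            except socket.timeout:
                take_over()    # placeholder: claim service IP, mount disks, start app
                break

    def take_over() -> None:
        print("Primary missed heartbeats; standby assuming the service role")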


Load Balancing

Load balancing optimizes the performance of a system by distributing traffic efficiently among a group of network servers, which results in high availability and scalability.

A single virtual IP address, which maps to the addresses of each of the servers hosting the solution, is advertised by the router-based load balancer. If a host is removed or becomes unavailable, an incoming request does not risk hitting a "dead" server, since all of the solution's host machines behind the load balancer appear to have the same IP address. When a response is returned, the client sees it coming from the load balancer. In other words, the client is dealing with a single machine: the router-based load balancer.


Load balancers support different load-balancing techniques, which can be set up and configured (each is sketched in code below):

- Round robin. Assigns connections sequentially among the servers in the cluster.
- Least connections. The server with the fewest open connections gets the next connection.
- Weighted distribution. Divides the load among the servers based on an individual weight or priority assigned to each server in the cluster. This technique is often used for heterogeneous clusters consisting of servers running on different platforms.
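Each technique is straightforward to express in code. A minimal Python sketch, with a hypothetical server pool and weights chosen only for illustration (a real balancer would update the connection counts as clients connect and disconnect):

    import itertools
    import random

    servers = ["app1", "app2", "app3"]           # hypothetical server pool
    weights = {"app1": 5, "app2": 3, "app3": 1}  # e.g. app1 is the most powerful box
    active = {s: 0 for s in servers}             # open connections per server

    rr = itertools.cycle(servers)

    def round_robin() -> str:
        """Hand out servers in a fixed rotation."""
        return next(rr)

    def least_connections() -> str:
        """Pick the server currently holding the fewest open connections."""
        return min(active, key=active.get)

    def weighted() -> str:
        """Pick a server with probability proportional to its weight."""
        return random.choices(servers, weights=[weights[s] for s in servers])[0]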


Storage Area Network

A storage area network is a network that interconnects storage and computing systems into a shared pool that many different hosts can access. SANs can be built on a number of underlying technologies, most commonly Fibre Channel (FC) or SCSI over IP (iSCSI). The storage devices on a SAN are usually limited to disks and tape drives.

The SAN is central to configurations such as N+1, 2N, or even the more complex N-to-1 and N-to-N, allowing them to fail over in a smart, quick, and efficient manner.
