Linux and Unix cheatsheet: High Availability

HA

In today’s complex environments, providing continuous service for applications is a key component of a successful IT implementation. High availability is one of the components that contributes to continuous service for application clients, by masking or eliminating both planned and unplanned system and application downtime. This is achieved through the elimination of hardware and software single points of failure (SPOFs).

A high availability solution ensures that the failure of any component of the solution, whether hardware, software, or system management, does not make the application and its data unavailable to the user.
A high availability solution should eliminate single points of failure (SPOFs) through appropriate design, planning, selection of hardware, configuration of software, and a carefully controlled change management discipline.

Downtime

Downtime is the period during which an application is not available to serve its clients. Downtime can be classified as:

Planned:
Hardware upgrades
Repairs
Software updates/upgrades
Backups (offline backups)
Testing (periodic testing is required for cluster validation)
Development

Unplanned:
Administrator errors
Application failures
Hardware failures
Environmental disasters
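
As a rough illustration of how downtime translates into an availability figure, the sketch below computes availability as the fraction of a period during which the application could serve its clients. The function name and the sample downtime values are hypothetical and are not taken from any product; both planned and unplanned downtime are assumed to count against availability.

```python
def availability_percent(period_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the observed period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

# Hypothetical sample values for one year of operation.
planned_hours = 6.0     # e.g. upgrades, offline backups, testing
unplanned_hours = 2.76  # e.g. hardware or application failures

total_downtime = planned_hours + unplanned_hours
print(round(availability_percent(HOURS_PER_YEAR, total_downtime), 2))
```

With these sample values the total is 8.76 hours of downtime in a year, which corresponds to 99.9% availability.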

For a high availability solution you need:
Redundant servers
Redundant networks
Redundant network adapters
Monitoring
Failure detection
Failure diagnosis
Automated fallover
Automated reintegration
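
The monitoring, failure detection, and automated fallover items above can be sketched as a simple heartbeat check. This is only a minimal illustration under stated assumptions, not how any particular cluster product works: the `Node` class, the timeout value, and the rule "move everything to the first healthy node" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) the last heartbeat was received
    resources: list = field(default_factory=list)


def failed_nodes(nodes, now, timeout):
    """Failure detection: a node has failed if its heartbeat is older than timeout."""
    return [n for n in nodes if now - n.last_heartbeat > timeout]


def fallover(nodes, now, timeout):
    """Move resources from failed nodes to the first healthy node.

    Returns a list of (resource, from_node, to_node) tuples describing the moves.
    """
    failed_names = {n.name for n in failed_nodes(nodes, now, timeout)}
    healthy = [n for n in nodes if n.name not in failed_names]
    moves = []
    if not healthy:
        return moves  # no surviving node can take over
    target = healthy[0]
    for node in nodes:
        if node.name not in failed_names:
            continue
        while node.resources:
            resource = node.resources.pop(0)
            target.resources.append(resource)
            moves.append((resource, node.name, target.name))
    return moves


# Hypothetical two-node cluster: node_a stopped sending heartbeats at t=100.
node_a = Node("node_a", last_heartbeat=100.0, resources=["app", "service_ip"])
node_b = Node("node_b", last_heartbeat=109.0)
print(fallover([node_a, node_b], now=110.0, timeout=5.0))
```

Automated reintegration would be the reverse step once the failed node sends heartbeats again; it is left out here to keep the sketch short.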

High availability versus fault tolerance

Fault-tolerant systems

Fault-tolerant systems are designed to operate virtually without interruption, regardless of the failure that may occur (except perhaps for a complete site outage caused by a natural disaster). In such systems, all components, both hardware and software, are at least duplicated.

High availability systems

Systems configured for high availability combine hardware and software components in such a way as to ensure automated recovery in case of failure, with minimal acceptable downtime.

In such systems, the software involved detects problems in the environment and then transfers the application to another machine, which takes over the identity of the original machine (node).

Thus, it is very important to eliminate all single points of failure (SPOF) in the environment. For example, if the machine has only one network connection, a second network interface should be provided in the same node to take over in case the primary adapter providing the service fails.
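
One way to look for single points of failure is to walk an inventory of components and flag anything that exists only once. The sketch below assumes a hypothetical inventory mapping component types to instance counts; the component names and the "at least two of everything" rule are illustrative assumptions.

```python
def find_spofs(inventory, minimum=2):
    """Return the component types with fewer than `minimum` instances, sorted by name."""
    return sorted(name for name, count in inventory.items() if count < minimum)


# Hypothetical inventory for one node: a single network adapter is a SPOF.
inventory = {"network_adapter": 1, "power_supply": 2, "disk": 2}
print(find_spofs(inventory))  # prints ['network_adapter']
```

Adding a second adapter to the inventory would make the returned list empty.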

Another important consideration is protecting the data by mirroring it and placing it on shared disk areas accessible from any machine in the cluster.
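
The idea behind mirroring can be shown with a toy in-memory store: every write goes to all copies, so a read still succeeds after one copy fails. This is a simplified sketch only; the `MirroredStore` class is hypothetical, real mirroring happens at the disk or volume manager level, and resynchronizing a copy that comes back is not handled here.

```python
class MirroredStore:
    """Toy mirror: writes go to every healthy copy, reads use the first healthy copy."""

    def __init__(self, copies=2):
        if copies < 2:
            raise ValueError("a mirror needs at least two copies")
        self._copies = [dict() for _ in range(copies)]
        self._failed = set()

    def write(self, key, value):
        for index, copy in enumerate(self._copies):
            if index not in self._failed:
                copy[key] = value

    def fail_copy(self, index):
        """Simulate the loss of one copy (for example, a failed disk)."""
        self._failed.add(index)

    def read(self, key):
        for index, copy in enumerate(self._copies):
            if index not in self._failed:
                return copy[key]
        raise RuntimeError("all copies have failed")


store = MirroredStore()
store.write("orders", "last transaction")
store.fail_copy(0)           # the first copy is lost
print(store.read("orders"))  # the data is still served from the second copy
```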

Types of HA:

Standalone
- Downtime: a couple of days
- Data availability: last full backup

Enhanced Standalone
- Downtime: a couple of hours
- Data availability: last transactions

HA Cluster
- Downtime: depends on the configuration (possibly a few minutes)
- Data availability: last transactions

Fault-Tolerant Computers
- Downtime: none
- Data availability: no data loss
