1. High Availability (HA) and Failover
High Availability (HA) refers to designing systems that ensure continuous operation and minimal downtime, even in the face of hardware or software failures.
Failover is the mechanism that transfers control to a standby system, either automatically or through operator action, when a failure is detected in the primary system.
Key Concepts:
- Redundancy: Having multiple components ready to take over in case one fails.
- Failover: Automatic or manual switch to a backup system.
- Recovery: Restoring the failed system back to a usable state.
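To make the value of redundancy concrete, the sketch below estimates how availability compounds when several redundant nodes can each serve traffic. It assumes node failures are independent, which real systems only approximate, and the function name and the 99% figure are illustrative rather than taken from any particular system.

```python
def combined_availability(node_availability: float, node_count: int) -> float:
    """Probability that at least one of `node_count` nodes is up.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # The system is down only if every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** node_count


for count in (1, 2, 3):
    print(f"{count} node(s): {combined_availability(0.99, count):.6f}")
```

Under these assumptions, each extra node removes most of the remaining downtime, which is why redundancy is the starting point for both configurations described next.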
2. Active-Active vs. Active-Passive Configurations
High availability systems typically implement one of two failover configurations:
2.1 Active-Active Configuration
In Active-Active, multiple instances (nodes) run in parallel. They all handle traffic, sharing the load. If one node fails, the others keep serving requests and absorb its share of the load.
How It Works:
+-----------+   +-----------+   +-----------+
| Server 1  |   | Server 2  |   | Server 3  |
+-----------+   +-----------+   +-----------+
        \             |             /
          \           |           /
          +-----------------------+
          |     Load Balancer     |
          +-----------------------+
- All nodes are live.
- A load balancer distributes traffic evenly across the healthy nodes.
- Health checks monitor node availability.
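The routing behaviour described above can be sketched as a round-robin balancer that skips nodes marked unhealthy. This is a simplified illustration, not how any specific load balancer is implemented: the `Server` and `RoundRobinBalancer` names are hypothetical, and the health flag is set by hand instead of by a real health check.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class Server:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Cycles through servers in order, skipping unhealthy ones."""

    def __init__(self, servers: List[Server]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = servers
        self._ring: Iterator[Server] = cycle(servers)

    def next_server(self) -> Server:
        # Try each server at most once per call before giving up.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


servers = [Server("Server 1"), Server("Server 2"), Server("Server 3")]
balancer = RoundRobinBalancer(servers)

servers[1].healthy = False  # simulate a failed health check on Server 2
routed = [balancer.next_server().name for _ in range(4)]
print(routed)  # Server 2 never appears while it is unhealthy
```

Because every node is already serving traffic, removing one from rotation is all the "failover" an Active-Active setup needs; the trade-off is that the remaining nodes must have enough spare capacity for the extra load.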
Advantages:
- Efficient resource usage (all nodes are active).
- Scales easily with traffic.
- Reduces downtime risk.
Disadvantages:
- Complex data consistency management.
- More expensive due to running multiple active instances.
Use Case Example:
A real-time chat app with millions of users where downtime is unacceptable. Each instance handles messages and user sessions. The load balancer ensures that if one server fails, the others continue processing.
2.2 Active-Passive Configuration
In Active-Passive, only one node (active) processes requests. Another node (passive) is on standby and takes over if the active one fails.
How It Works:
    +--------------+
    | Active Node  |
    +--------------+
           |
+----------------------+
|       Clients        |
+----------------------+

   (Failure Detected)

    +--------------+
    | Passive Node |
    +--------------+
           |
+----------------------+
|       Clients        |
+----------------------+
- The passive node is regularly synchronized with the active one.
- Monitoring tools or heartbeats detect failure and trigger failover.
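Heartbeat-based detection can be sketched as a timeout: if the active node has not reported within a set window, the monitor treats it as failed and the passive node takes over. The class name, the 3-second timeout, and the explicit timestamps below are all assumptions chosen to keep the example deterministic; a real monitor would read a clock and receive heartbeats over the network.

```python
from typing import Optional


class HeartbeatMonitor:
    """Declares the active node failed when heartbeats stop arriving."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def active_has_failed(self, now: float) -> bool:
        if self.last_heartbeat is None:
            return False  # nothing received yet; not treated as a failure here
        return (now - self.last_heartbeat) > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=3.0)
serving = "active"

# Timestamps are passed in explicitly so the run is deterministic.
monitor.record_heartbeat(now=0.0)
monitor.record_heartbeat(now=1.0)  # last heartbeat before the simulated crash

for now in (2.0, 4.0, 5.0):
    if serving == "active" and monitor.active_has_failed(now):
        serving = "passive"
        print(f"t={now}: heartbeat timed out, failing over to the passive node")

print(f"Requests are now served by the {serving} node")
```

The timeout is the main tuning knob: a short window shortens the outage but risks failing over on a brief network hiccup, while a long window does the opposite.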
Advantages:
- Easier to manage consistency.
- Lower cost if a passive node is used only on failover.
Disadvantages:
- Failover delay: requests can fail during the time it takes to detect the failure and switch to the passive node.
- Idle resources in normal operation.
Use Case Example:
A relational database using a primary-replica setup. If the primary goes down, the replica is promoted to primary, which minimizes downtime. Data integrity is preserved to the extent that the replica had caught up with the primary before the failure.
3. Handling Failover and Recovery
3.1 Handling Failover
Failover ensures that the service continues even if a component fails.
Steps in Automatic Failover:
+-------------------+      +-------------------+      +-------------------+
|   Health Check    | ---> |  Detect Failure   | ---> | Trigger Failover  |
+-------------------+      +-------------------+      +-------------------+
                                                                |
                                                  +--------------------------+
                                                  |  Switch to Backup Node   |
                                                  +--------------------------+
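The four stages in the diagram can be sketched as a small controller that consumes health-check results. One assumption is added here that the text does not state: a failure is declared only after several consecutive failed checks, a common guard against failing over on a single dropped probe. The `FailoverController` name and the threshold of 3 are hypothetical.

```python
from typing import List


class FailoverController:
    """Health check -> detect failure -> trigger failover -> switch node."""

    def __init__(self, nodes: List[str], failure_threshold: int = 3) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = nodes
        self.failure_threshold = failure_threshold
        self.active_index = 0
        self.consecutive_failures = 0

    @property
    def active_node(self) -> str:
        return self.nodes[self.active_index]

    def observe(self, check_passed: bool) -> bool:
        """Feed in one health-check result; return True if failover fired."""
        if check_passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False  # not enough evidence of a real failure yet
        # Failure detected: switch to the next node in the list.
        # Simplification: the backup's own health is not verified here.
        self.active_index = (self.active_index + 1) % len(self.nodes)
        self.consecutive_failures = 0
        return True


controller = FailoverController(["primary", "backup"], failure_threshold=3)

# One isolated failure is ignored; three in a row trigger failover.
for passed in [True, False, True, False, False, False]:
    if controller.observe(passed):
        print(f"Failover triggered; active node is now {controller.active_node}")
```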
Failover Methods:
- Manual Failover: Requires human intervention.
- Automatic Failover: Uses monitoring and automation tools like:
  - Load Balancers (e.g., AWS ELB)
  - Cluster managers (e.g., Kubernetes)
  - Service Mesh (e.g., Istio)
Python Example: Simulated Automatic Failover Logic
from typing import List


class Node:
    """A server that can report whether it is healthy."""

    def __init__(self, name: str, is_alive: bool = True):
        self.name = name
        self.is_alive = is_alive

    def check_health(self) -> bool:
        return self.is_alive


def perform_failover(nodes: List[Node]) -> str:
    # Route traffic to the first healthy node, in priority order.
    for node in nodes:
        if node.check_health():
            return f"Traffic routed to: {node.name}"
    return "All nodes are down. Manual intervention required."


# Simulated nodes: the primary has failed, the backup is healthy
primary = Node("Primary", is_alive=False)
backup = Node("Backup", is_alive=True)
print(perform_failover([primary, backup]))
3.2 Handling Recovery
Recovery is the process of bringing failed components back into operation and rebalancing the system.
Steps in Recovery:
+-----------------------+
|  Identify Root Cause  |
+-----------------------+
            |
+-----------------------+
|    Fix and Restart    |
+-----------------------+
            |
+-----------------------+
|  Re-synchronize Data  |
+-----------------------+
            |
+-----------------------+
|    Rebalance Load     |
+-----------------------+
Recovery Example:
Imagine a PostgreSQL database with streaming replication:
- Failure: The primary goes down.
- Promotion: The replica is promoted and becomes the new primary.
- Fix: The original primary is repaired.
- Re-sync: The original primary rejoins as a replica once it has caught up with the new primary.
Python Pseudocode to Represent Recovery Steps:
class DatabaseNode:
    """A database server with a replication role and a sync flag."""

    def __init__(self, role: str, synced: bool = True):
        self.role = role
        self.synced = synced

    def promote(self) -> None:
        self.role = "primary"

    def demote_and_resync(self) -> None:
        # Rejoin as a replica; in a real system, catching up takes time.
        self.role = "replica"
        self.synced = True


primary_db = DatabaseNode("primary")
replica_db = DatabaseNode("replica")

# Simulate failure: the primary misses writes while it is down
primary_db.synced = False

# Promote replica
replica_db.promote()
print(f"Replica promoted to: {replica_db.role}")

# Recover the old primary: it rejoins as a replica and catches up
recovered_node = primary_db
recovered_node.demote_and_resync()
print(f"Recovered node is now: {recovered_node.role}, synced: {recovered_node.synced}")
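The example above treats re-synchronization as a single flag. As a rough sketch of what "catching up" involves, the code below models replication as an append-only log of changes, with the recovered node replaying every entry it missed. This is only loosely inspired by log-based replication and is not how PostgreSQL implements it; `LogReplica`, `LogEntry`, and the sample keys are hypothetical.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str]  # (key, value) written on the primary


class LogReplica:
    """Applies entries from a shared change log, tracking its position."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied = 0  # number of log entries applied so far

    def lag(self, log: List[LogEntry]) -> int:
        return len(log) - self.applied

    def catch_up(self, log: List[LogEntry]) -> int:
        """Replay entries not yet applied; return how many were replayed."""
        pending = log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied = len(log)
        return len(pending)


log: List[LogEntry] = [("user:1", "alice"), ("user:2", "bob")]
recovered = LogReplica()
recovered.catch_up(log)  # fully in sync before the outage

# Writes accepted by the new primary while the old one was down
log.append(("user:2", "robert"))
log.append(("user:3", "carol"))

print(f"Lag before re-sync: {recovered.lag(log)} entries")
print(f"Replayed: {recovered.catch_up(log)} entries")
print(f"Lag after re-sync: {recovered.lag(log)} entries")
```

Tracking the applied position is what lets the node resume from where it stopped instead of copying the whole dataset again, and the lag figure gives a simple signal for when it is safe to route reads to it.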
Summary
High Availability (HA) ensures systems remain operational with minimal downtime. Failover enables smooth transition from failed components to standby ones. There are two main HA architectures:
ACTIVE-ACTIVE CONFIGURATION
+---------+   +---------+   +---------+
| Node A  |   | Node B  |   | Node C  |
+---------+   +---------+   +---------+
       \           |           /
         \         |         /
          +-----------------+
          |  Load Balancer  |
          +-----------------+
ACTIVE-PASSIVE CONFIGURATION
+-------------+      +--------------+
| Active Node | ---> | Passive Node |
+-------------+      +--------------+
   (Failure)           (Promotion)
- Active-Active improves performance and resilience, at the cost of complexity.
- Active-Passive is simpler but may introduce failover delays.
Effective failover and recovery processes involve monitoring, automation, redundancy, and re-synchronization. These systems are vital for mission-critical applications and help ensure reliability, performance, and user satisfaction.