Topics

Leader Election

1. Introduction to Leader Election

Leader election is a fundamental coordination mechanism in distributed systems that designates a single node as the “leader” to coordinate actions, assign tasks, or make critical decisions. This pattern is essential for maintaining consistency and avoiding conflicts in scenarios where multiple nodes might otherwise compete for the same resources or responsibilities.

Leader election becomes crucial in various distributed system scenarios:

Consensus Management: Ensuring only one node makes critical decisions at a time
Task Distribution: Having a single coordinator assign work to worker nodes
Configuration Management: Maintaining a single source of truth for system configuration
Resource Allocation: Preventing resource conflicts by centralizing allocation decisions
Failure Recovery: Coordinating system recovery after failures

The challenge lies in ensuring all nodes agree on the same leader while handling network partitions, node failures, and varying message delivery times. Different algorithms approach this challenge with varying trade-offs between simplicity, network efficiency, and fault tolerance.

2. Leader Election Algorithms

2.1 Bully Algorithm

The Bully Algorithm is one of the most straightforward leader election algorithms, operating on the principle that the node with the highest unique identifier should become the leader. Its name comes from the way higher-ID nodes “bully” lower-ID nodes out of leadership contention.

How the Bully Algorithm Works

The algorithm operates through a series of message exchanges that ensure the highest-ID active node becomes the leader:

Bully Algorithm Flow:

Initial State (Node 3 detects leader failure):
┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
│ N1  │    │ N2  │    │ N3  │    │ N4  │    │ N5  │
│ID:1 │    │ID:2 │    │ID:3 │    │ID:4 │    │ID:5 │
└─────┘    └─────┘    └─────┘    └─────┘    └─────┘
                         ▲                     ▲
                         │                     │
                    Initiator            Failed Leader
Step 1 - Election Messages:
Node 3 sends election messages to higher-ID nodes
┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
│ N1  │    │ N2  │    │ N3  │    │ N4  │    │ N5  │
│ID:1 │    │ID:2 │    │ID:3 │    │ID:4 │    │ID:5 │
└─────┘    └─────┘    └─────┘    └─────┘    └─────┘
                         │          ▲         ▲
                         │   Election│  Election
                         └──────────────────────┘
Step 2 - Response from Higher-ID Nodes:
Node 4 responds, Node 5 is down
┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
│ N1  │    │ N2  │    │ N3  │    │ N4  │    │ N5  │
│ID:1 │    │ID:2 │    │ID:3 │    │ID:4 │    │ID:5 │
└─────┘    └─────┘    └─────┘    └─────┘    └─────┘
                         ▲          │         X
                         │    OK    │      (Down)
                         └──────────┘
Step 3 - Node 4 Initiates Its Own Election:
Node 4 sends election to Node 5
┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
│ N1  │    │ N2  │    │ N3  │    │ID:4 │    │ N5  │
│ID:1 │    │ID:2 │    │ID:3 │    │ID:4 │    │ID:5 │
└─────┘    └─────┘    └─────┘    └─────┘    └─────┘
                                    │         ▲
                                    │ Election│
                                    └─────────┘
Step 4 - No Response, Node 4 Declares Victory:
Node 4 broadcasts coordinator message
┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐
│ N1  │    │ N2  │    │ N3  │    │ N4  │    │ N5  │
│ID:1 │    │ID:2 │    │ID:3 │    │LEADER│   │ID:5 │
└─────┘    └─────┘    └─────┘    └─────┘    └─────┘
   ▲          ▲          ▲          │         X
   │Coordinator│Coordinator│Coordinator│      (Down)
   └──────────────────────────────────┘

Detailed Algorithm Steps

Failure Detection: A node detects that the current leader has failed (no heartbeat, failed request, etc.)
Election Initiation: The detecting node sends “ELECTION” messages to all nodes with higher IDs
Response Handling:

If any higher-ID node responds, the initiating node steps back and waits
If no higher-ID node responds, the initiating node proceeds to declare itself the leader

Leadership Declaration: The new leader broadcasts a “COORDINATOR” message to all nodes with lower IDs
Acknowledgment: Lower-ID nodes acknowledge the new leader and update their leader information

Real-World Example: Database Cluster

Consider a database cluster with 5 nodes, where Node 5 (the highest ID) was the primary:

Database Cluster Scenario:
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Database   │  │  Database   │  │  Database   │
│   Node 1    │  │   Node 2    │  │   Node 3    │
│   (ID: 1)   │  │   (ID: 2)   │  │   (ID: 3)   │
│  Secondary  │  │  Secondary  │  │  Secondary  │
└─────────────┘  └─────────────┘  └─────────────┘
┌─────────────┐  ┌─────────────┐
│  Database   │  │  Database   │
│   Node 4    │  │   Node 5    │
│   (ID: 4)   │  │   (ID: 5)   │
│  Secondary  │  │   PRIMARY   │ ← Fails
└─────────────┘  └─────────────┘

When Node 5 fails:

Node 3 detects the failure first
Node 3 sends election messages to Nodes 4 and 5
Node 4 responds, Node 5 doesn’t (it’s down)
Node 4 initiates its election to Node 5
No response from Node 5, so Node 4 becomes the new primary
Node 4 broadcasts its leadership to all other nodes

Advantages and Disadvantages

Advantages:

Simplicity: Extremely easy to understand and implement
Deterministic: Always selects the highest ID available node
Quick Recovery: Fast leader election in small clusters
No Complex State: Minimal state management required

Disadvantages:

High Network Traffic: O(n²) messages in the worst case for n nodes
Scalability Issues: Performance degrades significantly with cluster size
Potential Instability: Continuous elections if the highest-ID node keeps failing
Network Partition Sensitivity: Can result in split-brain scenarios

2.2 Raft Algorithm (Leader Election)

Raft’s leader election is part of a broader consensus algorithm designed to be understandable while providing strong consistency guarantees. Unlike the Bully Algorithm’s ID-based approach, Raft uses majority voting with randomized timeouts to prevent election conflicts.

How Raft Leader Election Works

Raft nodes exist in one of three states: Leader, Follower, or Candidate. The election process uses terms (logical time) and majority consensus:

Raft Leader Election States:
┌─────────────┐    Timeout    ┌─────────────┐    Majority  ┌─────────────┐
│  Follower   │ ────────────> │  Candidate  │ ───────────> │   Leader    │
│             │               │             │     Votes    │             │
└─────────────┘               └─────────────┘              └─────────────┘
       ▲                             │                             │
       │          Lost Election      │                             │
       └─────────────────────────────┘                             │
       ▲                                                           │
       │                     Heartbeat                             │
       └───────────────────────────────────────────────────────────┘

Detailed Raft Election Process

Step 1 - Election Timeout

Normal Operation with Heartbeats:
┌─────────┐    Heartbeat   ┌─────────┐    Heartbeat    ┌─────────┐
│Follower │ <────────────  │ Leader  │ ──────────────> │Follower │
│ Node A  │                │ Node B  │                 │ Node C  │
└─────────┘                └─────────┘                 └─────────┘
Leader Failure - Election Timeout:
┌─────────┐    No Heartbeat  ┌─────────┐    No Heartbeat ┌─────────┐
│Follower │                  │ Leader  │                 │Follower │
│ Node A  │                  │ Node B  │ ← Failed        │ Node C  │
│Timeout! │                  └─────────┘                 │Timeout! │
└─────────┘                                              └─────────┘

Step 2 - Candidate State and Vote Request

Candidate Election:
┌─────────┐    Vote Request   ┌─────────┐    Vote Request  ┌─────────┐
│Candidate│ ─────────────────>│Follower │<──────────────── │Candidate│
│ Node A  │                   │ Node D  │                  │ Node C  │
│ Term: 5 │                   │         │                  │ Term: 5 │
└─────────┘                   └─────────┘                  └─────────┘
     │                             │                             │
     │                             ▼                             │
     │                    ┌─────────────┐                        │
     └────────────────────│  Votes for  │────────────────────────┘
                          │  Node A     │
                          │ (First Req) │
                          └─────────────┘

Step 3 - Majority Achievement

Election Victory:
┌─────────┐                ┌─────────┐                ┌─────────┐
│ Leader  │                │Follower │                │Follower │
│ Node A  │                │ Node D  │                │ Node C  │
│ Term: 5 │                │         │                │         │
└─────────┘                └─────────┘                └─────────┘
     │                          ▲                          ▲
     │      Heartbeat           │                          │
     └──────────────────────────┼──────────────────────────┘
                                │
                         ┌─────────────┐
                         │   Node E    │
                         │  Follower   │
                         └─────────────┘

Raft Election Timing and Randomization

A critical aspect of Raft is the use of randomized election timeouts to prevent split votes:

# Pseudocode for Raft Election Timeout
import random
import time

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = "FOLLOWER"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = self.generate_election_timeout()
        self.last_heartbeat = time.time()
    
    def generate_election_timeout(self):
        # Randomized timeout between 150-300ms
        return random.uniform(0.15, 0.3)
    
    def check_election_timeout(self):
        if time.time() - self.last_heartbeat > self.election_timeout:
            self.start_election()
    
    def start_election(self):
        self.state = "CANDIDATE"
        self.current_term += 1
        self.voted_for = self.node_id
        self.reset_election_timeout()
        
        # Send vote requests to all other nodes
        votes_received = 1  # Vote for self
        for node in other_nodes:
            if node.request_vote(self.current_term, self.node_id):
                votes_received += 1
        
        # Check if majority achieved
        if votes_received > len(all_nodes) // 2:
            self.become_leader()
        else:
            self.become_follower()

Advantages and Disadvantages

Advantages:

Fault Tolerance: Can handle up to (N-1)/2 node failures
Majority Consensus: Prevents split-brain scenarios
Randomized Timeouts: Reduces election conflicts
Strong Consistency: Guarantees linearizable operations
Proven Algorithm: Extensively tested and formally verified

Disadvantages:

Network Dependent: Requires a reliable network for optimal performance
Complexity: More complex than simple algorithms like Bully
Potential Delays: The Election process can take multiple rounds
Resource Overhead: Requires maintaining the election state and heartbeats

2.3 Ring-Based Election Algorithm

The Ring-Based Election Algorithm organizes nodes in a logical ring topology where each node knows its successor. This approach reduces message complexity by having election messages travel around the ring rather than broadcasting to all nodes.

How Ring-Based Election Works

The algorithm passes an election token around the ring, collecting node IDs and ultimately determining the highest-ID node as the leader:

Ring Topology Setup:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │────>│ Node 2  │────>│ Node 3  │
│ ID: 101 │     │ ID: 205 │     │ ID: 150 │
└─────────┘     └─────────┘     └─────────┘
     ▲                               │
     │           ┌─────────┐         │
     └───────────│ Node 5  │<────────┘
                 │ ID: 175 │
                 └─────────┘
                      ▲
                      │
                 ┌─────────┐
                 │ Node 4  │
                 │ ID: 300 │
                 └─────────┘

Ring Election Process

Step 1 - Election Initiation

Node 2 detects leader failure and starts election:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │────>│ Node 3  │
│ ID: 101 │     │ID: 205  │     │ ID: 150 │
│         │     │INITIATOR│     │         │
└─────────┘     └─────────┘     └─────────┘
                      │               │
              Election Token:         │
              [205]                   │
                                      ▼
Election Token: [205] ─────────────────────┐
                                           │
     ▲           ┌─────────┐         ┌─────▼───┐
     │           │ Node 5  │         │ Node 4  │
     │           │ ID: 175 │         │ ID: 300 │
     │           └─────────┘         └─────────┘
     │                ▲                   │
     │                │                   │
     └────────────────┘                   │
                                          │
Election Token: [205, 150, 300, 175] <────┘

Step 2 - Token Circulation

Token travels around ring collecting IDs:
Round 1: Node 2 → Node 3
Token: [205] → [205, 150]

Round 2: Node 3 → Node 4  
Token: [205, 150] → [205, 150, 300]

Round 3: Node 4 → Node 5
Token: [205, 150, 300] → [205, 150, 300, 175]

Round 4: Node 5 → Node 1
Token: [205, 150, 300, 175] → [205, 150, 300, 175, 101]

Round 5: Node 1 → Node 2 (Complete Circle)
Token: [205, 150, 300, 175, 101] (Back to initiator)

Step 3 - Leader Determination

Node 2 receives token back and determines highest ID:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
│ ID: 101 │     │ ID: 205 │     │ ID: 150 │
│         │     │ANALYZER │     │         │
└─────────┘     └─────────┘     └─────────┘
                      │
              Max ID: 300 (Node 4)
              
Leader Announcement: Node 4 is Leader

Step 4 - Leader Announcement

Leader announcement propagates around ring:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │<────│ Node 2  │────>│ Node 3  │
│ ID: 101 │     │ ID: 205 │     │ ID: 150 │
│         │     │         │     │         │
└─────────┘     └─────────┘     └─────────┘
     ▲                               │
     │           ┌─────────┐         │
     │           │ Node 5  │<────────┘
     │           │ ID: 175 │
     │           └─────────┘
     │                ▲
     │           ┌─────────┐
     └───────────│ Node 4  │
                 │ ID: 300 │
                 │ LEADER  │
                 └─────────┘
Leader: Node 4 ───────────────────────>

Practical Implementation

class RingElectionNode:
    def __init__(self, node_id, successor):
        self.node_id = node_id
        self.successor = successor
        self.is_leader = False
        self.leader_id = None
        self.election_in_progress = False
    
    def start_election(self):
        if self.election_in_progress:
            return
        
        print(f"Node {self.node_id} starting election")
        self.election_in_progress = True
        
        # Send election message with own ID
        election_message = {
            'type': 'ELECTION',
            'candidates': [self.node_id],
            'initiator': self.node_id
        }
        
        self.send_to_successor(election_message)
    
    def handle_election_message(self, message):
        if message['type'] == 'ELECTION':
            if message['initiator'] == self.node_id:
                # Election completed - determine leader
                candidates = message['candidates']
                new_leader = max(candidates)
                
                # Announce new leader
                leader_message = {
                    'type': 'LEADER',
                    'leader_id': new_leader,
                    'initiator': self.node_id
                }
                
                self.send_to_successor(leader_message)
            else:
                # Add own ID and forward
                message['candidates'].append(self.node_id)
                self.send_to_successor(message)
        
        elif message['type'] == 'LEADER':
            if message['initiator'] == self.node_id:
                # Leader announcement completed
                self.election_in_progress = False
            else:
                # Update leader and forward
                self.leader_id = message['leader_id']
                self.is_leader = (self.node_id == self.leader_id)
                self.send_to_successor(message)
    
    def send_to_successor(self, message):
        # Send message to next node in ring
        self.successor.handle_election_message(message)

Advantages and Disadvantages

Advantages:

Reduced Network Traffic: O(n) messages instead of O(n²)
Ordered Process: Systematic traversal ensures all nodes participate
Simple Logic: Straightforward token-passing mechanism
Guaranteed Termination: Ring structure ensures the election completes

Disadvantages:

Topology Dependency: Requires a ring structure, which may not be natural
Single Point of Failure: A Broken ring link can halt the election
Higher Latency: Must traverse the entire ring before completion
Network Partition Vulnerability: Ring breaks can prevent elections

3. Coordination Services: ZooKeeper and etcd

3.1 ZooKeeper for Leader Election

Apache ZooKeeper is a centralized coordination service that provides distributed synchronization through a hierarchical namespace of data nodes (znodes). It’s particularly well-suited for leader election due to its ephemeral sequential znodes feature.

ZooKeeper Architecture for Leader Election

ZooKeeper Namespace for Leader Election:
/election
├── leader-0000000001  (ephemeral sequential)
├── leader-0000000002  (ephemeral sequential)  
├── leader-0000000003  (ephemeral sequential)
└── leader-0000000004  (ephemeral sequential)

ZooKeeper Cluster:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ ZooKeeper   │    │ ZooKeeper   │    │ ZooKeeper   │
│   Server    │<═══│   Server    │═══>│   Server    │
│ (Follower)  │    │  (Leader)   │    │ (Follower)  │
└─────────────┘    └─────────────┘    └─────────────┘

Application Nodes:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   App       │    │   App       │    │   App       │
│  Node 1     │    │  Node 2     │    │  Node 3     │
│             │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │ ZooKeeper   │
                    │  Ensemble   │
                    └─────────────┘

ZooKeeper Leader Election Process

Step 1 - Node Registration

import kazoo
from kazoo.client import KazooClient
import time

class ZooKeeperLeaderElection:
    def __init__(self, zk_hosts, election_path, node_id):
        self.zk = KazooClient(hosts=zk_hosts)
        self.election_path = election_path
        self.node_id = node_id
        self.node_path = None
        self.is_leader = False
        
    def start(self):
        self.zk.start()
        
        # Ensure election path exists
        self.zk.ensure_path(self.election_path)
        
        # Create ephemeral sequential node
        self.node_path = self.zk.create(
            f"{self.election_path}/leader-",
            value=self.node_id.encode(),
            ephemeral=True,
            sequence=True
        )
        
        self.check_leadership()
    
    def check_leadership(self):
        children = self.zk.get_children(self.election_path)
        children.sort()  # Sort by sequence number
        
        # Extract sequence number from our path
        our_sequence = self.node_path.split('/')[-1]
        
        if children[0] == our_sequence:
            # We are the leader
            self.become_leader()
        else:
            # Watch the predecessor
            predecessor_index = children.index(our_sequence) - 1
            predecessor_path = f"{self.election_path}/{children[predecessor_index]}"
            
            # Set watch on predecessor
            self.zk.exists(predecessor_path, watch=self.predecessor_deleted)
    
    def predecessor_deleted(self, event):
        # Predecessor has left, check if we're now the leader
        self.check_leadership()
    
    def become_leader(self):
        self.is_leader = True
        print(f"Node {self.node_id} became leader")
        # Start leader responsibilities
        self.leader_duties()
    
    def leader_duties(self):
        # Implement leader-specific logic
        while self.is_leader:
            print(f"Leader {self.node_id} is working...")
            time.sleep(5)

Step 2 - Leadership Determination

ZooKeeper Election Sequence:
Initial State:
/election
├── leader-0000000001 (Node A)
├── leader-0000000002 (Node B)  
├── leader-0000000003 (Node C)
└── leader-0000000004 (Node D)

Node A (lowest sequence) = LEADER
Nodes B, C, D watch their predecessors

Leader Failure Scenario:
Node A fails → ephemeral node deleted
/election
├── leader-0000000002 (Node B) ← Now lowest
├── leader-0000000003 (Node C)
└── leader-0000000004 (Node D)

Node B becomes new leader
Node C watches Node B
Node D watches Node C

Advantages and Disadvantages of ZooKeeper

Advantages:

Automatic Failover: Ephemeral nodes provide instant failure detection
Ordered Elections: Sequential znodes ensure fair ordering
High Consistency: ZooKeeper’s consensus ensures all nodes see the same state
Battle-Tested: Used in production by many large-scale systems
Watch Mechanism: Efficient notifications for state changes

Disadvantages:

External Dependency: Requires a separate ZooKeeper cluster
Network Latency: ZooKeeper operations add network round-trips
Complexity: Requires understanding ZooKeeper concepts and maintenance
Single Point of Failure: The ZooKeeper cluster itself needs to be highly available

3.2 etcd for Leader Election

etcd is a distributed key-value store that uses the Raft consensus algorithm internally. It provides leader election capabilities through its lease mechanism and atomic operations.

etcd Leader Election Architecture

etcd Cluster with Raft:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   etcd      │    │   etcd      │    │   etcd      │
│  Server     │<═══│  Server     │═══>│  Server     │
│ (Follower)  │    │ (Leader)    │    │ (Follower)  │
└─────────────┘    └─────────────┘    └─────────────┘

Application Leader Election:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   App       │    │   App       │    │   App       │
│  Node 1     │    │  Node 2     │    │  Node 3     │
│ (Candidate) │    │ (Candidate) │    │ (Candidate) │
└─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    etcd     │
                    │   Cluster   │
                    └─────────────┘

← Distributed Transactions Data Replication and Consistency →

Track your progress

Mark this subtopic as completed when you finish reading.