1. Consistency Models
In distributed systems, consistency models define the guarantees a system provides about the visibility and ordering of updates across multiple nodes. Choosing a consistency model fundamentally affects system performance, fault tolerance, and availability, creating trade-offs that architects must carefully consider.
1.1 Strong Consistency
Strong Consistency represents the most stringent consistency guarantee, ensuring that all nodes in the system have the latest data at any given time. Once a write operation completes, every subsequent read, served by any node, returns the most recent data.
How Strong Consistency Works
The mechanism behind strong consistency requires that every replica apply a write before the system acknowledges that write to the client. This synchronous approach ensures data accuracy but introduces latency.
Strong Consistency Flow:
┌─────────┐   Write    ┌─────────────┐   Update All    ┌─────────────┐
│ Client  │ ─────────> │   Primary   │ ──────────────> │  Replicas   │
└─────────┘            │    Node     │                 │    (All)    │
     ▲                 └─────────────┘                 └─────────────┘
     │                        │                               │
     │                        ▼                               ▼
     └──── Acknowledgment ──── Wait for All Updates ──────────┘
           (Only after all replicas are updated)
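To make the flow concrete, here is a minimal Python sketch of a synchronously replicated write path (the Replica and StronglyConsistentStore names are illustrative, not taken from any particular system). The acknowledgment is returned only after every replica has applied the update, which is also why a single unreachable replica blocks the write:

# Minimal sketch of synchronous (strong-consistency) replication.
# Names are illustrative; real systems add quorums, retries, and failover.
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True

    def apply(self, key, value):
        if not self.available:
            raise ConnectionError(f"replica {self.name} unreachable")
        self.data[key] = value

class StronglyConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Update every replica before acknowledging the client; if any
        # replica is unreachable, the write fails rather than succeeding partially.
        for replica in self.replicas:
            replica.apply(key, value)
        return "ACK"            # reached only after all replicas are updated

    def read(self, key):
        # Any replica can serve the read and still return the latest value.
        return self.replicas[0].data.get(key)

store = StronglyConsistentStore([Replica("r1"), Replica("r2"), Replica("r3")])
store.write("balance:A", 500)
print(store.read("balance:A"))  # 500, and every other replica agrees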
Banking System Example
Consider a banking system where a user transfers $100 from Account A to Account B. With strong consistency:
- The transfer request is received
- Account A’s balance is reduced by $100
- Account B’s balance is increased by $100
- All replica databases are updated with these changes
- Only then is the transaction confirmed to the user
- Any subsequent read from any node shows the updated balances
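The same idea reduced to the transfer scenario, as a hedged sketch with plain dictionaries standing in for the replica databases: both balance changes reach every replica before the confirmation is returned, so a read routed to any replica immediately afterwards sees the new balances:

# Transfer $100 from Account A to Account B under strong consistency.
replicas = [{"A": 500, "B": 200} for _ in range(3)]   # three replica databases

def transfer(amount, src, dst):
    # Apply both balance changes to every replica before confirming.
    for db in replicas:
        db[src] -= amount
        db[dst] += amount
    return "transfer confirmed"        # returned only after all replicas are updated

transfer(100, "A", "B")
# A subsequent read from any replica shows the updated balances.
assert all(db == {"A": 400, "B": 300} for db in replicas)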
Advantages and Trade-offs
Advantages:
- Data Accuracy: Guarantees that all nodes return the latest data, eliminating stale reads
- Simplified Logic: Application developers can reason about data state more easily, as they don’t need to handle inconsistent states
Disadvantages:
- Increased Latency: Updates must propagate to all nodes before completion, adding a significant delay
- Reduced Availability: During network partitions, writes may be blocked to prevent inconsistent states, directly impacting system availability
Strong consistency is ideal for applications where data accuracy is critical, such as financial transactions, account management systems, or inventory management, where overselling must be prevented.
1.2 Eventual Consistency
Eventual Consistency represents a weaker consistency model where the system guarantees that all nodes will eventually converge to the latest state, given enough time and no new updates, but does not ensure immediate consistency across all nodes.
How Eventual Consistency Works
After a write operation, replicas are updated asynchronously. Clients may receive stale data temporarily, but the system will eventually synchronize all nodes. This approach prioritizes availability and performance over immediate consistency.
Eventual Consistency Flow:
┌─────────┐   Write    ┌─────────────┐
│ Client  │ ─────────> │   Primary   │
└─────────┘            │    Node     │
     ▲                 └─────────────┘
     │                        │
     └────── Quick Ack ───────┘

Async Updates:
┌─────────────┐   Eventually    ┌─────────────┐
│   Primary   │ ──────────────> │  Replica 1  │
│    Node     │                 └─────────────┘
└─────────────┘                 ┌─────────────┐
       │ ─────────────────────> │  Replica 2  │
       │                        └─────────────┘
       │                        ┌─────────────┐
       └──────────────────────> │  Replica N  │
                                └─────────────┘
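A minimal sketch of this write path, with a background thread standing in for the real replication machinery (all names and the simulated lag are illustrative): the write is acknowledged as soon as the primary applies it, and a read against a replica before replication catches up can return stale data:

# Minimal sketch of asynchronous (eventually consistent) replication.
import queue
import threading
import time

primary = {}
replicas = [{}, {}]
replication_queue = queue.Queue()

def replicator():
    # Background worker: applies queued updates to the replicas with a delay.
    while True:
        key, value = replication_queue.get()
        time.sleep(0.1)                       # simulated replication lag
        for replica in replicas:
            replica[key] = value
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()

def write(key, value):
    primary[key] = value                      # apply on the primary
    replication_queue.put((key, value))       # propagate asynchronously
    return "ACK"                              # acknowledged before replicas update

write("greeting", "hello")
print(replicas[0].get("greeting"))            # likely None: replica still stale
replication_queue.join()                      # wait for replication to catch up
print(replicas[0].get("greeting"))            # "hello": replicas have converged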
Social Media Example
Consider a social media application where a user updates their profile picture:
- The user uploads a new profile picture
- The primary server acknowledges the update immediately
- The new picture is asynchronously replicated to other servers worldwide
- Some users may see the old picture for a short period (seconds to minutes)
- Eventually, all users globally will see the updated picture
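A hedged sketch of what different users observe during propagation, with hard-coded regions and timings chosen purely for illustration: reads served from a lagging replica keep returning the old picture until replication reaches that region:

# What users in different regions might see while the update propagates.
replicas = {
    "us-east":  {"picture": "old.jpg", "replicated_at": 0},   # updated first
    "eu-west":  {"picture": "old.jpg", "replicated_at": 3},   # catches up after 3s
    "ap-south": {"picture": "old.jpg", "replicated_at": 9},   # catches up after 9s
}

def propagate(now):
    # Replication reaches each region at its own time.
    for replica in replicas.values():
        if now >= replica["replicated_at"]:
            replica["picture"] = "new.jpg"

def read(region):
    return replicas[region]["picture"]

for now in (1, 5, 10):                        # seconds after the upload
    propagate(now)
    print(now, {region: read(region) for region in replicas})
# t=1 : only us-east serves new.jpg
# t=5 : eu-west has converged as well
# t=10: every region serves new.jpg (eventual convergence)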
Advantages and Trade-offs
Advantages:
- High Availability: Write operations can continue even during network partitions, ensuring the system remains responsive
- Low Latency: Asynchronous replication allows for much quicker response times to clients
Disadvantages:
- Temporary Inconsistencies: Users may experience stale data during the synchronization period
- Complex Application Logic: Applications may need additional logic to handle potential inconsistencies gracefully
Eventual consistency is particularly suitable for applications with high read demands and tolerance for temporary inconsistencies, such as social media feeds, content delivery networks, cache systems, or DNS services.
2. Distributed Consistency Protocols
Distributed consistency protocols are sophisticated algorithms used to achieve consensus across nodes in a distributed system, ensuring data consistency despite node failures or network partitions. These protocols form the backbone of many distributed databases and coordination services.
2.1 Paxos Protocol
Paxos is a foundational consensus algorithm that enables a distributed system to agree on a single value even if some nodes fail. Despite its mathematical rigor and proven correctness, Paxos is notoriously challenging to implement and understand due to its inherent complexity.
Paxos Roles and Architecture
Paxos involves three main roles that nodes can assume:
Paxos Architecture:
┌─────────────┐    Propose     ┌─────────────┐   Accept/Reject   ┌─────────────┐
│  Proposer   │ ─────────────> │  Acceptor   │ ────────────────> │   Learner   │
│ (Initiates  │                │  (Votes on  │                   │ (Learns the │
│  proposals) │                │  proposals) │                   │ final value)│
└─────────────┘                └─────────────┘                   └─────────────┘
- Proposer: Initiates proposals and seeks to get a value accepted by a majority of acceptors
- Acceptor: Receives proposals and decides whether to accept or reject them based on proposal numbers
- Learner: Learns the agreed-upon value once consensus is reached among acceptors
Paxos Protocol Phases
The Paxos algorithm operates in three distinct phases:
Phase 1 - Prepare:
Prepare Phase:
Proposer                          Acceptors (Quorum)
   │                                      │
   │────────── Prepare(n) ───────────────>│
   │                                      │
   │<───────── Promise(n, v_acc) ─────────│
   │                                      │
Where:
- n = proposal number
- v_acc = highest accepted value (if any)
The proposer sends a prepare request with a unique proposal number to a quorum of acceptors. Each acceptor responds with a promise not to accept any proposal with a lower number and returns the highest-numbered proposal it has accepted.
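A minimal sketch of the acceptor's side of Phase 1, assuming a single-decree Paxos instance; the attribute names (promised_n, accepted_n, accepted_v) are illustrative labels for the acceptor's persistent state:

# Acceptor side of the Prepare phase (single-decree Paxos, sketch only).
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised so far
        self.accepted_n = None     # number of the highest proposal accepted, if any
        self.accepted_v = None     # value of that proposal, if any

    def on_prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n    # promise to ignore proposals numbered below n
            # Report the highest-numbered accepted proposal (if any) so the
            # proposer can adopt its value in Phase 2.
            return ("PROMISE", n, self.accepted_n, self.accepted_v)
        return ("REJECT", self.promised_n)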
Phase 2 - Accept:
Accept Phase:
Proposer                          Acceptors (Quorum)
   │                                      │
   │────────── Accept(n, v) ─────────────>│
   │                                      │
   │<───────── Accepted(n, v) ────────────│
   │                                      │
Where:
- v = proposed value (based on Phase 1 responses)
Based on the promises received in Phase 1, the proposer sends an accept request containing the proposal number and a value. Each acceptor accepts it unless it has already promised a higher-numbered proposal.
Phase 3 - Commit: Once a quorum of acceptors agrees on a proposal, the value is committed, and all learners are notified of the consensus.
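Continuing that sketch for Phases 2 and 3 (again a hedged, single-decree illustration rather than a production implementation): the acceptor accepts a proposal only if it has not promised a higher number, and a learner treats a value as committed once a majority of acceptors report accepting the same proposal:

# Accept and commit phases (single-decree Paxos, sketch only).
class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted_n = None
        self.accepted_v = None

    def on_accept(self, n, v):
        # Accept unless a higher-numbered proposal has already been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("ACCEPTED", n, v)        # this message also reaches the learners
        return ("REJECT", self.promised_n)

def chosen_value(acceptors):
    # A learner considers a value committed once a majority of acceptors
    # report having accepted the same proposal.
    counts = {}
    for acceptor in acceptors:
        if acceptor.accepted_n is not None:
            key = (acceptor.accepted_n, acceptor.accepted_v)
            counts[key] = counts.get(key, 0) + 1
    for (n, v), count in counts.items():
        if count > len(acceptors) // 2:
            return v
    return None                              # no consensus yet

acceptors = [Acceptor() for _ in range(3)]
for acceptor in acceptors[:2]:               # a majority accepts proposal (1, "lock=A")
    acceptor.on_accept(1, "lock=A")
print(chosen_value(acceptors))               # "lock=A"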
Paxos in Practice
Google’s Chubby lock service exemplifies Paxos in action. Chubby coordinates distributed resources by achieving consensus on lock ownership, ensuring that only one client can hold a particular lock at any time, even in the presence of network failures or node crashes.
Paxos Trade-offs
Advantages:
- Fault Tolerance: Can tolerate failures of up to (n-1)/2 nodes while still reaching consensus
- Strong Consistency: Ensures that once a value is chosen, it cannot be changed
- Theoretical Foundation: Mathematically proven to be correct under specified assumptions
Disadvantages:
- Implementation Complexity: The multi-phase structure and various edge cases make implementation extremely challenging
- Performance Overhead: Multiple network round-trips can result in higher latency, especially in geographically distributed systems
- Liveness Issues: Under certain conditions, for example when two proposers repeatedly preempt each other with higher-numbered proposals, Paxos may fail to make progress
2.2 Raft Protocol
Raft was designed explicitly to be a more understandable alternative to Paxos while providing similar consistency guarantees. It achieves consensus through a more intuitive approach based on leader election and log replication.
Raft Architecture and Concepts
Raft decomposes the consensus problem into three key sub-problems, namely leader election, log replication, and safety, and organizes the cluster around a single leader:
Raft Cluster Structure:
┌─────────────┐   Log Entries   ┌─────────────┐
│   Leader    │ ──────────────> │  Follower   │
│  (Manages   │                 │ (Replicates │
│ all writes) │                 │ leader log) │
└─────────────┘                 └─────────────┘
   │       │
   │       └───── Log Entries ─────────┐
   ▼                                   ▼
┌─────────────┐                 ┌─────────────┐
│  Follower   │                 │  Follower   │
│ (Replicates │                 │ (Replicates │
│ leader log) │                 │ leader log) │
└─────────────┘                 └─────────────┘
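To anchor these concepts, the sketch below lists the per-node state Raft maintains; the field names follow the Raft paper, while the dataclass itself is only illustrative:

# Per-node Raft state; field names follow the Raft paper, sketch only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogEntry:
    term: int                      # term in which the leader created the entry
    command: str                   # state-machine command carried by the entry

@dataclass
class RaftNode:
    # Persistent state (survives restarts)
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[LogEntry] = field(default_factory=list)
    # Volatile state
    commit_index: int = 0          # highest log index known to be committed
    last_applied: int = 0          # highest log index applied to the state machine
    role: str = "follower"         # follower, candidate, or leader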
Raft Protocol Phases
Election Phase:
Leader Election Process:

Normal State:          Leader Failure:         New Election:
┌─────────┐            ┌─────────┐             ┌─────────┐
│ Leader  │            │ Leader  │ ──X         │Candidate│
└─────────┘            └─────────┘             └─────────┘
     │                      │                       │
     ▼                      ▼                       ▼
┌─────────┐            ┌─────────┐             ┌─────────┐
│Follower │            │Follower │ ──────────> │  Voter  │
└─────────┘            └─────────┘             └─────────┘
When a follower receives no heartbeat from the leader within its election timeout, it transitions to the candidate state and initiates an election by requesting votes from the other nodes.
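A simplified sketch of that election step, assuming a five-node cluster and omitting the log-freshness checks and randomized timeouts of full Raft: the candidate increments its term, votes for itself, and becomes leader only if a majority of the cluster grants it a vote:

# Simplified leader election step (log-freshness checks and randomized
# election timeouts of real Raft are omitted).
def request_vote(voter, candidate_id, candidate_term):
    # Grant at most one vote per term, and only for an equal or newer term.
    if candidate_term > voter["term"]:
        voter["term"] = candidate_term
        voter["voted_for"] = None
    if candidate_term == voter["term"] and voter["voted_for"] in (None, candidate_id):
        voter["voted_for"] = candidate_id
        return True
    return False

def start_election(candidate, peers, cluster_size):
    candidate["term"] += 1                       # begin a new term
    candidate["voted_for"] = candidate["id"]     # vote for itself
    votes = 1
    for peer in peers:
        if request_vote(peer, candidate["id"], candidate["term"]):
            votes += 1
    return "leader" if votes > cluster_size // 2 else "candidate"

nodes = [{"id": f"n{i}", "term": 1, "voted_for": None} for i in range(5)]
print(start_election(nodes[0], nodes[1:], cluster_size=len(nodes)))   # "leader"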
Replication Phase:
Log Replication:
Leader                              Followers
┌─────────────┐                    ┌─────────────┐
│Log: [1][2]  │ ──── Entry 3 ────> │Log: [1][2]  │
│     [3] NEW │                    │     [3] NEW │
└─────────────┘                    └─────────────┘
      │                                   │
      ▼                                   ▼
Wait for majority                      Send ACK
 acknowledgment                       when written
The leader appends entries to its log and sends them to followers. Entries are only considered committed when a majority of the cluster has acknowledged them.
Commit Phase: Once an entry is replicated to a majority of nodes, the leader commits it and notifies followers to apply it to their state machines.
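A minimal leader-side sketch of replication and commit, with the AppendEntries consistency checks of real Raft omitted and all names illustrative: the entry counts as committed only once a majority of the cluster, leader included, has stored it:

# Leader-side replication and commit (AppendEntries consistency checks omitted).
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.reachable = True

    def append_entries(self, entries):
        if not self.reachable:
            return False                   # no acknowledgment from this follower
        self.log.extend(entries)
        return True                        # ACK once the entries are written

def replicate(leader, followers, entry):
    leader.log.append(entry)               # append to the leader's own log first
    acks = 1                               # the leader's copy counts toward the majority
    for follower in followers:
        if follower.append_entries([entry]):
            acks += 1
    cluster_size = len(followers) + 1
    # Commit only when a majority of the cluster has stored the entry.
    return "committed" if acks > cluster_size // 2 else "pending"

leader = Node("leader")
followers = [Node("f1"), Node("f2")]
followers[1].reachable = False             # one follower is down
print(replicate(leader, followers, {"index": 3, "cmd": "set x = 1"}))   # "committed"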
Raft in Real Systems
Raft is widely adopted in production systems:
- etcd: Kubernetes’ distributed key-value store uses Raft for cluster coordination
- Consul: HashiCorp’s service discovery platform relies on Raft for consistent configuration storage
- CockroachDB: Uses Raft for replicating data across nodes in their distributed SQL database
Raft Trade-offs
Advantages:
- Simplicity: Much easier to understand and implement compared to Paxos
- Strong Consistency: Ensures strong consistency through log replication with majority quorum
- Clear Leadership: The single leader model simplifies client interaction and system reasoning
Disadvantages:
- Leader Bottleneck: All writes must go through the leader, potentially creating a performance bottleneck
- Election Overhead: Leader failures require election processes that can temporarily impact availability
- Communication Requirements: Log replication requires acknowledgment from a quorum, which can slow operations in high-latency networks