Zum Inhalt springen

Distributed Systems: Consensus, CAP, and the Hard Problems

Zusammenfassung

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” Leslie Lamport’s wry definition captures the essential character of the field: distributed systems are not just programs spread across many machines, they are programs that must keep working while individual machines fail, messages vanish, and clocks disagree about what time it is. Every large-scale service — Google, Amazon, your bank, the system that just processed this page — is a distributed system, and behind their apparent solidity lies a body of theory built on impossibility results, hard-won trade-offs, and consensus algorithms that took the field thirty years to make understandable.

Why Distribute at All

A single computer has limits: finite CPU, memory, and storage, and a single point of failure — when it dies, the service dies. A distributed system spreads computation and data across multiple networked machines to overcome these limits, pursuing three goals:

  • Scalability — handle more load than any single machine could, by adding more machines (horizontal scaling) rather than buying a bigger one (vertical scaling).
  • Fault tolerance — keep running when individual machines fail, by replicating data and work across many nodes so no single failure takes down the whole.
  • Low latency — place data and computation physically near users scattered across the globe.

The catch is that distribution does not merely add complexity — it introduces fundamentally new kinds of problem that simply do not exist on a single machine. These problems are not engineering inconveniences to be polished away; several are mathematically proven to be unsolvable in general.

The Eight Fallacies

In the 1990s, engineers at Sun Microsystems (Peter Deutsch and others) cataloged the Eight Fallacies of Distributed Computing — false assumptions that newcomers reliably make and that reliably cause systems to fail:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Every one is false, and every distributed system that ignores them eventually breaks in production. The fallacies endure as the field’s first lesson: the network is not an extension of your computer; it is a hostile, unpredictable medium that loses, delays, duplicates, and reorders your messages.

The Deepest Problem: Agreement Despite Failure

The central technical problem of distributed systems is consensus: getting multiple machines to agree on a single value or a single ordering of events, even when some machines crash and messages are delayed or lost. This sounds trivial and is, in fact, one of the hardest problems in computer science.

It is hard because of a fundamental limit: in an asynchronous network, a machine that has stopped responding is indistinguishable from one that is merely slow. You cannot tell a dead node from a slow one, so you can never safely conclude “that node is gone, let’s proceed without it” — it might come back and disagree.

This intuition was made precise by the FLP impossibility result (Fischer, Lynch, and Paterson, 1985), one of the most celebrated theorems in the field: in an asynchronous system where even a single node may crash, there is no algorithm that can guarantee consensus in bounded time. Perfect consensus with perfect fault tolerance is provably impossible. Real systems get around FLP not by defeating it but by relaxing its assumptions — using timeouts, randomness, or weaker guarantees — to achieve consensus that works in practice even though it cannot be guaranteed in theory.

CAP: The Trade-off You Cannot Escape

The most famous result in distributed systems is the CAP theorem, conjectured by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002. It states that a distributed data store can provide at most two of these three guarantees simultaneously:

  • Consistency — every read sees the most recent write; all nodes agree on the current data.
  • Availability — every request gets a (non-error) response.
  • Partition tolerance — the system keeps working even when the network splits and nodes cannot all reach each other.

The crucial insight is that network partitions are not optional — in any real distributed system, the network will split sometimes, so partition tolerance is mandatory. That reduces CAP to a stark choice during a partition: when nodes cannot communicate, you must pick. Either refuse to answer until you can guarantee correctness (choose consistency, sacrifice availability — a CP system), or answer with possibly-stale data to stay responsive (choose availability, sacrifice consistency — an AP system).

CAP reshaped how databases were designed and marketed. Traditional relational databases (see The Database Revolution) chose consistency. The NoSQL movement of the late 2000s — Amazon’s Dynamo, Cassandra, Riak — deliberately chose availability with eventual consistency, accepting that replicas might briefly disagree but would converge. Amazon’s 2007 Dynamo paper was the manifesto for this approach, born from the practical reality that for a shopping cart, staying available (never refusing an “add to cart”) mattered more than perfect consistency. CAP is widely criticized today as an oversimplification (the later PACELC formulation adds the latency-vs-consistency trade-off that holds even without partitions), but it remains the field’s most influential mental model.

Making Consensus Practical: Paxos and Raft

Despite FLP, practical consensus algorithms exist — they sidestep the impossibility by using timeouts and assuming the network is usually well-behaved.

Leslie Lamport is the towering figure here. His Paxos algorithm (published 1998, conceived earlier, in a famously whimsical paper framed as the legislative procedure of an ancient Greek parliament) became the canonical consensus protocol — provably correct, and the foundation of Google’s Chubby lock service and Spanner. Paxos was also notoriously, almost legendarily, hard to understand; Lamport’s own papers and a generation of frustrated engineers attested to it. Lamport won the 2013 Turing Award for his foundational work, which also included logical clocks (1978) — a way to order events in a distributed system without synchronized physical clocks, by reasoning about causality (the “happens-before” relation) rather than wall-clock time.

Because Paxos was so impenetrable, Diego Ongaro and John Ousterhout designed Raft (2014) explicitly for understandability, decomposing consensus into clear sub-problems: leader election, log replication, and safety. Raft achieves the same guarantees as Paxos but can actually be taught and implemented correctly, and it rapidly became the consensus engine inside systems like etcd (the heart of Kubernetes), Consul, and CockroachDB. The arrival of Raft is a quiet milestone: it marks the point where distributed consensus became something an ordinary engineer could implement without a doctorate.

Time, Order, and Replication

Because there is no global clock and no instant communication, ordering events across machines is a deep problem. Lamport’s logical clocks and vector clocks let systems establish causal ordering. Google’s Spanner (2012) took a radical hardware approach: it equipped data centers with GPS receivers and atomic clocks (the TrueTime API) to bound clock uncertainty tightly enough to offer globally consistent transactions across continents — buying its way past a theoretical limit with better physics.

Replication — keeping copies of data on multiple nodes — is how distributed systems achieve both fault tolerance and low latency, and it is where consistency models bite. The spectrum runs from strong consistency (every replica always agrees, slow and partition-intolerant) through many intermediate models to eventual consistency (replicas converge over time, fast and available). Choosing where to sit on this spectrum is the central design decision of any distributed data system.

The Building Blocks of Modern Scale

The theory underpins the infrastructure that runs the modern internet. MapReduce and the systems of the Big Data revolution distributed computation across thousands of commodity machines. Sharding partitions data across nodes; consistent hashing (from the Dynamo lineage) distributes it while minimizing reshuffling when nodes join or leave. Microservices (see The Microservices Revolution) decompose applications into independently deployed distributed components. Container orchestration with Kubernetes — itself coordinated by a Raft-based store — schedules distributed workloads. The cloud is, in the end, distributed-systems theory rented by the hour. Even blockchain is a distributed-systems story: a consensus protocol (Nakamoto consensus) designed to work among mutually distrustful nodes, trading efficiency for the removal of any central authority.

Dead End: The Dream of Distribution Transparency

The most persistent dead end in distributed computing was the belief that distribution could be made invisible — that programmers could write code as if calling a function on a remote machine were the same as calling one locally, and the system would hide the network entirely. This was the promise of RPC (Remote Procedure Call) in the 1980s and, most ambitiously, of frameworks like CORBA and Java RMI in the 1990s: a remote object would look exactly like a local one.

It was a mirage, and the Eight Fallacies are its tombstone. A local call cannot fail because the network partitioned; a remote one can. A local call takes nanoseconds; a remote one takes milliseconds and may never return. Pretending otherwise produced systems that were slow, fragile, and impossible to reason about, because the failure modes the abstraction hid were exactly the ones that mattered. The influential 1994 paper “A Note on Distributed Computing” (Waldo et al. at Sun) demolished the premise: the hard differences between local and remote — latency, partial failure, concurrency, memory access — cannot be papered over, and any system that tries will be defeated by the realities it ignored. The field’s maturation came from accepting the opposite principle: distribution must be made explicit, its failures surfaced and handled, not hidden. You cannot abstract away the network. You can only learn to live with it.

📚 Sources