Concurrency and Parallelism: Threads, Locks, and the Hard Problem

Zusammenfassung

For fifty years, programmers got faster software for free: chips got faster, and programs ran faster without changing a line. Around 2005 that bargain ended. Processors stopped getting much faster and started getting more numerous — multiple cores on a chip — and the only way to use them was to make programs do many things at once. Concurrency, long a specialist concern of operating-system and database authors, became everyone’s problem. It is among the hardest problems in the field, because the human mind reasons about one thing at a time and computers now insist on doing several. This is the story of threads and locks, the bugs they breed, and the decades-long search for a way to think about doing more than one thing at once that humans can actually get right.

Concurrency Is Not Parallelism

The two words are constantly confused, and the distinction matters.

Concurrency is a way of structuring a program as multiple independent tasks that can make progress in overlapping time periods. It is about dealing with many things at once. A single-core computer running a web server handles thousands of connections concurrently by rapidly switching between them — none truly simultaneous, but all in progress.

Parallelism is the execution of multiple computations at literally the same instant, which requires multiple processing units. It is about doing many things at once.

Rob Pike’s formulation is the standard one: concurrency is about structure, parallelism is about execution. A program can be concurrent without being parallel (one core, many tasks interleaved), parallel without being concurrent (one task split across cores), both, or neither. Concurrency is a software design concern; parallelism is a hardware capability that good concurrent design lets you exploit.

Why It Became Unavoidable: The End of the Free Lunch

Until the mid-2000s, single-processor performance climbed exponentially, driven by rising clock speeds — a benefit of the integrated circuit revolution and the trends behind Moore’s Law. Programmers rode this curve passively: write ordinary sequential code, wait a year, watch it run faster.

Then physics intervened. Rising clock speeds meant rising power consumption and heat, and around 3–4 GHz the heat became unmanageable — the power wall. Chipmakers could still pack more transistors on a die, but they could no longer make a single core meaningfully faster. So they spent the transistors on more cores instead. Herb Sutter named the consequence in his 2005 essay “The Free Lunch Is Over”: from now on, software that wanted to get faster would have to be explicitly concurrent. A sequential program on an eight-core machine uses one-eighth of the hardware. Concurrency stopped being optional.

The Classical Model: Threads and Shared Memory

The dominant model inherited from operating-system design is threads sharing memory. A program spawns multiple threads of execution (see Operating System Concepts), all reading and writing the same memory. It is powerful, efficient, and a minefield.

The core danger is the race condition: when two threads access the same data and at least one writes, the outcome depends on the unpredictable timing of their interleaving. The classic example is two threads each incrementing a shared counter. Incrementing is really three steps — read the value, add one, write it back — and if the threads interleave those steps, both can read the same starting value and one increment is silently lost. The bug is non-deterministic: it may appear once in a million runs, never in testing, and constantly in production. Races are among the hardest bugs in all of software precisely because they refuse to reproduce on demand.

Locks and Their Discontents

The classical fix is mutual exclusion — ensuring only one thread at a time can touch shared data. The tools:

Mutexes (locks) — a thread must acquire the lock before touching shared data and release it after, forcing other threads to wait their turn.
Semaphores — Dijkstra’s 1965 invention, a counter controlling access to a pool of resources (see Edsger Dijkstra and Structured Programming).
Atomic operations — hardware-guaranteed indivisible operations (like an atomic increment) that sidestep the read-modify-write race entirely for simple cases.

But locks introduce their own family of failures:

Deadlock — two threads each hold a lock the other needs, and both wait forever. Dijkstra captured the essence in the Dining Philosophers problem (1965): five philosophers, five forks, each needing two forks to eat, can all grab one and starve together.
Livelock — threads keep responding to each other and changing state but make no actual progress.
Starvation — a thread never gets the resource it needs because others keep jumping the queue.
Priority inversion — a high-priority thread waits on a lock held by a low-priority one; this famously nearly doomed NASA’s 1997 Mars Pathfinder mission, where a low-priority task holding a mutex blocked a high-priority one, tripping a watchdog timer that kept resetting the spacecraft until engineers diagnosed the priority inversion and patched it from Earth by enabling priority inheritance.

The fundamental problem with locks is that they do not compose: two individually correct lock-using components can deadlock when combined, and there is no general way to reason about the combination by reasoning about the parts. This is what makes lock-based concurrency so treacherous.

The Search for Better Models

Because shared-memory threading is so error-prone, much of concurrency’s history is a search for models that are easier to reason about. Several rose to prominence:

Communicating Sequential Processes (CSP). Tony Hoare’s 1978 formalism (see Tony Hoare and the Science of Programming) proposed that concurrent processes should not share memory at all, but communicate by passing messages over channels. “Do not communicate by sharing memory; share memory by communicating.” This is the model behind Go’s goroutines and channels, and it eliminates whole classes of race by removing the shared state that causes them.

The Actor model. Carl Hewitt’s 1973 model takes message-passing further: the world is made of actors, each with private state, communicating only by sending asynchronous messages. No shared memory, no locks. This model powers Erlang, designed at Ericsson for telephone switches needing extreme reliability and concurrency (see The Rise of Functional Programming), and its descendant Elixir and the Akka framework.

Immutability and functional concurrency. If data never changes, there are no write races — many threads can read shared immutable data with no locks at all. This is a core argument for functional programming in the multicore era, and a vindication of John Backus’s 1977 critique of mutable state (see The Compiler).

Software Transactional Memory (STM). Borrowing the transaction concept from databases, STM lets a block of memory operations execute atomically, retrying automatically on conflict — composable concurrency without manual locks. Influential in Haskell and Clojure, though performance limits kept it niche.

Async/await. For I/O-bound concurrency (waiting on networks and disks rather than crunching numbers), the event loop with cooperative async/await syntax became dominant in JavaScript (Node.js), Python, C#, and Rust. A single thread juggles thousands of waiting tasks by switching whenever one blocks — concurrency without the perils of parallel shared-memory access.

The GIL: A Cautionary Compromise

Python’s Global Interpreter Lock (GIL) is concurrency history’s most-debated compromise. To make the CPython interpreter simple and safe, its designers allowed only one thread to execute Python bytecode at a time. This made single-threaded code fast and C extensions easy, but meant Python threads could not achieve true CPU parallelism — a growing embarrassment as cores multiplied. After decades of debate, work to make the GIL optional finally landed in Python 3.13 (2024), illustrating how a reasonable early concurrency decision can become a structural constraint that takes the better part of thirty years to unwind.

Dead End: The Dream of Automatic Parallelization

The most seductive idea in concurrency’s history was that programmers would never have to deal with any of this. The compiler, or the runtime, would take ordinary sequential code and automatically parallelize it across all available cores — the free lunch restored by sufficiently clever tools. Enormous research effort across the 1980s and 1990s, much of it on parallelizing FORTRAN for supercomputers, chased this goal. It largely failed for general-purpose code. Automatic parallelization works well only for regular, predictable patterns — dense numerical loops over arrays — and these are exactly the cases now handled by GPUs (see The GPU Revolution), vector instructions, and specialized libraries. For ordinary programs full of branches, pointers, and unpredictable data dependencies, the compiler cannot find safe parallelism without understanding the programmer’s intent, and that intent is not in the code. The dream died on a hard truth: parallelism is a property of the problem, not just the code, and no amount of compiler cleverness can extract concurrency that the algorithm does not contain. Half a century on, making programs do many things at once correctly remains stubbornly, irreducibly a human problem.