Zum Inhalt springen

Data Structures: The Building Blocks of Computation

Zusammenfassung

“Algorithms + Data Structures = Programs,” Niklaus Wirth titled his 1976 textbook, and the equation has held up better than almost any other claim in computer science. A data structure is a way of organizing information in memory so that it can be used efficiently — and the choice of structure, far more than the cleverness of the code around it, determines whether a program is fast or slow, scalable or doomed. The array, the linked list, the hash table, the tree, the graph: these are the nouns of programming, and the history of computing is in large part the history of finding the right arrangement of bits for each problem.

The Central Idea: Organization Determines Cost

A computer’s memory is, at bottom, a vast array of numbered cells. Everything else is a convention layered on top. A data structure is such a convention — an agreement about how to arrange data in those cells and how to navigate it — chosen to make the operations a program performs most often as cheap as possible.

The key insight, formalized by the field of algorithm analysis, is that the same task can have wildly different costs depending on the structure used. Finding an item in an unsorted list of a million elements might take a million steps; in a hash table, roughly one; in a balanced tree, about twenty. These differences are described with Big-O notation, which expresses how an operation’s cost grows as the data grows: O(1) (constant), O(log n) (logarithmic), O(n) (linear), O(n log n), O(n²), and worse. Choosing a data structure is choosing your Big-O — and at scale, the difference between O(n) and O(log n) is the difference between a program that works and one that doesn’t.

Every data structure embodies trade-offs. There is no universally best one. A structure that makes lookups fast may make insertions slow; one that saves memory may cost speed. The art is matching the structure to the access pattern.

The Primitives: Arrays and Linked Lists

Two structures sit at the foundation, and almost everything else is built from them.

The array is a block of contiguous memory holding elements of the same type. Its superpower is random access: because every element is the same size and laid out end to end, the address of element i is a simple arithmetic calculation — O(1) to reach any element. Its weakness is rigidity: inserting or removing an element in the middle requires shifting everything after it, and a fixed-size array cannot grow. Arrays map directly onto how hardware memory actually works, which makes them not only simple but cache-friendly — a property that matters enormously on modern processors, where reading from memory the CPU has already cached is vastly faster than fetching fresh from RAM.

The linked list takes the opposite trade. Each element (a node) holds its data plus a pointer to the next node, so the list can live scattered across memory. Inserting or removing an element is O(1) if you already hold the spot — just rewire two pointers, no shifting. But there is no random access: reaching element i means walking the chain from the start, O(n), and the scattered layout is hostile to the CPU cache. Linked lists were central to LISP, whose very name is “LISt Processor” and whose cons-cell lists drove the invention of garbage collection (see Garbage Collection).

Stacks, Queues, and Discipline

Some structures are defined not by their layout but by the discipline they impose on access.

A stack is LIFO — last in, first out. You push items on top and pop them off the top. Stacks are everywhere in computing’s plumbing: the call stack that tracks function calls and returns is a stack, which is why a runaway recursion produces a “stack overflow.” Expression evaluation, undo systems, and depth-first search all run on stacks.

A queue is FIFO — first in, first out, like a line at a counter. Queues model anything processed in arrival order: scheduler run queues (see Operating System Concepts), print spoolers, message buffers, and the breadth-first traversal of graphs.

Both are typically built on top of arrays or linked lists; their importance is conceptual — they capture a pattern of use, not a memory layout.

The Hash Table: Average-Case Magic

The hash table (or hash map, dictionary, associative array) is arguably the most consequential data structure in everyday programming. It stores key–value pairs and offers O(1) average-time insertion, deletion, and lookup — find any value by its key in essentially constant time, regardless of how many items the table holds.

It works by feeding the key through a hash function that maps it to an index in an underlying array. Good hash functions scatter keys evenly; collisions (two keys mapping to the same slot) are handled by chaining (a small list per slot) or open addressing (probing for the next free slot). The catch is the qualifier average: a poorly designed hash function or an adversarial set of inputs can degrade a hash table to O(n), a fact that has been weaponized in denial-of-service attacks. The hash table is the engine behind language built-ins like Python’s dict, JavaScript’s objects, and the in-memory indexes of countless databases.

Trees: Hierarchy and Logarithmic Search

A tree organizes data hierarchically: a root node with children, each of which may have children, and so on. Trees turn up wherever data is naturally nested — file systems (see File Systems), the abstract syntax trees compilers build (see The Compiler), the DOM of a web page, organizational charts.

The most important variant is the binary search tree (BST), where each node has at most two children and the left subtree holds smaller keys, the right subtree larger. This ordering allows binary search through the structure — O(log n) lookup — provided the tree stays balanced. An unbalanced BST can degenerate into a glorified linked list, O(n), which is why self-balancing trees were invented:

  • AVL trees (Adelson-Velsky and Landis, 1962) — the first self-balancing BST, kept rigidly balanced for fast lookups.
  • Red–black trees — slightly looser balancing with cheaper updates; the workhorse behind many standard-library ordered maps (C++ std::map, Java TreeMap).
  • B-trees (Bayer and McCreight, 1970) — wide, shallow trees designed for disk and database storage, where the cost of a disk seek dwarfs the cost of comparing keys. The B-tree and its B+-tree variant are the structure underneath nearly every relational database index (see The Database Revolution) and many file systems.

Other specialized trees — the heap (for priority queues, always giving you the smallest or largest element), the trie (for prefix-based string lookup, powering autocomplete), and spatial trees like the quadtree and k-d tree — each tune the basic idea for a particular access pattern.

Graphs: Modeling Relationships

A graph is the most general structure: a set of nodes (vertices) connected by edges. Where a tree is a strict hierarchy, a graph allows arbitrary connections — cycles, many-to-many links, the works. Graphs model the things hierarchies cannot: social networks, road maps, the web’s hyperlink structure, dependency relationships, and computer networks themselves (see Distributed Systems).

Graphs are stored as adjacency lists (each node keeps a list of its neighbors — memory-efficient for sparse graphs) or adjacency matrices (a grid of which nodes connect — fast lookup, memory-hungry). The famous graph algorithms — Dijkstra’s shortest path, breadth-first and depth-first search, PageRank — operate on these representations, and the choice of representation shapes their performance.

The Theoretical Backbone

Data structures are inseparable from the analysis of algorithms, the discipline of reasoning rigorously about how cost scales. Donald Knuth’s multi-volume The Art of Computer Programming (from 1968) is the field’s monumental reference, cataloging structures and their analyses with mathematical precision. The combination of Knuth’s analytical rigor and the practical structures above turned programming from craft toward science: a programmer can now predict, before writing a line, whether a chosen structure will scale to the required size.

Dead End: The Self-Balancing Trade and the Cache Reckoning

For decades, theory ranked data structures purely by their Big-O complexity, and the elegant self-balancing trees reigned as the “correct” answer for ordered data. The modern reckoning has been humbling: Big-O counts operations but ignores that on real hardware, where data sits matters as much as how many steps an algorithm takes. The deep memory hierarchies of modern CPUs — registers, multiple cache levels, RAM, disk, each an order of magnitude slower than the last — mean that a cache-friendly array can crush a pointer-chasing tree or linked list that the textbook declares its equal or superior, because every pointer hop risks a cache miss costing hundreds of cycles. The result has been a quiet rehabilitation of flat, contiguous structures and “cache-oblivious” and “cache-aware” data-structure research, and a recognition that the pristine asymptotic complexity that dominated teaching for half a century was always only half the story. The data structure has not changed; our understanding of what makes one “efficient” has.

📚 Sources