Edgar Codd and the Relational Model

Summary

Edgar F. Codd published twelve pages in 1970 that reorganized the entire database industry — an industry that did not yet exist in the form he was about to create. His paper, “A Relational Model of Data for Large Shared Data Banks,” gave databases a mathematical foundation: tables, keys, and a set of logical operations sufficient to answer any question about stored data. The company he worked for read the paper, built a prototype to verify it worked, published the results, and then refused to ship a product for another seven years — long enough for a startup in California to read those same IBM research papers and build the database empire that IBM should have owned. Codd received the Turing Award in 1981. Oracle’s Larry Ellison received billions of dollars. The relational model underpins virtually every transaction processed in the world economy. Codd died in 2003, largely unknown outside the field he created.

The Mathematician in the Fog of War

Edgar Frank Codd was born on August 19, 1923, in Fortuneswell on the Isle of Portland, Dorset, a rocky peninsula jutting into the English Channel, better known for the limestone that built St. Paul’s Cathedral than for producing computer scientists. His father was a leather manufacturer; his mother a schoolteacher. He was the youngest of seven children. The England of his childhood was one of economic depression and approaching war, and Codd learned early to extract certainty from uncertain circumstances, a habit of mind that would eventually produce his most important work.

He studied mathematics at Exeter College, Oxford, completing his BA in 1948. But the path to that degree had run through combat. Codd served in the Royal Air Force from 1942 to 1945 as a flight lieutenant and pilot, flying Coastal Command anti-submarine patrols over the Atlantic. The work was tedious, dangerous, and consequential: finding and sinking German U-boats required patience, precision, and the ability to act correctly on incomplete information. These were not irrelevant skills for a future database theorist.

After Oxford, Codd moved into the emerging field of computing. In 1949, he joined IBM in New York as a mathematical programmer, one of a tiny cohort of mathematicians who had found their way into the nascent computing industry. His first assignments included programming the Selective Sequence Electronic Calculator (SSEC) and, later, contributions to IBM’s work on autoprogramming systems. By the late 1950s, he had decided to pursue formal academic credentials in the new field. He completed a PhD in communication sciences at the University of Michigan in 1965, then relocated to IBM’s San Jose Research Laboratory in California.

San Jose was not IBM’s headquarters and not its most prestigious research division. It was, for that reason, a place where unconventional thinking was possible.

The Problem Nobody Had Named

In the late 1960s, storing and retrieving data on a computer meant navigating one of two inherited structures.

IBM’s Information Management System (IMS), built for NASA’s Apollo program to track Saturn V rocket parts, organized data as a tree: customer at the root, orders as children, line items beneath them. Navigating the tree was fast if you knew where you were going. Answering an unexpected question — “which customers in three states ordered a product that is now out of stock?” — required writing a custom program, by a programmer, that traversed the tree in a manner specific to that question. Every new business question was a programming project.

The network model (CODASYL, 1969) added more flexibility by allowing records to have multiple parent records, but it introduced a new kind of complexity: programs had to manage explicit pointer traversal through a data graph. The logic for accessing data was tangled throughout application code, which meant that any change to the data structure — adding an index, reorganizing files to improve performance — could silently break programs that used it.

Codd saw the common pathology: physical data dependence. The way data was stored on disk determined how it could be queried, and applications were written against the storage structure rather than against the data itself. This was, from a mathematical standpoint, backwards. The logical content of the data should be independent of its physical representation. Programs should be able to express what they wanted — not how to traverse a pointer chain to get it.
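
The cost of that dependence can be sketched in a few lines of Python. This is only an illustration of navigational access in general, not IMS or CODASYL code: the nested structure, the sample data, and both function names are hypothetical. The point is that each new question needs its own hand-written traversal of the storage layout.

```python
# Hypothetical hierarchy: customer -> orders -> line items.
# This mimics the shape of a tree-structured store, not any real IMS API.
customers = [
    {"name": "Ada", "state": "WA", "orders": [
        {"order_id": 10, "items": ["widget", "gear"]},
        {"order_id": 12, "items": ["sprocket"]},
    ]},
    {"name": "Grace", "state": "MA", "orders": [
        {"order_id": 11, "items": ["gear"]},
    ]},
    {"name": "Alan", "state": "OR", "orders": [
        {"order_id": 13, "items": ["widget"]},
    ]},
]


def customers_who_ordered(tree, product, states):
    """Walk root -> orders -> items, the only access path the hierarchy offers."""
    found = []
    for customer in tree:
        if customer["state"] not in states:
            continue
        for order in customer["orders"]:
            if product in order["items"]:
                found.append(customer["name"])
                break  # one matching order is enough for this customer
    return found


def states_by_product(tree):
    """A question that starts from the product needs its own full traversal."""
    index = {}
    for customer in tree:
        for order in customer["orders"]:
            for product in order["items"]:
                index.setdefault(product, set()).add(customer["state"])
    return index


widget_buyers = customers_who_ordered(customers, "widget", {"WA", "OR", "CA"})
gear_states = states_by_product(customers)["gear"]
print(widget_buyers, sorted(gear_states))
```

Both functions hard-code the nesting order. If the store were reorganized so that products sat at the root, each of them would have to be rewritten, which is the fragility described above.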

He spent several years developing a theoretical framework that separated the two. The result was published in Communications of the ACM in June 1970.

“A Relational Model of Data for Large Shared Data Banks” (1970)

The paper ran to twelve pages. Its argument was built on mathematics that Codd’s audience largely did not know — set theory and predicate logic — applied to a problem they knew very well. This combination made it simultaneously rigorous and difficult to dismiss.

Codd proposed organizing data as relations: tables of rows and columns, each row representing a single fact, each column a single attribute. A CUSTOMERS table had one row per customer; an ORDERS table had one row per order. The relationship between a customer and their orders was represented not by a pointer or a tree path but by a shared value — a foreign key — that appeared in both tables.

CUSTOMERS: (customer_id, name, city)
ORDERS:    (order_id, customer_id, date, total)

To find all orders for a customer in Seattle, you did not traverse a pointer chain. You asked a logical question:

SELECT all rows from ORDERS where customer_id matches
a customer_id from CUSTOMERS where city = 'Seattle'

This was not a program. It was a predicate — a logical statement about the data — and the system was responsible for figuring out the most efficient way to execute it. The programmer stated what they wanted; the database decided how to get it.
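
In a modern system the same question can be written almost word for word. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in for a relational engine; SQL did not exist in 1970, and the sample rows are invented for illustration. Only the two table layouts come from the example above.

```python
import sqlite3

# In-memory database; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        date        TEXT,
        total       REAL
    );
    INSERT INTO customers VALUES
        (1, 'Ada', 'Seattle'), (2, 'Grace', 'Boston'), (3, 'Alan', 'Seattle');
    INSERT INTO orders VALUES
        (10, 1, '1970-06-01', 25.0),
        (11, 2, '1970-06-02', 40.0),
        (12, 3, '1970-06-03', 15.5);
""")

# The query states what is wanted; the engine chooses how to find it.
seattle_orders = conn.execute("""
    SELECT order_id, customer_id, date, total
    FROM orders
    WHERE customer_id IN (
        SELECT customer_id FROM customers WHERE city = 'Seattle'
    )
    ORDER BY order_id
""").fetchall()

print(seattle_orders)
```

Nothing in the query names an index, a file, or a pointer. The link between the two tables is carried entirely by the shared customer_id value.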

Codd formalized the operations on relations as relational algebra:

Select: filter rows that satisfy a condition
Project: choose a subset of columns
Join: combine two tables on a shared key value
Union, intersection, difference: set operations on relations

In a 1972 follow-up paper he showed that these operations were relationally complete: sufficient to express any query that could be written in his relational calculus, a query notation based on first-order predicate logic. This completeness result mattered: it meant the model was not a clever design with hidden gaps. It was mathematically sufficient.
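
The operators listed above are small enough to sketch directly. The version below is a teaching sketch, not Codd's formal definition: it assumes relations are represented as lists of Python dicts, and the helper names (dedupe, select, project, join) and the sample data are hypothetical. Each operator returns a relation, so results can be fed straight into further operators.

```python
def dedupe(rows):
    """Relations are sets, so drop duplicate rows (keeping first-seen order)."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def select(relation, predicate):
    """Select: keep the rows that satisfy a condition."""
    return dedupe(row for row in relation if predicate(row))


def project(relation, columns):
    """Project: keep a subset of columns, then remove duplicate rows."""
    return dedupe({c: row[c] for c in columns} for row in relation)


def join(left, right, key):
    """Join: combine rows from two relations that agree on a shared key."""
    return dedupe({**l, **r} for l in left for r in right if l[key] == r[key])


customers = [
    {"customer_id": 1, "name": "Ada", "city": "Seattle"},
    {"customer_id": 2, "name": "Grace", "city": "Boston"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 2, "total": 40.0},
    {"order_id": 12, "customer_id": 1, "total": 15.5},
]

# Compose the operators: orders placed by Seattle customers.
seattle = select(customers, lambda row: row["city"] == "Seattle")
result = project(join(seattle, orders, "customer_id"), ["order_id", "total"])
print(result)
```

Because every operator takes relations and returns a relation, queries are built by composition. That closure property is what lets an optimizer rearrange a query without changing its meaning.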

Physical vs. Logical Independence

Codd distinguished two kinds of data independence that the relational model provided:

Physical independence: the storage engine can be reorganized — indexes added, files restructured, storage formats changed — without any change to queries or applications.
Logical independence: the schema can be extended — new tables added, new columns introduced — without breaking existing queries.
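
Both properties can be observed in a small experiment. The sketch below uses Python's sqlite3 module as a stand-in for any relational engine; the table contents, the index name, and the added email column and suppliers table are hypothetical. The same query text is run three times: before any change, after a physical change, and after a logical one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada', 'Seattle'), (2, 'Grace', 'Boston');
""")

# The application's query never changes during the experiment.
QUERY = "SELECT name FROM customers WHERE city = 'Seattle' ORDER BY name"
before = conn.execute(QUERY).fetchall()

# Physical change: add an index. Storage and access paths change, results do not.
conn.execute("CREATE INDEX idx_customers_city ON customers(city)")
after_index = conn.execute(QUERY).fetchall()

# Logical change: extend the schema with a new column and a new table.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
conn.execute("CREATE TABLE suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT)")
after_schema = conn.execute(QUERY).fetchall()

print(before == after_index == after_schema)
```

The query survives because it names the columns it needs. A query written as SELECT * would start returning the extra column after the schema change, so the protection depends partly on how applications are written.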

These properties are why relational databases became the default infrastructure for enterprise computing. Before the relational model, a schema change was a crisis: every application that touched the affected data needed modification. After it, applications could be written against a stable logical interface, insulated from physical reality. This absorption of change is the feature that made relational databases economically indispensable for the next fifty years.

Normalization: Eliminating Redundancy by Mathematics

The 1970 paper established the relational model. Over the following four years, Codd extended it with a second contribution of equal importance: normalization theory — a formal method for designing relational schemas that eliminated data redundancy and prevented certain classes of update anomalies.

The problem normalization addressed was subtle. Even with relations instead of hierarchies, a badly designed schema could still produce inconsistencies. If a single fact — say, a supplier’s address — appeared in multiple rows, updating it in one place and forgetting another produced contradictory data. Normalization prescribed a sequence of design transformations that eliminated these redundancies:

First Normal Form (1NF): every attribute contains atomic values (no repeating groups within a row). Defined in the 1970 paper.
Second Normal Form (2NF): every non-key attribute is fully functionally dependent on the entire primary key (not just part of it). 1971.
Third Normal Form (3NF): no non-key attribute is transitively dependent on the primary key through another non-key attribute. 1971.
Boyce-Codd Normal Form (BCNF): a stricter variant of 3NF, developed jointly with Raymond Boyce (one of the SQL designers) in 1974.

Normalization gave database designers a methodology, not just a model. It answered the question “how do I know if my schema is correct?” with mathematical criteria. Database design could be checked, not just argued about.
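
The criteria are stated in terms of functional dependencies, and a dependency can at least be tested against sample data. The sketch below assumes a hypothetical shipments table whose key is (supplier, part); the function name holds and all values are invented. Note the limit of such a test: a table instance can refute a dependency, but it cannot prove that the dependency holds for every future state of the data.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent holds in this set of rows."""
    seen = {}
    for row in rows:
        lhs = tuple(row[a] for a in determinant)
        rhs = tuple(row[a] for a in dependent)
        # The first row fixes the expected value; any later mismatch refutes it.
        if seen.setdefault(lhs, rhs) != rhs:
            return False
    return True


# Hypothetical table keyed on (supplier, part).
shipments = [
    {"supplier": "S1", "part": "P1", "qty": 100, "supplier_address": "12 Dock St"},
    {"supplier": "S1", "part": "P2", "qty": 250, "supplier_address": "12 Dock St"},
    {"supplier": "S2", "part": "P1", "qty": 75, "supplier_address": "9 Mill Rd"},
]

# The address depends on supplier alone, which is only part of the key (a 2NF violation).
partial_dependency = holds(shipments, ["supplier"], ["supplier_address"])
# The quantity does not: it needs the whole key.
qty_needs_full_key = not holds(shipments, ["supplier"], ["qty"])

# Update anomaly: change the address in one row and forget the other.
shipments[0]["supplier_address"] = "40 Quay Ave"
anomaly = not holds(shipments, ["supplier"], ["supplier_address"])

# Decomposition: store each supplier's address exactly once.
suppliers = {"S1": "40 Quay Ave", "S2": "9 Mill Rd"}
shipments_2nf = [
    {"supplier": r["supplier"], "part": r["part"], "qty": r["qty"]}
    for r in shipments
]
key_determines_qty = holds(shipments_2nf, ["supplier", "part"], ["qty"])

print(partial_dependency, qty_needs_full_key, anomaly, key_determines_qty)
```

After the decomposition there is no second copy of the address to forget, so the contradictory state produced by the partial update can no longer be represented.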

The Practical Tradeoff

Normalization reduces redundancy and prevents anomalies; it also requires more joins to reconstruct data that was physically separated. Fully normalized schemas are theoretically elegant and practically slow at scale. The history of database practice is largely the history of deciding how far to normalize — a negotiation between Codd’s mathematical ideal and the performance realities of spinning disk. Denormalization (intentionally violating normal form for performance) became a standard technique, and the NoSQL movement of the 2000s can be read as a systematic rejection of normalization for high-volume workloads. Codd’s theory remained the reference point even for those departing from it.

IBM and the Institutional Obstacle

Codd published his 1970 paper as an IBM employee, using IBM’s name and institutional affiliation. IBM recognized its significance: in 1973, the company launched System R, a multi-year research project at the San Jose laboratory to build a prototype relational database system and prove the model was practically viable at commercial scale.

System R succeeded. By the late 1970s, it demonstrated that relational queries over large datasets ran at commercially acceptable speeds. Two researchers on the project, Donald Chamberlin and Raymond Boyce, designed a query language — originally called SEQUEL (Structured English QUEry Language), later renamed SQL — that expressed relational algebra in syntax readable by non-programmers.

IBM published detailed technical papers about System R’s architecture, query optimization algorithms, and SQL design throughout the late 1970s. These papers were read widely in the research community — and, fatefully, by people outside it.

IBM, however, did not ship a product. The reason was institutional rather than technical. IBM’s Information Management System (IMS) — its hierarchical database — was the backbone of its most profitable customer relationships. Large banks, insurance companies, and government agencies had built their operations on IMS and paid IBM substantial ongoing fees for it. Shipping a relational database that was genuinely superior would accelerate the obsolescence of IMS, risking revenue from IBM’s largest accounts. The organization that had the most to gain technically had the most to lose commercially.

IBM’s DB2 shipped in 1983 — ten years after System R began, thirteen years after Codd’s paper. In the interim, the company had published its entire research roadmap in academic journals.

The Cost of Caution

IBM’s delay is one of the canonical examples of the innovator’s dilemma in computing history. The company that invented the relational model owned the research, the personnel, and the customer relationships necessary to dominate the database market. By choosing to protect existing revenue rather than cannibalize it, IBM ceded the market to competitors who read its own publications. Oracle, Sybase, Ingres, and Informix all built their products substantially on research IBM had done and published. When DB2 finally shipped, Oracle had four years of enterprise sales momentum and an established customer base. IBM never recovered the database market leadership its research had earned.

The 12 Rules: Enforcing the Standard

By the mid-1980s, the marketplace was filling with systems that claimed to be relational but deviated from Codd’s model in significant ways. SQL was becoming universal and was about to be standardized (ANSI SQL, 1986), but Codd had reservations about it: SQL allowed duplicate rows (violating the set-theoretic foundation of relational algebra), handled nulls inconsistently (so that its three-valued logic produced subtle query errors), and included non-relational features that he regarded as corruptions of the model.

In October 1985, Codd published a two-part article in Computerworld magazine (not an academic journal, but the trade publication read by database administrators and IT managers) that opened with the question “Is Your DBMS Really Relational?” He listed what became known as his twelve rules. There are in fact thirteen, numbered 0 through 12, with Rule 0 serving as the master rule that any system had to satisfy to claim the relational label.

The rules covered everything from fundamental requirements (data must be accessible through a relational mechanism, not through pointers or other physical navigation) to specific behaviors (the system must have a comprehensive data sublanguage, null values must be handled consistently, physical data independence must be fully enforced). By Codd’s own criteria, most commercial systems sold as relational databases failed to meet several of them. IBM’s own DB2 failed some. Oracle failed some.

The rules were a polemic as much as a standard — a public argument that the industry was selling the name of the relational model without delivering its substance. They were also largely ignored by the industry, which had its own benchmarks and customer priorities. But they established a reference standard that database researchers and theorists continued to use for decades, and they crystallized the gap between the formal model Codd had invented and the pragmatic systems that had appropriated its vocabulary.

The Turing Award and Later Years

The Turing Award — computing’s Nobel Prize — was awarded to Edgar Codd in 1981, eleven years after the paper that justified it. The citation read: “For his fundamental and continuing contributions to the theory and practice of database management systems, especially his definition of the relational model of data.”

By the time of the award, Codd was in his late fifties and in declining health. He suffered from progressive hearing loss that eventually made sustained academic participation difficult. He left IBM in 1984 after thirty-five years with the company, the last decade of which had been spent watching the industry build on his work while arguing about whether it was implementing it correctly.

After leaving IBM, he co-founded The Codd & Date Consulting Group with Chris Date, a database theorist who became his principal collaborator and intellectual heir. Codd had by then already proposed RM/T (Relational Model/Tasmania, named for the Australian state where he first presented it in 1979), an extended relational model that addressed aspects he felt the original model had left underspecified: time-varying data, complex objects, and semantic data modeling.

In 1990, Codd published “The Relational Model for Database Management: Version 2” — an attempt to state the full relational model as he believed it should be implemented, incorporating decades of refinements. The book was ambitious and theoretically rigorous; it was also published at a moment when the database industry had settled on SQL as its standard and was not interested in revising its foundations. The book was received respectfully and had limited practical impact.

Codd died on April 18, 2003, in Williams Island, Florida, at age 79. He was survived by his wife Sharon, whom he had married in 1978.

Dead End: The Model That Won Without Its Author

The paradox of Codd’s career is that his central contribution — the relational model — succeeded more completely than almost any other theoretical idea in the history of computing, while Codd himself spent his final decades arguing that the industry was implementing it incorrectly.

SQL, the language that became universal, was not the language Codd would have designed. It allows duplicate rows in query results (relations, by definition, have no duplicates). Its handling of null values, which represent missing or unknown information, rests on a three-valued logic (true/false/unknown) that SQL applies inconsistently; Codd accepted nulls in principle but regarded SQL’s treatment of them as a source of subtle, hard-to-detect query bugs. Its non-orthogonal design includes multiple ways to express the same query with potentially different results. Codd’s 12 Rules were a direct response to these deviations; the industry read them, acknowledged them, and continued with SQL.
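
Two of these deviations are easy to reproduce. The sketch below uses Python's sqlite3 module and an invented three-row table; other SQL engines are expected to behave the same way on these particular queries, but that is an assumption rather than something tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    INSERT INTO customers VALUES
        (1, 'Ada', 'Seattle'), (2, 'Alan', 'Seattle'), (3, 'Grace', NULL);
""")

# Deviation 1: a query result is a bag, not a set, unless DISTINCT is requested.
cities = conn.execute(
    "SELECT city FROM customers WHERE city = 'Seattle'"
).fetchall()
distinct_cities = conn.execute(
    "SELECT DISTINCT city FROM customers WHERE city = 'Seattle'"
).fetchall()

# Deviation 2: a NULL city makes both comparisons evaluate to unknown,
# so that row is counted by neither query.
in_seattle = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city = 'Seattle'"
).fetchone()[0]
not_in_seattle = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city <> 'Seattle'"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(len(cities), len(distinct_cities))
print(in_seattle + not_in_seattle, total)
```

The second pair of numbers is the kind of bug the text describes: a condition and its apparent negation do not add up to the whole table, and nothing in the query text warns the reader.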

The deeper irony is commercial. The relational model became the foundation of a database industry worth hundreds of billions of dollars — Oracle alone reached a market capitalization exceeding $300 billion — built largely on Codd’s 1970 paper and IBM’s subsequent System R research. IBM, which employed Codd, funded System R, and published the research, ended up as a secondary player in the market those investments should have dominated. Larry Ellison, who read IBM’s papers and moved faster, built the empire.

Codd received the Turing Award, the title of IBM Fellow, and academic recognition. The financial rewards of the industry he created went almost entirely to others.

What Codd did achieve was more durable than any one company: a mathematical foundation so solid that fifty years of commercial pragmatism, SQL deviations, and NoSQL challenges have not dislodged it. The relational model remains the default structure for transactional data in the global economy — banking, retail, healthcare, logistics, government. The specific systems have changed; the underlying logic has not. In the long run, the twelve pages won.