
Famous Software Disasters

Summary

Software fails differently from bridges. A bridge that collapses was overloaded or badly built; a program that kills people usually did exactly what it was written to do. The canonical software disasters — the Therac-25 radiation overdoses, the Ariane 5 explosion, the Patriot missile timing drift, the Mars Climate Orbiter unit mix-up, Knight Capital’s $440 million in 45 minutes, and the UK Post Office Horizon scandal — are studied not because they are freak events but because each one exposes a systemic failure mode that still exists: reused code in a new context, removed safety interlocks, accumulated rounding error, mismatched assumptions between teams, untested deployment procedures, and institutions that trusted computer output over human testimony.

Therac-25: When Software Replaced the Interlock

The Therac-25 was a radiation therapy machine built by Atomic Energy of Canada Limited (AECL). Its predecessors, the Therac-6 and Therac-20, had hardware interlocks that physically prevented the electron beam from firing at high power without the beam-spreading target in place. In the Therac-25, those hardware interlocks were removed; safety was enforced by software alone.

Between June 1985 and January 1987, the machine massively overdosed six patients in the US and Canada. At least three died from the overdoses. The principal defect was a race condition: if the operator entered and corrected treatment data quickly enough — something experienced operators did routinely — the machine could fire the full-power electron beam without the target in place, delivering doses estimated at over 100 times the intended amount. The machine reported only a cryptic “MALFUNCTION 54,” and operators, conditioned to frequent benign error messages, resumed treatment.
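The shape of such a race can be replayed deterministically. The sketch below is a toy model, not a reconstruction of AECL's code: the `Console` class, the two mode names, and the function names are all hypothetical. It assumes only that two machine settings are derived from the same operator entry at different moments, so that an edit landing between the two reads leaves them inconsistent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Console:
    # Hypothetical modes: "xray" needs high beam power with the target in place;
    # "electron" needs low beam power with the target moved out.
    mode: str


def run_setup(console: Console, edit_during_setup: Optional[str] = None) -> Tuple[str, bool]:
    """Replay one setup pass; both settings are derived from console.mode."""
    # First read: beam power is chosen from the mode as originally entered.
    beam_power = "high" if console.mode == "xray" else "low"
    # The operator's correction lands between the two reads.
    if edit_during_setup is not None:
        console.mode = edit_during_setup
    # Second read: target position is chosen from the mode as it is now.
    target_in_place = console.mode == "xray"
    return beam_power, target_in_place


def interlock_allows(beam_power: str, target_in_place: bool) -> bool:
    """Independent check on the final state, whatever sequence produced it."""
    return not (beam_power == "high" and not target_in_place)


normal = run_setup(Console("xray"))
raced = run_setup(Console("xray"), edit_during_setup="electron")
print(normal, interlock_allows(*normal))
print(raced, interlock_allows(*raced))
```

In this model the raced run ends with high power and no target, and only the final-state check catches it. That check plays the role the hardware interlocks had on the earlier machines: it does not care how the state was reached.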

Nancy Leveson and Clark Turner’s investigation, published in IEEE Computer in 1993, became the founding case study of software safety engineering. Their conclusions reached beyond the bug: AECL had reused code from the older machines without systematic review, had no independent software testing, performed no failure analysis that included software, and repeatedly assured users the machine could not overdose. The lesson was institutional, not technical — see Secure by Design: A History of Software Security Engineering for the parallel history in security.

Ariane 5 Flight 501: Reused Code, New Rocket

On June 4, 1996, the first Ariane 5 rocket veered off course and self-destructed about 40 seconds after liftoff from Kourou, French Guiana, destroying the four Cluster science satellites it carried — a loss commonly put at several hundred million dollars.

The inquiry board’s report (the Lions report) traced the failure to a single software operation: a conversion of a 64-bit floating-point value — the rocket’s horizontal velocity — into a 16-bit signed integer. The code had been written for the Ariane 4, whose flight profile kept the value within range. Ariane 5 accelerated faster; the value overflowed; the inertial reference system shut down and emitted a diagnostic bit pattern that the flight computer interpreted as flight data. The identical backup unit, running identical software, had failed identically 72 milliseconds earlier. The function that overflowed was an alignment routine that served no purpose after liftoff on the Ariane 5 at all — it kept running because it had on Ariane 4.
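The narrowing step can be illustrated in a few lines. This is a minimal sketch, not the flight code: the function names and the two sample values are hypothetical, and Python's `struct` module, which rejects out-of-range values, stands in for a conversion that fails rather than silently wrapping. The clamping variant is one possible protection, shown only for contrast.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unprotected(value: float) -> int:
    """Narrow to a 16-bit signed integer; raises struct.error when out of range."""
    packed = struct.pack("<h", int(value))
    return struct.unpack("<h", packed)[0]


def to_int16_saturating(value: float) -> int:
    """Clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


in_range_value = 20_000.0      # hypothetical reading that fits in 16 bits
out_of_range_value = 40_000.0  # hypothetical reading that does not

print(to_int16_unprotected(in_range_value))
print(to_int16_saturating(out_of_range_value))
try:
    to_int16_unprotected(out_of_range_value)
except struct.error as exc:
    print("conversion failed:", exc)
```

Neither variant is correct by itself. The point is that the valid input range is an assumption of the original environment, and nothing in the code states it.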

Info

Ariane 501 is the standard citation for two distinct lessons: reuse is not free (code carries the assumptions of its original environment), and redundancy does not protect against design faults (two identical computers running the same wrong program fail together).

The Patriot Missile and 0.1 Seconds

On February 25, 1991, during the Gulf War, a Patriot missile battery in Dhahran, Saudi Arabia failed to track an incoming Iraqi Scud missile. The Scud struck a US Army barracks, killing 28 soldiers.

The US General Accounting Office investigation found the cause in arithmetic: the system counted time in tenths of seconds, and 0.1 has no exact representation in binary. The 24-bit fixed-point approximation introduced a tiny error each tick. The system had been designed for short deployments, but the Dhahran battery had been running for about 100 hours — long enough for the accumulated clock error of about 0.34 seconds to shift the range gate nearly 700 meters, so the system looked for the Scud in the wrong part of the sky and dismissed it. A corrected software version arrived in Dhahran the day after the strike.
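The clock arithmetic can be reproduced with exact fractions. The sketch below assumes, following the commonly cited analysis, that the 24-bit register left 23 bits after the binary point and that the stored value was truncated rather than rounded. The constant and function names are illustrative.

```python
from fractions import Fraction

FRACTION_BITS = 23      # assumption: bits after the binary point
TICKS_PER_SECOND = 10   # the clock counts tenths of a second


def stored_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """0.1 truncated to the given number of fractional bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours: int) -> float:
    """Accumulated clock error after running continuously for `hours`."""
    error_per_tick = Fraction(1, 10) - stored_tenth()
    ticks = hours * 3600 * TICKS_PER_SECOND
    return float(ticks * error_per_tick)


print(f"error per tick: {float(Fraction(1, 10) - stored_tenth()):.3e} s")
print(f"drift after 100 h: {clock_drift_seconds(100):.4f} s")
```

Under these assumptions the per-tick error is just under one ten-millionth of a second, and 3.6 million ticks bring the total to roughly a third of a second, in line with the 0.34-second figure above.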

Mars Climate Orbiter: Pounds vs. Newtons

On September 23, 1999, NASA’s Mars Climate Orbiter disappeared during orbital insertion at Mars. The mishap investigation found that ground software supplied by Lockheed Martin computed thruster impulse in pound-force seconds, while the navigation software at JPL expected newton-seconds — the metric unit specified in the interface documentation. Every trajectory correction had been slightly wrong for months; the spacecraft entered the Martian atmosphere far below its intended altitude and was destroyed. The total project cost was over $300 million.
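The size of the discrepancy follows from the conversion factor: one pound-force second is about 4.45 newton-seconds. The sketch below is an illustration only. The `Impulse` type and the sample value are hypothetical, and it shows one way an interface could carry its unit in the type rather than in a document.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second


@dataclass(frozen=True)
class Impulse:
    """Impulse stored internally in newton-seconds only."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        return cls(value * LBF_S_TO_N_S)


raw_value = 10.0  # hypothetical number written by the producer, in lbf*s

# The consumer assumes the bare number is already metric.
misread = Impulse.from_newton_seconds(raw_value)
# The producer's unit is stated explicitly at the boundary.
correct = Impulse.from_pound_force_seconds(raw_value)

ratio = correct.newton_seconds / misread.newton_seconds
print(f"misread understates the impulse by a factor of {ratio:.2f}")
```

A bare floating-point number crossing a team boundary carries no unit. Forcing the caller to name the unit at construction makes the assumption visible where it can be reviewed.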

The failure is remembered as “the metric mix-up,” but the mishap board emphasized process: navigators had noticed anomalies for months without the discrepancy being escalated and resolved. The bug was a symptom; the disaster was a communication failure between institutions.

Knight Capital: $440 Million in 45 Minutes

On August 1, 2012, Knight Capital Group, then the largest trader in US equities, deployed new order-routing software to its eight production servers. The deployment reached only seven. The new code reused a configuration flag that, on the eighth server, still activated Power Peg, an order-handling routine Knight had stopped using in 2003. The check that told Power Peg when an order had been filled had been moved elsewhere in the code years earlier, so once triggered the routine kept sending orders without ever registering that they were complete.

When the market opened, the eighth server fired millions of unintended orders across 154 stocks. In about 45 minutes, Knight accumulated a loss of approximately $440 million — roughly four times its previous year’s profit. The firm survived only days as an independent company before an emergency rescue and eventual acquisition. The SEC’s order against Knight became required reading for deployment engineering: no automated deployment verification, no kill switch, alerts that went unheeded, and reuse of a flag whose old meaning was still wired into production. See High-Frequency Trading for the surrounding ecosystem.
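An automated deployment check of the kind Knight lacked can be very small. The sketch below is hypothetical throughout: the host names, version strings, and function names are invented for illustration, and a real system would query each server rather than read a dictionary.

```python
from typing import Dict, List


def stale_hosts(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


def safe_to_enable(deployed: Dict[str, str], expected: str) -> bool:
    """Allow the new flag to be switched on only when every host matches."""
    return not stale_hosts(deployed, expected)


# Hypothetical fleet: seven servers updated, one missed.
fleet = {f"router-{i}": "v2" for i in range(1, 8)}
fleet["router-8"] = "v1"

print(stale_hosts(fleet, "v2"))
print(safe_to_enable(fleet, "v2"))
```

The design choice is to gate activation on the whole fleet instead of trusting that each copy step succeeded. A flag with an old meaning is only dangerous on a host still running the old code.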

The Post Office Horizon Scandal: Trusting the Computer

The slowest-moving disaster on this list killed no one in an instant and lost no rocket — it destroyed lives through institutional faith in software. From 1999, the UK Post Office rolled out Horizon, an accounting system built by ICL/Fujitsu, to thousands of sub-post offices. Horizon contained defects that could produce phantom accounting shortfalls. The Post Office treated the computer’s figures as unimpeachable evidence and, between 1999 and 2015, prosecuted more than 900 subpostmasters for theft and false accounting. People were imprisoned, bankrupted, and ostracized; some died before being cleared.

A 2019 High Court ruling found Horizon contained “bugs, errors and defects” capable of causing the shortfalls, and in 2024 the UK Parliament passed legislation quashing the convictions en masse — widely described as the most extensive miscarriage of justice in British legal history. The disaster was less the bugs than the legal presumption that the machine was right and the human was lying.

What the Disasters Share

  • Reuse without revalidation: Therac-25 (Therac-20 code), Ariane 5 (Ariane 4 code), Knight Capital (a recycled flag). Code embeds the assumptions of its original environment, and the assumptions do not travel with it.
  • Safety margins removed because software “worked”: the Therac-25’s deleted hardware interlocks; the Post Office’s refusal to weigh subpostmasters’ testimony against Horizon’s figures.
  • Accumulation: Patriot’s rounding error and Mars Climate Orbiter’s repeated small thruster errors were each individually negligible and collectively fatal.
  • Warnings ignored: in nearly every case operators, navigators, or auditors saw anomalies before the catastrophe, and the institution lacked a path for those observations to stop the system.

These cases are why memory safety, secure-by-design engineering, and disciplined deployment practice are not academic concerns — and why Y2K, the disaster that didn’t happen, is best read as the industry’s one great act of preventive maintenance.
