Unicode and UTF-8: The Long Road to a Universal Character Set

Summary

Before Unicode, every computer system spoke a different language — not in metaphor, but in the literal encoding of text. A file created on a Japanese NEC PC-9800 was unreadable on an American IBM PC. Email sent from Europe garbled its accented characters in American inboxes. Web pages with Chinese characters were indistinguishable from random noise on browsers that expected Latin text. The Unicode Consortium was founded in 1991 to solve this problem permanently: one encoding to represent every character in every writing system on Earth. The solution was sound in principle and chaotic in practice, requiring nearly two decades to fully converge — and producing along the way one of the most elegant engineering decisions in computing history, scribbled on a placemat in a New Jersey diner.

The Tower of Babel: Pre-Unicode Character Encoding

The fundamental problem was that computers store text as numbers, and different systems used different numbers for the same characters.

ASCII (American Standard Code for Information Interchange, first published in 1963) encoded 128 characters: the 26 uppercase and 26 lowercase Latin letters, 10 digits, punctuation, and 33 control characters. With 7 bits per character, ASCII was sufficient for English. It was not sufficient for French, German, Spanish, Portuguese, or any language using diacritics, let alone Arabic, Hebrew, Russian, Chinese, Japanese, Korean, or any of the hundreds of other writing systems used by human beings.

The response was a proliferation of incompatible encodings. ISO 8859-1 (Latin-1) added 96 additional characters for Western European languages using the 8th bit. ISO 8859-5 covered Cyrillic; ISO 8859-6 covered Arabic; ISO 8859-7 covered Greek; ISO 8859-8 covered Hebrew. Each reused the same upper byte values (0xA0–0xFF) for different purposes, meaning a byte value of 0xE9 meant “é” in Latin-1, “щ” in ISO 8859-5, and “ι” in ISO 8859-7.
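
The ambiguity is easy to reproduce. The following is a minimal sketch that decodes the single byte 0xE9 with three of Python's built-in ISO 8859 codecs; the variable names are illustrative only.

```python
# One byte, three meanings: decode 0xE9 under three ISO 8859 parts.
raw = bytes([0xE9])

decoded = {
    "ISO 8859-1 (Latin-1)": raw.decode("latin_1"),
    "ISO 8859-5 (Cyrillic)": raw.decode("iso8859_5"),
    "ISO 8859-7 (Greek)": raw.decode("iso8859_7"),
}

for label, char in decoded.items():
    # Show the character and the Unicode code point it maps to today.
    print(f"{label}: {char!r} = U+{ord(char):04X}")
```

Nothing in the byte itself says which of the three readings is intended; that information had to travel separately from the text.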

East Asian languages were harder. Chinese, Japanese, and Korean collectively use tens of thousands of characters — far more than 256. Japan alone developed multiple competing encodings: Shift_JIS (developed by Microsoft Japan and others), EUC-JP (Unix-based systems), ISO-2022-JP (email), and JIS X standards covering different contexts. The same Japanese document might display correctly in one environment and as random characters in another depending on which encoding the receiving system assumed.

The code page system that MS-DOS and early Windows used was a pragmatic acknowledgment of the mess: a single byte could mean different things depending on which code page was active. Code page 437 was the original IBM PC encoding. Code page 850 was the multilingual Latin extension. Code page 932 was Shift_JIS. Switching between them required rebooting or explicit software configuration. Multilingual documents — text containing both Latin and Cyrillic characters, or both Chinese and Japanese — were effectively impossible within a single file.

The Unicode Consortium

The Unicode project began in 1987 at Xerox and Apple, whose engineers were independently solving the same problem and decided to collaborate. Joe Becker at Xerox coined the name “Unicode” — a universal, uniform, unique encoding. Lee Collins and Mark Davis at Apple joined the effort. The original design goal was simple: one 16-bit code per character, sufficient for 65,536 distinct characters, which the founders believed would be enough for all living writing systems.

The Unicode Consortium was formalized in 1991. Version 1.0 of the Unicode Standard was published that year, covering 7,161 characters. The initial scope was quickly revealed to be too optimistic: the CJK Unified Ideographs block alone required 20,902 characters for the basic unified Chinese/Japanese/Korean ideographic set, which had been compressed through a controversial “unification” process that treated visually similar characters in Chinese, Japanese, and Korean as the same code point — a decision that made computer scientists comfortable and linguists furious.

By Unicode 2.0 (1996), the Consortium had expanded the code space to 1,114,112 code points (the current maximum), divided into 17 “planes” of 65,536 code points each. Plane 0 (the Basic Multilingual Plane, or BMP) contained most living writing systems. Planes 1 and 2 came to hold supplementary characters including historic scripts, emoji, and CJK extension ideographs. Plane 3 was later assigned to further CJK extensions, planes 4–13 remain unassigned, plane 14 holds a small set of special-purpose characters, and planes 15 and 16 are reserved for private use.
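
The plane arithmetic can be checked directly. This is a small sketch, with hypothetical constant and function names, showing that 17 planes of 65,536 code points give the 1,114,112 total and that a character's plane is simply its code point divided by 65,536.

```python
PLANE_SIZE = 0x10000                             # 65,536 code points per plane
PLANE_COUNT = 17                                 # planes 0 through 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1    # 0x10FFFF

def plane_of(char: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(char) >> 16   # same as ord(char) // 65,536

for ch in ("A", "語", "😀"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```

Here "A" and "語" fall in plane 0 (the BMP), while the emoji U+1F600 falls in plane 1.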

The Great CJK Unification Controversy

The decision to unify visually similar Chinese, Japanese, and Korean characters into single Unicode code points, known as “Han unification,” was contested from the start. Chinese, Japanese, and Korean typographers argued that their writing systems had distinct visual traditions for characters that shared historical origins: a character such as 言 is conventionally drawn one way in Chinese typography and a slightly different way in Japanese, yet both forms share one code point. The same code point, rendered in different fonts, could display the “correct” version for one language and a “wrong” version for another. The technical solution (using language tags or font selection to disambiguate) satisfied the engineers who implemented it. It has never fully satisfied the users affected by it.

Ken Thompson’s Placemat: The Invention of UTF-8

The 16-bit fixed-width encoding that early Unicode used, UCS-2, had a catastrophic flaw for Unix systems: every ASCII character was stored as a 16-bit unit whose upper byte was zero, so ordinary text was full of null bytes (0x00). Unix programs throughout the 1970s and 1980s used the null byte as the string terminator, the byte that signals “the string ends here.” A 16-bit encoding that could contain null bytes in the middle of ordinary text would break every Unix string-handling function ever written.
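
The problem can be sketched in a few lines. Python has no separate UCS-2 codec, so this example assumes big-endian UTF-16, which produces the same bytes as UCS-2 for BMP-only text; `c_strlen` is a hypothetical stand-in for C's `strlen`, not a real library call.

```python
text = "Unix"

ucs2 = text.encode("utf-16-be")   # identical to UCS-2 for BMP-only text
utf8 = text.encode("utf-8")

def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen: count bytes up to the first 0x00 terminator."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

print(ucs2)             # b'\x00U\x00n\x00i\x00x'
print(c_strlen(ucs2))   # 0: the very first byte looks like a terminator
print(c_strlen(utf8))   # 4
```

A C routine handed the 16-bit form would see an empty string, which is why a byte-oriented encoding without embedded zeros mattered so much to Unix.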

In September 1992, Rob Pike and Ken Thompson were having dinner at a diner in New Jersey when they worked out the solution on a placemat. The result was UTF-8: a variable-length encoding for Unicode that had several crucial properties:

  • ASCII characters (code points 0–127) were represented as a single byte with the same value as ASCII — making UTF-8 a superset of ASCII.
  • No non-ASCII character would ever produce a null byte.
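  • Lead bytes and continuation bytes were drawn from disjoint ranges, and every byte of a multi-byte sequence had its high bit set, so the encoding of one character could never appear inside the encoding of another, and a UTF-8 string could be scanned forward or backward to find character boundaries.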
  • Self-synchronizing: if you started reading a UTF-8 sequence at any byte, you could determine whether you were at the start of a character.
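
These properties can be observed with Python's built-in UTF-8 codec. The sketch below relies on one fact about the format: continuation bytes (every byte of a multi-byte sequence after the first) always have the bit pattern 10xxxxxx. The helper names `is_continuation` and `resync` are hypothetical.

```python
sample = "naïve 日本語 😀"
encoded = sample.encode("utf-8")

# ASCII text is byte-for-byte identical in UTF-8.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# No zero byte appears unless the text itself contains U+0000.
assert 0x00 not in encoded

def is_continuation(byte: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def resync(buf: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the start of its character."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos

# Offset 3 lands in the middle of the two-byte encoding of "ï".
start = resync(encoded, 3)
print(start)                            # 2
print(encoded[start:].decode("utf-8"))  # ïve 日本語 😀
```

Because a reader dropped at any offset needs to back up at most three bytes, a damaged or truncated stream costs one character rather than everything that follows.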

The encoding scheme used the high bits of the first byte to indicate how many bytes the character occupied:

  • 0xxxxxxx: single-byte character (ASCII)
  • 110xxxxx 10xxxxxx: two-byte character
  • 1110xxxx 10xxxxxx 10xxxxxx: three-byte character
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four-byte character
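
The four patterns translate almost mechanically into code. This is a minimal sketch of an encoder for a single code point, not the original Plan 9 implementation; the function name is hypothetical, and the range check (rejecting surrogates and values above 0x10FFFF) is an assumption based on how UTF-8 is defined today rather than something stated above.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the four bit patterns."""
    # Assumption: follow the modern definition and reject surrogates
    # and anything beyond the last code point, 0x10FFFF.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "Aé語😀":
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ')}")
```

Each `x` in the patterns is one payload bit, so the four forms carry 7, 11, 16, and 21 bits respectively, which is enough to reach every plane.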

The placemat design was elegant enough that Plan 9 (Bell Labs’ successor to Unix) adopted UTF-8 immediately. But the broader adoption was slow.

The BOM Wars

Microsoft’s approach to Unicode was UCS-2 in Windows NT (1993) and later UTF-16 (which could encode all Unicode code points, not just the BMP). Windows used a Byte Order Mark (BOM) — a special byte sequence (0xFF 0xFE or 0xFE 0xFF) at the beginning of a text file — to indicate byte order and encoding.

The BOM was technically defensible: it allowed a text reader to determine whether a file was big-endian or little-endian UTF-16 without external metadata. It was practically catastrophic for UTF-8 files that included it: the three-byte BOM (0xEF 0xBB 0xBF) confused Unix programs that expected the file to start with actual content, broke shell scripts (which relied on the first characters being #!), and caused subtle failures in programs that processed text files line by line.
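
The shell-script failure is simple to demonstrate. The sketch below uses the BOM constant from Python's standard `codecs` module and the built-in `utf-8-sig` codec; `strip_utf8_bom` is a hypothetical helper and the script contents are made up for illustration.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A shell script saved by an editor that prepends the BOM.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

print(script[:2] == b"#!")                   # False: the file starts with EF BB
print(strip_utf8_bom(script)[:2] == b"#!")   # True

# The "utf-8-sig" codec removes the BOM while decoding;
# plain "utf-8" keeps it as an invisible U+FEFF character.
print(script.decode("utf-8-sig")[:2])        # #!
print(script.decode("utf-8")[0] == "\ufeff") # True
```

Tools that must accept files from both worlds commonly read with a BOM-tolerant decoder and write without one.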

The BOM wars, the decade-long conflict between Windows text files that included the BOM and Unix/web text files that did not, were one of the more tedious encoding problems of the 2000s. UTF-8 without BOM became the standard for web content; Windows programs continued writing the BOM by default; both sides found reasons to claim the other was wrong.

The Web Convergence

In the early 2000s, web pages used a chaotic mix of encodings. A page from a Japanese site might be Shift_JIS; a page from a Russian site might be Windows-1251; a Western European page might be ISO 8859-1 or Windows-1252. Browsers maintained lookup tables for dozens of encodings and relied on HTTP Content-Type headers and HTML <meta charset> declarations to identify which encoding applied.
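
The header-driven part of that process can be sketched as follows. This is a simplified illustration, not how any particular browser works: real browsers also inspect the document itself and apply further heuristics. The function name `decode_body` and the Windows-1252 fallback are assumptions made for the example; the header parsing uses the standard library's `email.message.Message`.

```python
from email.message import Message

def decode_body(body: bytes, content_type: str,
                fallback: str = "windows-1252") -> str:
    """Decode a response body using the charset named in Content-Type."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None.
    charset = header.get_content_charset() or fallback
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The label is not one Python's codec registry recognizes.
        return body.decode(fallback, errors="replace")

japanese = "日本語".encode("shift_jis")
russian = "Привет".encode("cp1251")

print(decode_body(japanese, "text/html; charset=Shift_JIS"))
print(decode_body(russian, "text/html; charset=windows-1251"))
print(decode_body(b"caf\xe9", "text/html"))   # no charset: fallback applies
```

When the declared charset is missing or wrong, the bytes still decode to something, just not to what the author wrote, which is how garbled pages arose.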

The convergence to UTF-8 was slow and driven by the web’s growth. As websites sought international audiences, UTF-8, which could represent any Unicode character while leaving existing ASCII text unchanged, was the only practical choice. Google began pushing UTF-8 as the encoding for web content through its crawlers and search ranking; pages in consistent encodings ranked better. The W3C declared UTF-8 the recommended encoding for HTML in 2000. In 2008, Google reported that UTF-8 had become the most common encoding on the web, overtaking both ASCII and the Western European encodings; within a few years it accounted for more than half of all web pages.

By 2020, UTF-8 accounted for roughly 95% of web pages. The encoding wars had ended. The long tail of legacy systems (mainframe EBCDIC, Shift_JIS in older Japanese software, Windows-1252 in older Word documents) persisted in specific contexts but no longer defined the web’s lingua franca.

Emoji and the Supplementary Planes

Unicode’s expansion to supplementary planes — code points beyond the BMP — was initially used for historic scripts, mathematical symbols, and the full CJK extension set. Then emoji arrived.

SoftBank in Japan introduced emoji in 1997; NTT DoCoMo created a competing set in 1999. Both used proprietary encodings that worked only on specific Japanese mobile networks. When Apple launched the iPhone in Japan in 2008, it needed to support these emoji sets to compete in the market. Apple, Google, and the Unicode Consortium negotiated the encoding of a standardized emoji set into Unicode in 2010, with Emoji 1.0 published in 2015.

The subsequent emoji expansion, with hundreds of new emoji approved across annual Unicode releases, generated cultural debates the character encoding standards bodies had never anticipated: which skin tones to represent, whether certain flags should be included, how to handle combinations like family emoji. By Emoji 13.1 (2020), the standard counted 3,521 emoji. The supplementary planes, designed for historic scripts that no living person wrote, were shared with pictures of pizza and thumbs-up gestures.
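
What this means at the byte level can be shown with a short sketch. It assumes three well-known encodings of emoji: a single supplementary-plane code point (U+1F600), a family built by joining three people with ZERO WIDTH JOINER (U+200D), and a thumbs-up followed by a skin tone modifier. Variable names are illustrative.

```python
grin = "\U0001F600"   # one code point in plane 1

print(len(grin.encode("utf-8")))            # 4 bytes in UTF-8
print(len(grin.encode("utf-16-le")) // 2)   # 2 code units: a surrogate pair

# A "family" emoji is a sequence: man + ZWJ + woman + ZWJ + girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print([f"U+{ord(ch):04X}" for ch in family])
print(len(family), len(family.encode("utf-8")))   # 5 code points, 18 bytes

# Skin tone is a separate modifier code point after the base emoji.
thumbs = "\U0001F44D\U0001F3FD"
print(len(thumbs))   # 2 code points, displayed as a single glyph
```

A single glyph on screen can therefore be several code points and well over a dozen bytes, so counting "characters" depends on which layer is being counted.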

