The GPU Revolution: From Pixels to Intelligence

Summary

This article traces the history of the Graphics Processing Unit — from the fixed-function chips that drew triangles in early 3D games, through NVIDIA’s invention of the programmable GPU and the CUDA platform, to the moment in 2012 when a neural network trained on gaming hardware rewrote the history of artificial intelligence. It is the story of how a device designed to render shadows and explosions became the engine of the modern AI era — and how one company’s architectural bet turned a graphics chip into the most strategically important piece of silicon in the world.

Before the GPU: The Software Renderer

In the early 1990s, 3D graphics on a personal computer meant software rendering: the CPU calculated, pixel by pixel, which color each point on the screen should be, accounting for geometry, lighting, and texture. It was exhausting work for a processor designed to run spreadsheets and word processors. A complex scene might run at ten frames per second — jerky, limited, and consuming nearly every CPU cycle available.

The games industry worked around these constraints with clever shortcuts: flat-shaded polygons, simple geometry, fog that conveniently obscured distance. Doom (id Software, 1993) — one of the defining games of the era — was not technically 3D at all; it used a clever 2.5D rendering trick that gave the illusion of depth without the cost of true three-dimensional calculation. id Software’s John Carmack was, among other things, a specialist in squeezing visual fidelity out of inadequate hardware.

The solution to the rendering bottleneck was hardware acceleration: dedicated chips that handled specific graphical calculations — texture mapping, z-buffering, triangle setup — without involving the CPU. 3dfx Interactive’s Voodoo card (1996) was the first mass-market accelerator to make 3D gaming genuinely fluid. It connected to an existing 2D graphics card and handled 3D rendering separately. Gamers bought them by the millions.

But the Voodoo and its contemporaries were fixed-function: they could execute a specific set of graphical operations, hardwired into the silicon. A programmer could not change the lighting model or invent a new visual effect; they could only use what the hardware provided.

Jensen Huang and the Programmable GPU

Jensen Huang was born in Taiwan in 1963, moved to the United States as a child, and studied electrical engineering at Oregon State and Stanford. In 1993, he co-founded NVIDIA with Chris Malachowsky and Curtis Priem. The three engineers believed that visual computing was a large enough problem to sustain a dedicated company.

NVIDIA’s early products competed in the accelerator market with limited success. The company nearly ran out of money in 1995. Its first significant hit, the RIVA 128 (1997), established NVIDIA as a serious player, and the GeForce 256 (1999) popularized the term “GPU” (Graphics Processing Unit) for a chip that handled the geometry transformation and lighting calculations that had previously required the CPU.

The decisive architectural shift came with the GeForce 3 (2001), which introduced programmable vertex and pixel shaders: small programs that developers could write to control precisely how the hardware processed geometry and color. A fixed-function chip was a vending machine — you selected from its menu. A programmable shader was a kitchen — you specified the recipe.

CPU vs. GPU: Two Different Architectures

The fundamental difference between a CPU and a GPU is not speed but structure. A CPU has a small number of powerful, general-purpose cores (typically 8–32 in a modern part), each optimized for sequential execution, branch prediction, and low-latency response to unpredictable code. A GPU has thousands of simpler cores (more than 10,000 in recent high-end parts), designed to execute the same operation simultaneously on different data. This “Single Instruction, Multiple Data” (SIMD) model, which NVIDIA calls SIMT for “Single Instruction, Multiple Threads”, is a poor fit for branch-heavy general-purpose code and extraordinarily efficient for mathematical operations that apply identically to millions of values, such as rendering millions of pixels or multiplying matrices.
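The execution model can be illustrated with a minimal sketch in plain Python rather than real GPU code. The names `saxpy_kernel` and `launch` are hypothetical, chosen only for this illustration, and the loop inside `launch` runs sequentially where a GPU would run the invocations in parallel across its cores.

```python
def saxpy_kernel(i, a, x, y, out):
    """The single instruction stream: every 'thread' runs this with a different i."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args):
    """Stand-in for a GPU launch: invoke the kernel once per data index.

    A real GPU would execute these invocations concurrently. This sketch
    runs them one after another, which gives the same result because no
    invocation depends on any other.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The kernel contains no branches and touches only its own index, which is the shape of work that maps well onto thousands of simple cores.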

CUDA: Opening the Silicon to Science

Graphics rendering, it turned out, was not the only problem that looked like “apply the same operation to millions of values simultaneously.” Scientific simulation, financial modeling, signal processing, and — critically — the training of neural networks all shared the same mathematical structure.
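The shared structure can be made concrete with matrix multiplication, the operation at the core of neural network training. In the sketch below (plain Python, with the hypothetical names `output_element` and `matmul`), every output element is computed by the same small function and depends on no other output element. That independence is the property that would let a GPU assign one element to each core.

```python
def output_element(a, b, row, col):
    """Dot product of one row of a with one column of b."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b):
    """Compute the matrix product by evaluating every output element independently."""
    rows, cols = len(a), len(b[0])
    return [[output_element(a, b, r, c) for c in range(cols)] for r in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))  # [[19, 22], [43, 50]]
```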

In 2007, NVIDIA released CUDA (Compute Unified Device Architecture) — a programming model and compiler that allowed developers to write general-purpose code for NVIDIA GPUs using a C-like language. For the first time, the GPU’s parallel processing power was accessible without writing graphics shaders or pretending to render triangles.

CUDA was not an instant success. The programming model was unfamiliar; debugging tools were primitive; the early compilers were limited. But for researchers running molecular dynamics simulations, protein folding calculations, or seismic analysis, even a clunky interface to 100x faster computation was worth the learning curve. NVIDIA released a dedicated Tesla line of GPU computing cards — without display outputs, optimized for data center use — to serve this market.

The ImageNet Moment

In 2012, the implications of CUDA for machine learning became undeniable.

Alex Krizhevsky, a PhD student at the University of Toronto, had spent months training a deep convolutional neural network on the ImageNet dataset — 1.2 million labeled images. The training ran on two NVIDIA GTX 580 gaming cards, connected to a desktop PC. The resulting network, AlexNet, entered the ImageNet Large Scale Visual Recognition Challenge and achieved a top-5 error rate of 15.3% — compared to 26.2% for the second-place entry.
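For readers unfamiliar with the metric, top-5 error counts a prediction as correct when the true label appears among the model's five highest-scoring classes. A minimal sketch in plain Python follows, using made-up scores and a hypothetical helper `top_k_error`. It is not the challenge's official evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores.

    scores: one list of per-class scores per example.
    labels: the true class index for each example.
    """
    misses = 0
    for example_scores, label in zip(scores, labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(range(len(example_scores)),
                        key=lambda c: example_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data: 2 examples, 6 classes (illustrative values only).
scores = [
    [0.05, 0.40, 0.20, 0.15, 0.12, 0.08],  # true class 0 has the lowest score
    [0.30, 0.25, 0.20, 0.10, 0.10, 0.05],  # true class 1 ranks second
]
labels = [0, 1]
print(top_k_error(scores, labels, k=5))  # 0.5
```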

The margin was not incremental. It was a discontinuity. And the hardware that made it possible cost approximately $1,000.

Before AlexNet, training a neural network of that scale on a CPU cluster would have taken weeks; on two GPUs, it took days. The economics of deep learning research had changed overnight. Within a year, every serious AI research group had GPU clusters. Within three years, cloud providers were offering GPU instances by the hour. The history of this moment is told in The Rise of Artificial Intelligence.

NVIDIA’s revenue from data center GPU sales, negligible in 2012, grew to exceed gaming revenue by 2020. In 2023 the company’s market capitalization passed $1 trillion, and in June 2024 it briefly became the most valuable company in the world. Jensen Huang’s company had built a gaming chip that became the infrastructure of the AI age.

The H100 Era and Semiconductor Geopolitics

The 2020s transformed NVIDIA from a hardware company into something closer to a standard-setting infrastructure monopoly. The H100 (2022), built on TSMC’s 4nm process and containing 80 billion transistors, was not merely faster than its predecessors: it added a Transformer Engine, dedicated hardware support for the low-precision matrix arithmetic that dominates transformer models, the architecture behind large language models. The H100’s NVLink interconnect allowed 8 GPUs in a single server node to share memory, so that software could treat them much like a single device with 640 GB of HBM memory (8 × 80 GB). Frontier language models were trained on clusters of thousands of such data center GPUs, though not universally: Google reports that Gemini was trained on its own TPUs.

Demand for H100s reportedly exceeded supply by factors of 5–10 through most of 2023 and 2024. Buyers paid $25,000–$40,000 per GPU; renters paid $3–$5 per GPU-hour in spot markets. NVIDIA’s gross margins exceeded 70%. Its data center revenue grew from $10.6 billion in fiscal 2022 to $47.5 billion in fiscal 2024, roughly a 4.5-fold increase in two years.
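As a rough illustration of what those prices imply, the sketch below computes how many rented GPU-hours cost as much as buying a card outright. It uses only the price ranges quoted above and deliberately ignores power, hosting, networking, and resale value, so it is a back-of-the-envelope figure, not a cost model. The helper name `break_even_hours` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_hours(purchase_price, hourly_rate):
    """Hours of rental whose total cost equals the purchase price."""
    return purchase_price / hourly_rate


# The two extremes, combining the ends of the quoted ranges.
shortest = break_even_hours(25_000, 5.0)  # cheap purchase, expensive rental
longest = break_even_hours(40_000, 3.0)   # expensive purchase, cheap rental

print(round(shortest), round(longest))  # 5000 13333
print(round(shortest / HOURS_PER_YEAR, 2), round(longest / HOURS_PER_YEAR, 2))  # 0.57 1.52
```

Under these simplifying assumptions, a continuously used card would match its rental cost somewhere between about seven months and a year and a half.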

The strategic importance of these chips attracted geopolitical attention. In October 2022 the United States government placed advanced AI chips, including the A100 and H100, under export controls that prohibited their sale to China without a license. The controls were extended in October 2023 to cover the slightly downgraded, export-compliant variants (the A800 and H800) that NVIDIA had built for the Chinese market. NVIDIA then designed a further China-specific chip, the H20, which complied with the revised regulations while still offering significant AI inference capability. The controls nonetheless restricted China’s access to training-scale compute, which required H100-class hardware or better.

The H100’s successor, the Blackwell architecture (GB200, 2024), doubled the transformer FLOPS again and introduced networking fabrics that could connect hundreds of GPUs across a rack as a single compute unit. Jensen Huang described the compute roadmap as accelerating rather than slowing: “I see a path to a hundred-times more performance in the next four years.” Whether this represented Moore’s Law restarted for AI-specific silicon, or a temporary surge that would plateau when architecture improvements exhausted available gains, was the central uncertainty in semiconductor forecasting through 2025.

Dead End: Fixed-Function Pipelines and Custom AI Chips

The GPU’s dominance in AI training was not inevitable, and it is not permanent.

The GPU is a general-purpose parallel processor that happens to be efficient for matrix multiplication — the core operation of neural network training. It carries significant overhead from its graphics heritage: memory hierarchies designed for texture caching, rasterization units that sit idle during inference, a programming model that requires careful memory management.

Google addressed this by designing the Tensor Processing Unit (TPU) in 2013–2015: a custom ASIC optimized specifically for the matrix multiplications used in TensorFlow, Google’s machine learning framework. The TPU eliminated everything a GPU contained that was not relevant to neural network computation.

The Custom Silicon Trap

TPUs can outperform GPUs on Google’s specific workloads. They are also poorly suited to anything else, cannot run CUDA code, and are programmed through compiler-backed frameworks such as TensorFlow and JAX rather than a general-purpose toolchain. The history of specialized hardware (from Lisp Machines to DSPs to early neural network chips) suggests that general-purpose hardware with a large software ecosystem consistently defeats specialized hardware with superior raw performance, unless the specialization becomes universal enough to attract its own ecosystem. Whether AI workloads are specialized enough, and whether NVIDIA’s lead can be maintained, is the defining hardware competition of the 2020s. The parallel with The Lisp Machine Era is explicit.

For the neural networks that GPUs enabled, see The Rise of Artificial Intelligence. For the semiconductor manufacturing that produces these chips, see The Semiconductor Race.