Schmidhuber, Hochreiter, and LSTM

Summary

In 1991, a Munich student named Sepp Hochreiter proved in his diploma thesis why neural networks forget: gradients flowing back through time shrink exponentially until the learning signal vanishes. Written in German and barely read, the thesis diagnosed the central obstacle of sequence learning years before the field accepted it existed. Together with his advisor Jürgen Schmidhuber, Hochreiter then built the cure — Long Short-Term Memory (LSTM), published in 1997 after years of rejection and indifference. Two decades later, LSTM was quietly running inside Google’s speech recognition, Google Translate, Siri, and Alexa, and the 1997 paper had become one of the most cited in the history of computer science. The story of its creators is a study in contrasts: the loud, combative Schmidhuber waging public battles over credit, and the quiet Hochreiter, still in Linz, trying to mount a comeback against the Transformer that displaced his invention.

The Thesis Nobody Read

In 1991, Sepp Hochreiter (born 1967 in Mühldorf am Inn, Bavaria) submitted his diploma thesis at the Technical University of Munich, advised by a young researcher named Jürgen Schmidhuber. Untersuchungen zu dynamischen neuronalen Netzen (“Investigations into dynamic neural networks”) contained a result of fundamental importance: a mathematical analysis showing that in recurrent neural networks trained by backpropagation through time, the error signal flowing backward either shrinks exponentially — the vanishing gradient — or blows up. In practice, this meant a network could not learn dependencies spanning more than a handful of time steps. It would always forget.
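The exponential shrinkage can be illustrated with a toy model. The sketch below is not taken from the thesis: it assumes a one-unit recurrent network h_t = tanh(w * h_{t-1}) with a hypothetical weight w, and multiplies the chain-rule factors that backpropagation through time would accumulate. The function names are made up for illustration.

```python
import math

def backprop_factor(w, steps, h0=0.5):
    """Return |d h_T / d h_0| for the toy scalar RNN h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)

def linear_factor(w, steps):
    """Same quantity for a purely linear unit h_t = w * h_{t-1}."""
    return abs(w) ** steps

for steps in (1, 10, 50):
    # With |w| < 1 the factor shrinks; with |w| > 1 and no squashing it explodes.
    print(steps, backprop_factor(0.9, steps), linear_factor(1.5, steps))
```

Because every factor in the tanh case is smaller than 0.9, the product after 50 steps is already below 0.9^50, which is the sense in which the learning signal from distant time steps disappears.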

The diagnosis applied not just to recurrent networks but to any deep network: it explained why deep learning didn’t work, at a time when most researchers had simply concluded that it didn’t. Yet the thesis was written in German, never widely translated, and went largely unnoticed. Yoshua Bengio and colleagues independently published a similar analysis in 1994, which became the standard citation — an early skirmish in what would become a career-long credit war (see Yoshua Bengio and the Montreal School).

Long Short-Term Memory

Diagnosis led to cure. Hochreiter and Schmidhuber designed a recurrent architecture in which the error signal could survive indefinitely: a memory cell whose internal state is updated additively rather than multiplicatively, so the gradient passing through it stays constant — they called this the Constant Error Carousel. Around the cell sit multiplicative gates, small learned valves that decide what enters the memory and what is read out of it.
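The difference between an additive and a multiplicative state update can be checked numerically. The following is a deliberately simplified sketch, not the 1997 architecture: it assumes a scalar cell whose input gate depends only on the current input, uses hypothetical weights, and adds a made-up `decay` parameter so that the multiplicative case can be compared side by side.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_step(c_prev, x, decay=1.0):
    """One step of a toy scalar memory cell (weights fixed to 1 for illustration)."""
    i = sigmoid(x)                 # input gate: how much new content is admitted
    g = math.tanh(x)               # candidate content
    # decay == 1.0 gives the purely additive update; decay < 1.0 rescales the
    # old state at every step, as an ordinary recurrent unit would.
    return decay * c_prev + i * g

def run(c0, xs, decay=1.0):
    c = c0
    for x in xs:
        c = cell_step(c, x, decay)
    return c

def carousel_gradient(xs, decay=1.0, eps=1e-3):
    """Finite-difference estimate of d c_T / d c_0."""
    return (run(eps, xs, decay) - run(0.0, xs, decay)) / eps

xs = [0.3, -1.2, 0.8] * 100        # 300 time steps
print(carousel_gradient(xs, decay=1.0))   # stays close to 1
print(carousel_gradient(xs, decay=0.9))   # collapses toward 0
```

In a full LSTM the gates also depend on the previous output, so the real gradient has additional paths; the sketch isolates only the direct path through the cell state, which is the part the carousel keeps constant.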

The work found no warm welcome. An early version circulated as a 1995 technical report, and the full paper, “Long Short-Term Memory,” did not appear until 1997, in Neural Computation — in the middle of an AI winter, when neural networks of any kind were deeply unfashionable (see Expert Systems and the First AI Winter).

Two later additions completed the design:

The forget gate (Felix Gers, Schmidhuber, and Fred Cummins, 2000, “Learning to Forget”): a third gate letting the cell learn to reset itself, releasing its memory when it is no longer needed. Virtually every LSTM used since is this three-gate version.
Connectionist Temporal Classification (CTC) (Alex Graves and colleagues at IDSIA, 2006): a training method that lets an LSTM map an unsegmented input stream — say, raw audio — to a label sequence without knowing in advance which frame corresponds to which letter. CTC made end-to-end speech recognition with LSTM practical.
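The mapping from per-frame outputs to a label sequence that CTC relies on can be sketched in a few lines. This shows only the collapsing rule (merge repeated labels, then drop a special blank symbol), not the training loss, which sums over all frame alignments that collapse to the target. The function name and the choice of "_" as the blank are assumptions for illustration.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Collapse a per-frame label sequence into an output string.

    Consecutive repeats are merged first; blanks are then removed. A blank
    between two identical labels keeps them as two separate output symbols.
    """
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Thirteen frames, five output letters: no frame-to-letter alignment was needed.
print(ctc_collapse("hh_e_ll_ll_oo"))
```

Many different frame sequences collapse to the same string, which is why the network can be trained without knowing in advance which frame belongs to which letter.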

The Years in the Wilderness

Through the 2000s, Schmidhuber’s lab at IDSIA (the Dalle Molle Institute for Artificial Intelligence Research) in Lugano, Switzerland, kept refining recurrent networks while the mainstream looked elsewhere. The vindication began on the GPU: in 2011, IDSIA’s DanNet — a GPU-trained convolutional network named after Schmidhuber’s postdoc Dan Cireșan — achieved the first superhuman performance in a computer vision contest and won four image-recognition competitions in a row, a year before AlexNet made deep learning famous (see ImageNet and the Deep Learning Revolution).

Then the world’s sequence data caught up with the architecture built for it. In 2014, Ilya Sutskever’s sequence-to-sequence paper used LSTMs to translate languages end-to-end (see Ilya Sutskever and the GPT Series). In 2015, Google deployed a CTC-trained LSTM in voice search, cutting transcription errors by 49%. In 2016, Google Translate switched to an LSTM-based neural system (GNMT), reducing translation errors by roughly 60% overnight. LSTM powered Google’s Smart Reply, Apple’s Siri and QuickType, and Amazon’s Alexa (see The Voice Assistant Revolution). By the late 2010s the 1997 paper was, by Schmidhuber’s count, the most cited neural-network paper of the 20th century, a claim its tally of more than 100,000 citations makes plausible. For a few years, almost every word a machine heard, translated, or suggested passed through a Constant Error Carousel.

Two Very Different Pioneers

Jürgen Schmidhuber (born January 17, 1963, in Munich) became deep learning’s most famous unquiet spirit. Scientific director at IDSIA for decades and, since 2021, director of the AI Initiative at KAUST in Saudi Arabia, he has pursued grand theoretical programs (artificial curiosity, Gödel machines, a formal theory of creativity) alongside a relentless campaign to correct the field’s history. He has interrupted conference talks to claim priority (most famously confronting Ian Goodfellow over GANs at NIPS 2016, arguing that they reinvented his early-1990s “predictability minimization”), publicly disputed the 2015 Nature deep-learning review by Hinton, LeCun, and Bengio for slighting earlier work, and maintains exhaustive annotated histories of who invented what. The community coined a verb for it: to be “Schmidhubered.” When the 2018 Turing Award went to Hinton, LeCun, and Bengio, and not to him, many observers, whatever they thought of his manner, conceded he had a case. The New York Times had already asked in 2016 whether AI, when it matures, might call Schmidhuber “Dad.”

Sepp Hochreiter chose the opposite path. After years in bioinformatics, he became a professor at Johannes Kepler University Linz in Austria in 2006, where he heads the Institute for Machine Learning. Largely unknown to the public (an Austrian business magazine profile was titled “Ohne Sepp keine Siri,” or “Without Sepp, no Siri”), he stayed close to the architecture he created. In 2024 he co-founded the Linz startup NXAI and published xLSTM, an extended LSTM with exponential gating and new memory structures, pitched as a European, more compute-efficient challenger to Transformer-based large language models.

⚠️ Dead End? Displaced by the Transformer

LSTM’s reign over sequence processing ended abruptly. The 2017 paper “Attention Is All You Need” — written largely by Google engineers who had spent years wrestling with LSTM-based translation — dispensed with recurrence entirely (see The Transformer Architecture). The fatal weakness was not accuracy but parallelism: an LSTM must process a sequence step by step, each state depending on the last, while a Transformer attends to all positions at once and therefore saturates modern GPUs. In the scaling race that produced GPT and its successors, sequential training was an unaffordable luxury, and by 2020 LSTM had vanished from frontier language models.
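The parallelism argument can be made concrete with a toy comparison. The sketch below is an illustration under strong simplifying assumptions: scalar inputs, a plain tanh recurrence standing in for an LSTM, and a bare-bones self-attention with no learned weights and no causal mask. All function names are hypothetical.

```python
import math

def recurrent_states(xs, w=0.5):
    """Sequential: state t cannot be computed before state t-1 exists."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)   # depends on the previous state
        states.append(h)
    return states

def attention_output(xs, t):
    """Output at position t, computed from the raw inputs alone."""
    scores = [xs[t] * x for x in xs]             # similarity of position t to every position
    m = max(scores)                              # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]  # unnormalised softmax
    total = sum(weights)
    return sum(wt * x for wt, x in zip(weights, xs)) / total

xs = [0.1, -0.4, 0.9, 0.3]
states = recurrent_states(xs)

# No output depends on another output, so these calls could run in any
# order, or all at once on parallel hardware.
outputs = [attention_output(xs, t) for t in range(len(xs))]
print(states)
print(outputs)
```

The loop in `recurrent_states` has a data dependency from one iteration to the next, whereas each call to `attention_output` is independent of the others; that independence is what lets attention be batched into large matrix operations.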

Yet it is a peculiar kind of dead end. The Transformer’s quadratic cost in sequence length has revived interest in recurrent designs with constant memory — xLSTM, state-space models like Mamba — and LSTMs still run in embedded and low-latency systems where attention is too expensive. The Constant Error Carousel may yet turn again. And the deeper lesson of the story stands regardless: the decisive ideas of the deep-learning era were published in 1991–1997, by a student and his advisor, to an audience that wasn’t listening. In computing, being right too early is operationally indistinguishable from being wrong — until, suddenly, it isn’t.