Reinforcement Learning

Summary

Reinforcement learning (RL) is the third major paradigm of machine learning, distinct from supervised learning (learning from labeled examples) and unsupervised learning (finding structure in unlabeled data). In RL, an agent learns by taking actions in an environment, receiving rewards or penalties for those actions, and updating its behavior to maximize cumulative reward over time. The paradigm traces to Samuel’s checkers program (1959) and was formalized through temporal difference learning in the 1980s. It produced TD-Gammon (1992), the first learning program to play backgammon at a level competitive with world-class players, and then became the foundation of modern AI breakthroughs: AlphaGo’s defeat of Lee Sedol (2016), OpenAI Five’s victory at Dota 2 (2019), and, most consequentially for everyday AI, the Reinforcement Learning from Human Feedback (RLHF) technique that made ChatGPT useful in 2022.

The Basic Problem: Learning from Consequences

Reinforcement learning is modeled on how biological organisms learn. When a rat presses a lever and receives food, it presses the lever more. When it receives a shock, it avoids the lever. The rat is not told what to do (as in supervised learning) or simply asked to find patterns in data (as in unsupervised learning) — it learns through the consequences of its own actions.

The formal framework was developed by Richard Bellman in the 1950s through dynamic programming: the optimal way to make decisions over time when future rewards depend on current choices. Bellman’s equations describe how to compute the expected long-term reward from any state given a policy (a mapping from states to actions), and how to improve policies when those expectations are known.
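As a rough illustration of what computing the expected long-term reward under a policy looks like, the sketch below applies the Bellman expectation backup repeatedly on a tiny two-state problem. The states, actions, probabilities, rewards, and the discount factor are all invented for illustration; only the backup itself follows Bellman's formulation.

```python
# Iterative policy evaluation on a tiny, made-up decision problem.
# Every number below is an illustrative assumption.
GAMMA = 0.9  # discount factor (assumed)

# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

# A deterministic policy: a mapping from states to actions.
POLICY = {"A": "go", "B": "stay"}


def evaluate_policy(transitions, policy, gamma=GAMMA, tol=1e-10):
    """Apply the Bellman expectation backup until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            # Expected immediate reward plus discounted value of the successor.
            new_v = sum(
                p * (r + gamma * values[s2])
                for p, s2, r in transitions[s][policy[s]]
            )
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


values = evaluate_policy(TRANSITIONS, POLICY)
print(values)
```

Note that the computation reads every transition probability from `TRANSITIONS`; it cannot run without a complete description of how the environment behaves.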

The key challenge: Bellman’s equations require knowing the probability of transitioning from one state to another as a result of each action — a “model” of the environment. For most interesting problems (chess, robot locomotion, language generation), this model is intractably large or unknown. Model-free RL learns optimal behavior through direct experience without building an explicit model.

TD-Learning and Temporal Difference

Richard Sutton and Andrew Barto, working at the University of Massachusetts Amherst, developed the theoretical foundations of modern RL through the 1980s. Their key contribution was temporal difference (TD) learning: a method for learning value functions (predictions of future reward) by comparing predictions at successive time steps.

The intuition: suppose you predict that being in state A will eventually lead to 10 points of reward. You then visit state A and move to state B, which you predict leads to 12 points, collecting no reward along the way. You should revise your prediction for state A upward, because A led to a state that is worth more than A’s own estimate. This update can be made during the experience rather than waiting for the episode to complete.
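The example with states A and B can be written as a one-step TD update. This is a minimal sketch: the step size `alpha`, the absence of discounting (`gamma=1.0`), and the zero reward on the transition are assumptions chosen to match the numbers in the text.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One-step TD update; returns the TD error that drove it."""
    # TD error: (reward + discounted next prediction) minus current prediction.
    td_error = reward + gamma * values[next_state] - values[state]
    # Move the current prediction a fraction alpha toward the new target.
    values[state] += alpha * td_error
    return td_error


values = {"A": 10.0, "B": 12.0}
td_error = td0_update(values, "A", reward=0.0, next_state="B")
print(td_error, values["A"])
```

The prediction for A moves only part of the way toward 12, because the estimate for B is itself uncertain.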

TD learning was inspired partly by behavioral psychology — it resembles the neural reward prediction error signals later discovered in dopaminergic neurons — and partly by Bellman’s formal theory. Sutton and Barto’s textbook Reinforcement Learning: An Introduction (1998, second edition 2018) is the definitive reference for the field.

TD-Gammon: Beating World-Class Backgammon Players

Gerald Tesauro at IBM Research created TD-Gammon (1992), the first RL system to reach world-class performance at a complex game. TD-Gammon used a neural network to approximate the value function for backgammon positions — the expected probability of winning from each board state — and updated the network using TD learning during self-play.

The result was startling. TD-Gammon not only reached expert-level play, it discovered moves and strategies that expert human players had not considered. Tesauro reported that top backgammon players incorporated TD-Gammon’s novel strategies into their own games after studying how it played. This was the first instance of RL producing strategy in a complex board game that changed how human experts played, a preview of what AlphaGo would do twenty-four years later.

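TD-Gammon had a major limitation: its success is usually attributed to properties specific to backgammon. Dice rolls force self-play to visit a wide variety of positions, and the chance of winning changes fairly smoothly with the board position, which suits neural network approximation. Attempts to apply the same approach directly to chess, Go, and other deterministic games were far less successful at the time.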

The Self-Play Breakthrough

TD-Gammon demonstrated that self-play — an agent playing against copies of itself — could generate training experience without human expert knowledge beyond the rules of the game. This was significant: it implied that in domains where expert knowledge was scarce or bottlenecked, an RL agent could potentially surpass human expertise purely through computational experience. Self-play with RL would later be the core insight behind AlphaGo Zero, which learned Go from scratch with no human game records.

Q-Learning and the Formal Foundation

Q-learning, developed by Christopher Watkins in his 1989 PhD thesis at Cambridge, provided the theoretical foundation for model-free RL. Q-learning estimates the value (Q-value) of taking action a in state s — the expected future reward from that point if optimal actions are taken thereafter. The Q-values are updated incrementally as the agent experiences state transitions and receives rewards.

Q-learning’s key theoretical result (Watkins and Dayan, 1992): provided every state-action pair continues to be visited and the learning rate decays appropriately, tabular Q-learning converges to the optimal Q-values regardless of the policy used to select actions during learning. This off-policy property made it a general tool for sequential decision-making problems.
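A small sketch may make the update and the "regardless of the policy" property concrete. The five-state corridor, its reward, and the hyperparameters below are hypothetical. The agent acts uniformly at random during learning, yet the greedy policy read off the learned table still heads straight for the goal. A constant learning rate is enough here only because this toy environment is deterministic.

```python
import random

N_STATES = 5           # states 0..4 in a corridor; state 4 is the terminal goal
ACTIONS = (0, 1)       # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Hypothetical deterministic environment: reward 1 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, never consults the Q-table.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # The target assumes optimal actions from the next state onward.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

The table `q` holds one entry per (state, action) pair, ten in this toy problem.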

Q-learning worked on small state spaces but scaled poorly: maintaining a table of Q-values requires storing a value for every (state, action) pair, which becomes intractable for large or continuous state spaces. This “curse of dimensionality” was the central limitation of tabular RL.

Deep Q-Networks: Atari from Pixels

The breakthrough that connected deep learning to reinforcement learning came from DeepMind. In 2013, Volodymyr Mnih and colleagues published “Playing Atari with Deep Reinforcement Learning,” demonstrating that a convolutional neural network (CNN) could serve as the function approximator for Q-values — replacing the tabular Q-value table with a neural network that generalized across visually similar states.

The Deep Q-Network (DQN) was trained on raw pixel inputs from Atari 2600 games. The agent saw the screen pixels and the game score; it controlled the joystick. No game-specific features were hand-engineered. Two key innovations made training stable:

Experience replay: Rather than updating the network immediately on each experience, experiences were stored in a replay buffer and sampled randomly for training. This broke correlations between consecutive training samples that would otherwise destabilize learning.
Target network: A second copy of the network was used to generate target values and updated less frequently than the main network, reducing oscillations in training.
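The bookkeeping behind these two ideas can be sketched without any deep learning library. In the sketch below a small table stands in for the network weights, and the capacity, batch size, sync interval, learning rate, and transitions are all hypothetical; it is not the DQN implementation, only an illustration of a replay buffer and a periodically synchronized target copy.

```python
import copy
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, *fields):
        self._storage.append(Transition(*fields))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


def td_target(transition, target_q, gamma=0.99):
    """Bootstrapped target computed from the frozen target parameters."""
    if transition.done:
        return transition.reward
    return transition.reward + gamma * max(target_q[transition.next_state])


SYNC_EVERY = 4  # hypothetical sync interval
buffer = ReplayBuffer(capacity=3, seed=0)
online_q = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # stand-in for network weights
target_q = copy.deepcopy(online_q)

for step_idx in range(1, 9):
    # Made-up transition: alternate between two states with reward 1.
    buffer.push(step_idx % 2, 0, 1.0, (step_idx + 1) % 2, False)
    batch = buffer.sample(min(2, len(buffer)))
    for t in batch:
        y = td_target(t, target_q)  # targets come from the frozen copy
        online_q[t.state][t.action] += 0.5 * (y - online_q[t.state][t.action])
    if step_idx % SYNC_EVERY == 0:
        target_q = copy.deepcopy(online_q)  # infrequent target update
```

Between syncs the targets stay fixed while `online_q` changes, which is the stabilizing effect the target network is meant to provide.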

The 2015 Nature paper evaluated DQN on 49 Atari games, learning from pixels and score alone with the same algorithm, network architecture, and hyperparameters for every game. It reached a level comparable to a professional human games tester on more than half of the games and was clearly superhuman on several. The result demonstrated that a single deep reinforcement learning agent could learn diverse tasks from raw perceptual input, and it launched a wave of research into deep RL.

Policy Gradient Methods and Complex Domains

Q-learning and its variants optimize a value function; policy gradient methods optimize the policy directly. The key insight: if we can compute how changing the policy parameters would change the expected reward, we can use gradient ascent to improve the policy.

REINFORCE (Williams, 1992) was the foundational policy gradient algorithm. Its successors — Actor-Critic, Proximal Policy Optimization (PPO) (Schulman et al., OpenAI, 2017), Soft Actor-Critic (SAC) — addressed the high variance of pure policy gradient methods while maintaining their ability to handle continuous action spaces.
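A REINFORCE-style update can be sketched on the simplest possible problem, a two-armed bandit with a softmax policy. The reward values, learning rate, and step count are assumptions, and the sketch omits the baseline that practical implementations usually subtract to reduce variance.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    """Gradient ascent on expected reward using sampled actions only."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters: one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rewards[action]
        for k in range(len(theta)):
            # Gradient of log pi(action) with respect to logit k for a softmax.
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)
```

The policy is never told which arm is better; probability mass shifts toward the rewarded arm because rewarded actions have their log-probability pushed up.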

PPO in particular became the dominant RL algorithm for complex domains because it was robust, relatively easy to tune, and applicable to both discrete and continuous action spaces. It was the algorithm used in OpenAI Five (Dota 2, 2019), whose five agents accumulated the equivalent of roughly 45,000 years of game experience through self-play before defeating the reigning Dota 2 world champions.
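Much of PPO's robustness is credited to its clipped surrogate objective, which limits how far a single update can move the policy. The sketch below shows that objective for one sample; the clip range of 0.2 is a commonly used default and is assumed here, and the function name is made up for illustration.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage estimates how much better
    the action was than average. The result is to be maximized.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside
    # the clip range in the direction that would increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_objective(1.5, 1.0))   # gain is capped at the clip boundary
print(ppo_clip_objective(1.5, -1.0))  # a bad action made more likely is fully penalized
```

The asymmetry is deliberate: improvements are capped, while changes that make things worse are charged in full.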

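AlphaGo and AlphaGo Zero (DeepMind, 2016/2017) combined deep neural networks with Monte Carlo Tree Search (MCTS). The original AlphaGo used separate policy and value networks, initialized from human expert games and refined with policy gradient self-play; AlphaGo Zero merged them into a single network that predicted both the value of a board position and a probability distribution over moves. In both systems, MCTS used these predictions to search the game tree far more efficiently than search guided by random rollouts alone. AlphaGo Zero, trained only through self-play with no human game records, surpassed the version of AlphaGo that beat Lee Sedol within three days of training and every earlier version within 40 days.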
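One way to see how network predictions steer the search is the PUCT selection rule used in AlphaGo Zero-style MCTS, sketched below for a single tree node. The data layout, the exploration constant `c_puct=1.5`, and the example statistics are assumptions for illustration, not values from the published systems.

```python
import math


def puct_select(children, c_puct=1.5):
    """Pick the child maximizing Q + U.

    children maps each action to a dict with the network's prior probability,
    the visit count so far, and the sum of values backed up through it.
    """
    total_visits = sum(ch["visits"] for ch in children.values())

    def score(ch):
        # Q: mean value observed so far (0 for an unvisited child).
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        # U: exploration bonus, large for high-prior, rarely visited moves.
        u = c_puct * ch["prior"] * math.sqrt(total_visits) / (1 + ch["visits"])
        return q + u

    return max(children, key=lambda a: score(children[a]))


children = {
    "a": {"prior": 0.6, "visits": 10, "value_sum": 5.0},
    "b": {"prior": 0.3, "visits": 0, "value_sum": 0.0},
    "c": {"prior": 0.1, "visits": 2, "value_sum": 1.8},
}
print(puct_select(children))
```

In this example the unvisited move "b" is selected: its prior is high enough that the exploration bonus outweighs the better average value of "c".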

RLHF: Making Language Models Useful

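The most consequential application of reinforcement learning in the 2020s was Reinforcement Learning from Human Feedback (RLHF). The approach grew out of work on learning from human preferences by researchers at OpenAI and DeepMind (Christiano et al., 2017), and OpenAI turned it into the key technique for making large language models (LLMs) produce helpful, harmless, and honest outputs.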

The problem: a language model trained on next-token prediction learns to predict internet text. Internet text includes harmful content, factual errors, and text that is plausible but not helpful. The model optimizes for text likelihood, not for human preferences about responses.

RLHF addresses this in three steps:

Supervised fine-tuning: Start with a pre-trained LLM, fine-tune it on human-written examples of good responses.
Reward model training: Show human raters pairs of model outputs; they rate which is better. Train a separate model to predict human preferences from (prompt, response) pairs.
RL fine-tuning: Use the reward model as a reward signal and PPO to optimize the LLM’s outputs toward responses the reward model predicts humans will prefer, while adding a KL divergence penalty to prevent the model from diverging too far from the supervised baseline.
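The second and third steps each reduce to a small formula, sketched here with plain numbers in place of model outputs. The function names, the coefficient `beta`, and the example scores and log-probabilities are hypothetical; in a real pipeline the rewards come from a trained reward model and the log-probabilities from the LLM being tuned and the supervised baseline.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Reward model loss for one comparison: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward used in the RL step: reward model score minus a KL penalty.

    logp_policy - logp_reference is the log-ratio of the sampled response
    under the tuned model and the supervised baseline; its expectation over
    samples from the tuned model is the KL divergence between the two.
    """
    return rm_score - beta * (logp_policy - logp_reference)


# The loss is small when the human-preferred response already scores higher.
print(pairwise_preference_loss(2.0, 0.0), pairwise_preference_loss(0.0, 2.0))
# A response the tuned model favors much more than the baseline is penalized.
print(kl_penalized_reward(1.0, logp_policy=-5.0, logp_reference=-6.0))
```

A larger `beta` keeps the tuned model closer to the supervised baseline at the cost of a weaker push toward the reward model's preferences.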

InstructGPT (Ouyang et al., OpenAI, 2022) demonstrated that RLHF dramatically improved model helpfulness on instruction-following tasks, with outputs from the 1.3B parameter InstructGPT model preferred over those of the 175B GPT-3 in human evaluations. ChatGPT (2022) was built on a similar RLHF pipeline. The technique, or a close variant of it, is now widely used by major LLM providers.

RLHF’s limitation is that it optimizes for human rater preferences, which may not perfectly align with actual human values. Raters can be fooled by confident-sounding wrong answers, responses that seem helpful but aren’t, or outputs optimized for immediate approval rather than long-term benefit. The field of AI alignment research is substantially concerned with improving on RLHF.