Computer Vision

Zusammenfassung

Computer vision is the field that tries to give machines the ability to see — to extract meaning from images and video the way humans do effortlessly and computers, for decades, could not at all. It is one of the oldest dreams of artificial intelligence and one of its most humbling. A famous 1966 MIT project once imagined that “solving vision” could be a summer’s work for an undergraduate; it took roughly fifty years. The field’s history runs from hand-crafted edge detectors and geometric models, through the slow accumulation of feature-engineering techniques, to the deep learning earthquake of 2012, when a convolutional neural network shattered every benchmark and changed not just computer vision but all of AI. Today vision systems read medical scans, drive cars, unlock phones with faces, and power the image-generation models reshaping art. This article traces the field as a whole, complementing the entries on convolutional networks and ImageNet.

The “Summer Vision Project” and the Hard Problem

In 1966, Seymour Papert at MIT launched the Summer Vision Project, assigning students to spend the summer building a system that could segment a scene into objects and describe them. The optimism captured the era’s misjudgment: tasks humans find trivial (recognizing a cup, telling a cat from a dog) turned out to be among the hardest to program, while tasks humans find hard (chess, arithmetic) turned out to be relatively easy for machines. This inversion is now called Moravec’s paradox — the realization that sensorimotor and perceptual skills, honed by billions of years of evolution, are computationally enormous, while abstract reasoning is comparatively cheap.

The core difficulty is that an image is just a grid of numbers (pixel intensities), and the same object produces wildly different number grids depending on lighting, angle, distance, occlusion, and background. A vision system must find the invariant structure — “this is a chair” — beneath all that variation. For decades this proved intractable.

The Classical Era: Edges, Features, and Geometry

Early computer vision built understanding from the bottom up. David Marr, in his influential (posthumously published) 1982 book Vision, proposed a layered theory: from a raw image to a “primal sketch” of edges and blobs, to a “2½-D sketch” of surfaces, to a full 3-D model. Researchers built edge detectors (the Canny edge detector, 1986, remains a textbook staple), corner detectors, and methods to fit geometric shapes.

A major strand was feature engineering: hand-designing mathematical descriptors that captured distinctive, repeatable image patches robust to scale and rotation. David Lowe’s SIFT (Scale-Invariant Feature Transform, 1999) and later SURF and HOG (Histogram of Oriented Gradients) let systems match the same object across different photos, enabling panorama stitching, object recognition, and early structure-from-motion. The Viola–Jones face detector (2001) was a landmark practical success — fast enough to run in cameras, it put the green focus boxes on faces in consumer digital cameras throughout the 2000s.

This era worked, within limits. But every new task demanded clever new hand-designed features, and performance on the hardest problem — recognizing thousands of object categories in arbitrary photos — stayed stubbornly poor. By the late 2000s, accuracy on general image classification was improving only incrementally.

The 2012 Earthquake: AlexNet

Everything changed in a single year. The ImageNet dataset, assembled under Fei-Fei Li from 2009, provided something the field had lacked: roughly 14 million labeled images across thousands of categories, enough data to train very large models. The associated competition, the ILSVRC, became the field’s proving ground.

In 2012, a deep convolutional neural network called AlexNet — built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto — won the ImageNet competition by a stunning margin, cutting the error rate from around 26% to about 16% while every competitor used classical methods. AlexNet’s ingredients were not entirely new (convolutional nets dated to LeCun’s 1980s–90s work, themselves rooted in Fukushima’s 1980 Neocognitron), but three things came together: large labeled data (ImageNet), GPU compute (training on Nvidia GPUs made deep nets feasible), and refinements like the ReLU activation and dropout regularization.

The result was a paradigm collapse. Within two years, essentially the entire computer vision community abandoned hand-crafted features for learned features — letting the network discover, layer by layer, the edges, textures, parts, and objects that engineers had spent decades hand-designing. Deeper architectures followed in rapid succession: VGG, GoogLeNet/Inception, and crucially ResNet (2015), whose “residual connections” enabled networks hundreds of layers deep and pushed image-classification error below the human benchmark on ImageNet. The 2012 ImageNet moment is widely regarded as the spark of the entire modern deep-learning era — its impact reached far beyond vision into speech, language, and everything that followed.

Beyond Classification: Detection, Segmentation, Generation

Once classification was solved, the field tackled harder structured tasks. Object detection (locating and labeling multiple objects in an image) advanced through R-CNN, Fast/Faster R-CNN, and the real-time YOLO (“You Only Look Once”) family. Semantic and instance segmentation (labeling every pixel) matured with architectures like U-Net (now ubiquitous in medical imaging) and Mask R-CNN.

More recently, the Transformer architecture, which conquered language, migrated to vision as the Vision Transformer (ViT) in 2020, challenging the long dominance of convolutions. And vision fused with language: CLIP (2021) learned a shared representation of images and text, enabling “zero-shot” recognition and powering the text-to-image generation revolution. Foundation models like Meta’s Segment Anything (2023) brought general-purpose, promptable segmentation. Vision became less a separate field and more a modality woven into multimodal AI systems.

Where Vision Lives Now

Computer vision is among the most thoroughly deployed AI technologies. It powers face unlock on phones, medical imaging triage (detecting tumors, diabetic retinopathy), autonomous vehicles and driver assistance, industrial inspection on assembly lines, agricultural crop and weed analysis, satellite and aerial imagery analysis, augmented reality, and content moderation at platform scale. It is also the engine of pervasive surveillance — automated facial recognition by states and companies — which has made vision a central battleground for AI ethics and civil liberties (see AI Ethics and Algorithmic Bias).

Dead End: The Hand-Crafted Feature Pipeline

The most instructive dead end in computer vision is the entire feature-engineering paradigm that dominated the field for roughly forty years. Brilliant, mathematically elegant techniques — SIFT, HOG, wavelets, deformable part models, carefully tuned multi-stage pipelines — represented thousands of person-years of human ingenuity. Researchers built careers on designing better hand-crafted features, and the conventional wisdom held that domain expertise and clever feature design were the path forward.

The deep learning revolution rendered nearly all of it obsolete almost overnight. A network given enough data and compute learned better features than humans could design, and it learned them automatically for each new task. The lesson — sometimes called “the bitter lesson” after a 2019 essay by Rich Sutton — was that general methods leveraging computation and data tend to win over approaches that bake in human knowledge, however clever. It was a humbling reversal: the field’s accumulated expertise in how to see was largely supplanted by the brute combination of data and learning. The classical techniques survive in niches (low-power devices, geometric problems like camera calibration, situations with little training data), but the era of designing vision systems by hand is over.

📚 Sources

The MIT Summer Vision Project (1966) — Seymour Papert memo — the original assignment that underestimated vision
David Marr, Vision (1982) — MIT Press — the foundational computational theory of vision
David Lowe, “Distinctive Image Features from Scale-Invariant Keypoints” (SIFT, 2004) — the canonical hand-crafted feature descriptor
Krizhevsky, Sutskever & Hinton, “ImageNet Classification with Deep CNNs” (AlexNet, 2012) — the paper that triggered the deep learning era
He et al., “Deep Residual Learning” (ResNet, 2015) — residual connections and superhuman ImageNet accuracy
Dosovitskiy et al., “An Image Is Worth 16x16 Words” (Vision Transformer, 2020) — Transformers applied to vision
Rich Sutton, “The Bitter Lesson” (2019) — why data and compute beat hand-engineered knowledge