Of Mice, Men, and Machines

A historical perspective on reinforcement learning

June 26, 2024

Reinforcement learning, used for a wide variety of machine learning tasks, has its roots in the observation of animal behavior. Its evolution from its earliest origins in the 19th century to the present exemplifies a path often taken in biological cybernetics: from Nature to abstraction – and back to Nature.

Claude Shannon demonstrates Theseus. Photographer unknown.

Senta learns to ride a bike. A vacuum robot returns to its charging station. A newborn fawn takes its first wobbly steps. Leon buys a fruit yoghurt. Spotify suggests a new song.

The similarities between these situations are so fundamental that they are easy to overlook. For one thing, an active agent tries to achieve a goal, be it locomotion, a yummy snack, or a numerical reward signal. This agent interacts with its environment by trial and error and learns from previous attempts: “Pineapple yoghurt tastes good,” “If I go too fast, I might crash,” or “This user skips songs by Taylor Swift.”

The trial-and-error search and an (often delayed) reward make all these situations examples of reinforcement learning. In computer science, reinforcement learning is one of the three basic paradigms of machine learning. To understand what it is about, it can be useful to contrast it with the other two approaches: In supervised learning, a computer is explicitly trained by human “teachers”. Often used in image recognition, the training might be something like this: “These 500 images show bikes, and these other 500 images do not show bikes.” In contrast, unsupervised learning lets the learning algorithm discover structure in its input on its own. For example, the task “Sort the fruits in this basket!” can be solved by arranging all the items according to their size, color, and shape – without teaching the machine what a plum or an apple looks like.

Thorndike’s cats and Pavlov’s dogs

So if reinforcement learning is nothing like either of the two approaches, how does it work, and where does the idea come from? It is no coincidence that the above examples range from robots and apps to animals and humans. In fact, the surprisingly old concept draws inspiration from early behaviorism: In the 1890s, the psychologist Edward Thorndike conducted a series of experiments on the learning process in cats. He summarized his main conclusions in the Law of Effect: “Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal (…) will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal (…) will be less likely to occur.” Almost 40 years later, the word “reinforcement” appeared for the first time in the context of learning: in the original English translation of Ivan Pavlov’s work on conditioned reflexes, in which he described his famous experiments with dogs.

The mouse that changed the world of AI

But where did the idea of implementing trial-and-error learning in a computer originate? Perhaps surprisingly, it goes back almost as far as the earliest concrete thoughts about the possibility of artificial intelligence: in the mid-20th century, Alan Turing described a design for a “pleasure-pain system”. For a machine, “pain” and “pleasure” simply mean a numerical reward signal: the vacuum robot that returns to the station quickly might receive a high number, whereas taking more time or returning with an almost empty battery might yield a lower number.

What the father of modern computer science conceived in theory was put into practice just two years later: Claude Shannon, the founder of information theory, devised a mechanical mouse that could learn to navigate a maze by trial and error, “remembering” previous attempts. Many computer scientists consider this project, named Theseus after the Greek hero, to be the first example of artificial intelligence and an inspiration for the entire field.

Reward and punishment

However, to make reinforcement learning applicable to a wide range of machine learning problems, the concept had to be formalized. In its basic form, it is modeled on what mathematicians call a Markov Decision Process: a decision maker can take different actions to move from one of several possible states to another. For the vacuum robot, a state might be its location and its charge level, and an action might mean heading in a direction. After each action, the agent gets an immediate reward: numerical feedback that rewards a move toward the station and punishes an unnecessary loop. However, the agent usually does not know the exact outcome of an action; it only knows the probabilities of the possible results. The goal: to maximize a cumulative reward that combines many immediate rewards.
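To make the formalism concrete, here is a minimal Python sketch of a Markov Decision Process for the vacuum robot. Everything specific in it is an assumption made for illustration and does not come from the article: a corridor of five cells with the station in cell 0, a 20 % chance that a move fails, the reward numbers, and the discount factor. It uses value iteration, one standard way of computing the best achievable cumulative reward for each state.

```python
# Illustrative toy model; all numbers below are assumptions.
GAMMA = 0.9        # discount factor: how much future rewards count
N_CELLS = 5        # corridor cells 0..4
STATION = 0        # the charging station sits in cell 0
SLIP = 0.2         # probability that a move fails and the robot stays put
ACTIONS = {"left": -1, "right": +1}


def transitions(state, action):
    """Return (probability, next_state, reward) for each possible outcome."""
    target = min(max(state + ACTIONS[action], 0), N_CELLS - 1)
    outcomes = []
    for prob, nxt in ((1 - SLIP, target), (SLIP, state)):
        # Reaching the station is rewarded; every other step costs a little.
        reward = 10.0 if nxt == STATION else -1.0
        outcomes.append((prob, nxt, reward))
    return outcomes


def q_value(values, state, action):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * values[s2])
               for p, s2, r in transitions(state, action))


def value_iteration(tol=1e-9):
    """Repeat the Bellman update until the values stop changing."""
    values = [0.0] * N_CELLS
    while True:
        delta = 0.0
        for s in range(N_CELLS):
            if s == STATION:
                continue  # terminal state: the episode ends here
            best = max(q_value(values, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick, in every non-terminal state, the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: q_value(values, s, a))
            for s in range(N_CELLS) if s != STATION}


values = value_iteration()
policy = greedy_policy(values)
print(policy)
```

In this toy world the resulting policy sends the robot toward the station from every cell, and cells farther from the station have lower values because more penalized steps lie ahead.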

The delayed nature of the reward makes the reinforcement learning paradigm particularly well suited for problems that require a long-term perspective. As a result, reinforcement learning has celebrated remarkable triumphs in highly complex games such as Go or chess, in autonomous driving, and in teaching robots human-like motor skills – from riding a bike to flipping pancakes. Finding a strategy that is optimal not just for each next step, but for the overall outcome, requires balancing long-term and short-term goals: For Leon, only ever buying pineapple yoghurt is probably not ideal since there may be other yoghurts that are even tastier. Sticking with what is good enough versus trying to find better options is known as the tradeoff between exploitation (of previous experience) and exploration, and a good strategy balances these opposing approaches.
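Leon's yoghurt dilemma can be sketched in a few lines of Python. The example below uses an epsilon-greedy rule, one simple and widely used way to balance exploration and exploitation: most of the time Leon buys the flavor he currently rates highest, but with a small probability he tries a random one. The flavors other than pineapple, their "true" tastiness scores, the noise level, and the parameter values are all invented for illustration.

```python
import random


def run_epsilon_greedy(true_means, epsilon=0.1, steps=5000, noise=0.2, seed=0):
    """Simulate repeated purchases and return Leon's estimates and counts."""
    rng = random.Random(seed)
    flavors = list(true_means)
    estimates = {f: 0.0 for f in flavors}  # Leon's current taste ratings
    counts = {f: 0 for f in flavors}       # how often each flavor was bought
    for _ in range(steps):
        if rng.random() < epsilon:
            choice = rng.choice(flavors)              # explore: try anything
        else:
            choice = max(flavors, key=estimates.get)  # exploit: current best
        # Each tasting is a noisy sample around the flavor's true score.
        reward = rng.gauss(true_means[choice], noise)
        counts[choice] += 1
        # Incremental running average of the rewards seen for this flavor.
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return estimates, counts


# Hypothetical "true" tastiness of each flavor, unknown to Leon.
TRUE_MEANS = {"pineapple": 0.6, "cherry": 0.8, "vanilla": 0.4}

estimates, counts = run_epsilon_greedy(TRUE_MEANS)
favorite = max(estimates, key=estimates.get)
print(favorite, counts)
```

With `epsilon=0` Leon would keep buying the first flavor that tasted acceptable and never learn that another is better; with `epsilon=1` he would choose at random forever and never benefit from what he has learned.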

Back to Nature

Looking at the contexts in which researchers at the Max Planck Institute for Biological Cybernetics are studying reinforcement learning today, one might say that the idea has come full circle: its precise formalism has proven useful for understanding animal and human behavior. Neuroscientists are gathering mounting evidence that the human brain employs mechanisms that strikingly resemble reinforcement learning algorithms. One notable example is the neurotransmitter dopamine, which is conjectured to convey an update of the predicted reward to different brain areas. Moreover, researchers use the framework of reinforcement learning to make sense of how animals and humans predict and control their environment, learn, and make decisions.
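What an "update of the predicted reward" might look like can be illustrated with temporal-difference learning, a standard reinforcement learning method in which each prediction is corrected by the difference between what was expected and what actually happened. The sketch below is an illustration under assumptions, not a model of the brain and not a method described in the article: the states `cue`, `juice`, and `end`, the learning rate, and the discount factor are hypothetical.

```python
def td_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update and return the prediction error."""
    # Prediction error: what happened (reward plus discounted next prediction)
    # minus what was predicted for the current state.
    prediction_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * prediction_error
    return prediction_error


# Hypothetical experiment: a cue is always followed by a juice reward.
values = {"cue": 0.0, "juice": 0.0, "end": 0.0}
errors_at_reward = []
for _ in range(200):
    td_update(values, "cue", 0.0, "juice")  # cue appears, no reward yet
    errors_at_reward.append(td_update(values, "juice", 1.0, "end"))

print(errors_at_reward[0], errors_at_reward[-1], values["cue"])
```

In this toy setting the prediction error at the moment of reward starts at 1.0 and shrinks toward zero as the reward becomes fully predicted, while the value of the cue rises because it reliably announces the reward.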

In this sense, reinforcement learning is a paragon of biological cybernetics: cribbed from Nature, it inspired seminal inventions – and once its abstract principles were distilled, it in turn became a tool for understanding Nature.