Animals and artificial agents face uncertainty about their environments. This uncertainty can stem from ignorance, forgetting, or unsignalled changes, making exploration critical. Appropriately trading exploration off against exploitation is notoriously difficult, and no tractable general solution exists for anything but the simplest problems (Gittins and Jones, 1979). Humans and other animals are nevertheless capable of efficient exploration, sometimes performing near-optimally (Wilson et al., 2014). Despite the wealth of experimental evidence, little is known about the computational mechanisms that generate exploratory choices in the brain.
A venerable idea from reinforcement learning is that exploratory choices can be planned offline (Sutton, 1991), which in animals could happen during periods of quiet wakefulness and sleep. One promising candidate for supporting such offline planning is hippocampal replay. Indeed, a recent theory (Mattar and Daw, 2018) proposed that hippocampal replay implements an optimised scheme for scheduling planning computations in the brain. This idea has been successful in explaining a wide range of experimental data on replay prioritisation in humans and other animals.
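To make the prioritisation principle behind this theory concrete, the sketch below computes the expected value of backup, EVB(s, a) = Gain(s, a) × Need(s), which Mattar and Daw propose replay should maximise. This is a minimal tabular illustration under our own assumptions: the greedy-policy form of the gain, and the names Q, R, T (action-conditioned transition model) and T_pi (policy transition matrix) are illustrative, not the authors' implementation.

```python
import numpy as np

def successor_need(T_pi, s_curr, gamma=0.9):
    """Need term: expected discounted future occupancy of every state when
    starting from the agent's current state (a row of the successor matrix)."""
    n = T_pi.shape[0]
    sr = np.linalg.inv(np.eye(n) - gamma * T_pi)
    return sr[s_curr]

def gain_of_update(Q, s, a, q_new):
    """Gain term: improvement in the value of the greedy choice at s if
    Q[s, a] were updated to the Bellman target q_new."""
    Q_upd = Q.copy()
    Q_upd[s, a] = q_new
    return Q_upd[s, Q_upd[s].argmax()] - Q_upd[s, Q[s].argmax()]

def next_backup(Q, T, R, T_pi, s_curr, gamma=0.9):
    """Pick the single (state, action) backup with the highest expected
    value of backup, EVB(s, a) = Gain(s, a) * Need(s)."""
    need = successor_need(T_pi, s_curr, gamma)
    best, best_evb = None, -np.inf
    for s in range(Q.shape[0]):
        for a in range(Q.shape[1]):
            # One-step Bellman target for (s, a) under the agent's model.
            q_new = R[s, a] + gamma * T[s, a] @ Q.max(axis=1)
            evb = gain_of_update(Q, s, a, q_new) * need[s]
            if evb > best_evb:
                best, best_evb = (s, a, q_new), evb
    return best
```

In the theory, the highest-EVB backup is executed and the scheduling repeats, so a loop of this kind would run once per replayed action.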
Despite its success, the theory makes the simplifying assumption that the environment with which the agent interacts is fully known. As such, the patterns of replay it predicts reflect pure exploitation of the agent's assumed knowledge of a given task. In our work (Antonov and Dayan, 2023), we extend the theory to the case of partial observability by explicitly handling the uncertainty the agent has about its environment. This allows us to examine how replay prioritisation should be affected by uncertainty and subjective beliefs about a task (Fig. 1). We generate testable predictions for future studies, suggesting that replay may play a role in guiding directed exploration.
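As one illustration of how subjective uncertainty could enter replay-driven value updates, the sketch below averages the one-step Bellman target over candidate world configurations (e.g. barrier present versus absent, as in the belief inset of Fig. 1A), weighted by the agent's current belief. This marginalisation, and the names models, belief, T_barrier_present and T_barrier_absent, are our own hypothetical simplification for exposition, not necessarily the formulation used in Antonov and Dayan (2023).

```python
import numpy as np

def belief_weighted_target(Q, R, models, belief, s, a, gamma=0.9):
    """One-step Bellman target for (s, a), averaged over candidate world
    configurations (e.g. barrier present vs. absent), weighted by the
    agent's current belief over those configurations."""
    targets = [R[s, a] + gamma * T[s, a] @ Q.max(axis=1) for T in models]
    return float(np.dot(belief, targets))

# Hypothetical usage: two candidate transition models, with the barrier
# believed present with probability 0.4.
# models = [T_barrier_present, T_barrier_absent]
# belief = np.array([0.4, 0.6])
# q_new = belief_weighted_target(Q, R, models, belief, s=0, a=1)
```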
Figure 1: Exploratory replay leads to online discoveries. A) Initial state of knowledge of the agent (green dot) whilst situated at the start state. The arrows at each state show the available actions; red colour intensity shows the model-free Q-value of each action. States are coloured in purple according to their state values (maximal Q-value). The greedy policy is shown by arrows with white outlines. The thick blue lines indicate potential barriers. The inset next to the top-most barrier shows the agent's belief about the presence of that barrier; for the bottom barrier, the agent was certain of its absence. B) Exploratory replay resulting from the agent's uncertain state of knowledge in A). The numbers give the order in which the actions were replayed; colour intensity shows the size of the change to the model-free Q-values engendered by each replay update. C) The updated policy occasioned by the replay updates in B). Note how the greedy policy with respect to the updated action values now indicates that exploration is worthwhile.