Mood-Inspired Meta-Learning for Fast Adaptation in Non-stationary Environments with Temporal Autocorrelation
Yannick Streicher, Peter Dayan
Based on ideas from Eldar et al. (2016), we lay the theoretical groundwork for exploring the role of mood in humans, particularly how it may accelerate learning in environments with temporally correlated changes in reward. To investigate this, we focus on roving multi-armed bandits as a general model of such dynamic scenarios. We derive optimal behavior under the assumption of full knowledge of the problem's dynamics, and describe a learning agent that adjusts to the temporal structure autonomously. We assess the learning agent's ability to match theoretically optimal behavior and examine its relationship to established models of mood.
We consider a variant of the multi-armed bandit problem, a decision-making dilemma involving exploration and exploitation. In particular, we introduce an additional hidden environmental factor that influences the general quality of every option and that drifts over time in a temporally correlated manner, altering how an optimal decision-maker should explore. We solve the problem as a partially observable Markov decision process and use model-based planning. Moreover, to explore how an agent can act optimally without prior knowledge of the hidden factor's dynamics, we build on meta-gradient reinforcement learning (Xu et al., 2018), a method we have previously applied successfully to optimize exploration in a stationary bandit task. In meta-gradient reinforcement learning, alongside optimizing the agent's policy on each round of training, we optimize the way the agent learns from experience with respect to a global objective.
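To make the setup concrete, the sketch below implements one possible instantiation of such a roving bandit, together with a minimal meta-gradient learner in the spirit of Xu et al. (2018). The AR(1) dynamics of the hidden factor, the sigmoid parametrization of the learned step size, and the REINFORCE-style validation objective are illustrative assumptions of the sketch, not details of our actual model.

import numpy as np

class RovingBandit:
    """Bandit whose arm qualities share a hidden factor that drifts with
    temporal autocorrelation (AR(1) dynamics assumed for illustration)."""

    def __init__(self, base_means, rho=0.95, drive_sd=0.3, noise_sd=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.base_means = np.asarray(base_means, dtype=float)
        self.rho = rho            # temporal autocorrelation of the hidden factor
        self.drive_sd = drive_sd  # innovation noise of the hidden factor
        self.noise_sd = noise_sd  # observation noise on rewards
        self.hidden = 0.0         # latent environmental quality, unobserved by the agent

    def step(self, arm):
        # Good and bad periods are temporally correlated rather than i.i.d.
        self.hidden = self.rho * self.hidden + self.drive_sd * self.rng.normal()
        return self.base_means[arm] + self.hidden + self.noise_sd * self.rng.normal()

class MetaGradientAgent:
    """Inner loop: value updates with step size alpha = sigmoid(eta).
    Outer loop: eta is adjusted so that subsequent ("validation") experience
    improves, by chaining gradients through the inner updates."""

    def __init__(self, n_arms, eta=0.0, meta_lr=0.01, temp=1.0):
        self.q = np.zeros(n_arms)   # value estimates (inner parameters)
        self.z = np.zeros(n_arms)   # z[a] = dq[a]/d(eta), accumulated across inner updates
        self.eta = eta              # meta-parameter controlling the step size
        self.meta_lr = meta_lr
        self.temp = temp

    def policy(self):
        logits = self.q / self.temp
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def act(self, rng):
        return rng.choice(len(self.q), p=self.policy())

    def inner_update(self, arm, reward):
        alpha = 1.0 / (1.0 + np.exp(-self.eta))
        delta = reward - self.q[arm]
        # Derivative of q[arm] + alpha * delta w.r.t. eta, keeping the earlier dependence on eta.
        self.z[arm] = (1.0 - alpha) * self.z[arm] + alpha * (1.0 - alpha) * delta
        self.q[arm] += alpha * delta

    def outer_update(self, arm, reward):
        # Validation objective: REINFORCE-style surrogate log pi(arm) * reward,
        # differentiated with respect to eta through the stored trace z.
        pi = self.policy()
        dlogpi_dq = -pi / self.temp
        dlogpi_dq[arm] += 1.0 / self.temp
        self.eta += self.meta_lr * reward * float(np.dot(dlogpi_dq, self.z))

# Alternate inner (training) and outer (validation) steps on the roving bandit.
env = RovingBandit(base_means=[0.0, 0.5, 1.0])
agent = MetaGradientAgent(n_arms=3)
for t in range(5000):
    arm = agent.act(env.rng)
    agent.inner_update(arm, env.step(arm))
    arm = agent.act(env.rng)
    agent.outer_update(arm, env.step(arm))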
Existing models of mood, such as that of Eldar et al. (2016), fit well into this framework: longer-term mood influences how humans perceive rewards, suggesting that mood could form part of a subjective reward function. The meta-gradient agent has the potential to uncover the objectives that mood serves by learning an optimal subjective reward function in the context of this problem.
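For concreteness, one way such a subjective reward function is often formalized is sketched below: mood integrates recent reward prediction errors and additively biases the reward that drives learning. The additive bias, the exponential mood update, and all parameter values are illustrative assumptions rather than the specific model of Eldar et al. (2016).

import numpy as np

def mood_biased_learning(rewards, alpha=0.1, eta_mood=0.1, bias=0.5):
    """Sketch of mood as a slowly varying bias on perceived reward:
    mood tracks recent prediction errors, and the biased (subjective)
    reward in turn drives value learning, closing the loop."""
    expectation, mood = 0.0, 0.0
    perceived = []
    for r in rewards:
        subjective_r = r + bias * mood          # mood colours the perceived outcome
        delta = subjective_r - expectation      # prediction error on the subjective reward
        expectation += alpha * delta            # value learning uses the subjective signal
        mood += eta_mood * (delta - mood)       # mood integrates recent prediction errors
        perceived.append(subjective_r)
    return np.array(perceived)

In the meta-gradient framing, parameters such as bias and eta_mood would not be hand-set as above, but could themselves be adapted with respect to the global objective.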
We establish baselines for this particular roving bandit problem and use them to gain insight into the workings and effectiveness of meta-gradient reinforcement learning. By establishing a reference against which current models of mood can be compared, we aim to contribute to the discussion surrounding such models.