Habits of Mind: Reusing Action Sequences for Efficient Planning
When we practice sequences of actions, their execution becomes more fluent and precise. Here, we consider the possibility that practiced action sequences can also be used to make planning faster and more accurate, by focusing expansion of the search tree on paths that have been frequently used in the past, and by reducing deep planning problems to shallow ones via multi-step jumps in the tree. To capture such sequences, we use a flexible Bayesian action chunking mechanism that finds and exploits statistically reliable structure at different scales. This gives rise to shorter or longer routines that can be embedded into a Monte Carlo tree search planner. We show the benefits of this scheme using a physical construction task patterned after tangrams.
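To make the chunking idea concrete, the following is a minimal, hypothetical sketch rather than the paper's implementation: a smoothed n-gram sequence model over actions stands in for the Bayesian chunking mechanism, and a greedy routine extends an action with successors whose predictive probability exceeds a threshold. All names (SequenceModel, propose_chunk, order, alpha, threshold, max_len) are illustrative assumptions.

```python
from collections import defaultdict

class SequenceModel:
    """Hypothetical smoothed n-gram model over discrete actions; a
    simple stand-in for the paper's Bayesian chunking mechanism."""

    def __init__(self, order=2, alpha=1.0, n_actions=8):
        self.order = order          # length of the conditioning context
        self.alpha = alpha          # pseudo-count for smoothing
        self.n_actions = n_actions
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, trace):
        """Accumulate context -> next-action counts from an executed trace."""
        for t in range(len(trace)):
            context = tuple(trace[max(0, t - self.order):t])
            self.counts[context][trace[t]] += 1.0

    def prob(self, context, action):
        """Smoothed predictive probability p(action | context)."""
        context = tuple(context[-self.order:])
        row = self.counts[context]
        total = sum(row.values()) + self.alpha * self.n_actions
        return (row[action] + self.alpha) / total

    def propose_chunk(self, context, action, threshold=0.9, max_len=4):
        """Greedily extend `action` with successors that are reliably
        predicted given the context; returns a multi-step chunk."""
        chunk = [action]
        while len(chunk) < max_len:
            ctx = list(context) + chunk
            nxt = max(range(self.n_actions), key=lambda a: self.prob(ctx, a))
            if self.prob(ctx, nxt) < threshold:
                break
            chunk.append(nxt)
        return chunk

# Toy usage: after practicing the cycle 0-1-2-3, the model chunks it.
model = SequenceModel(order=2, alpha=1.0, n_actions=4)
model.update([0, 1, 2, 3] * 3)
print(model.propose_chunk(context=[3, 0], action=1, threshold=0.5))
# -> [1, 2, 3, 0]
```

With stronger smoothing (larger alpha) or a higher threshold, the model demands more evidence before committing to a chunk, which is one way statistically reliable structure at different scales could be traded off.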
Figure 1: a Schematic of the MCTS-with-HABITS model. The planner builds a search tree with nodes representing states and edges representing different possible actions, marked by colors (left box). The planner traverses the tree by choosing actions that ultimately lead to more simulated wins and that are more predictable by the sequence model (upper right box). At the leaf state (bold node), several potential actions (dashed edges) are considered that lead to states (dashed nodes) that have not yet been added to the tree. The red and the green actions are primitive actions whose win values (pie charts) are the same. Yet, the green action has a higher likelihood under the sequence model when conditioned on the action trace from the root to the leaf node. Therefore, the green action will be preferred over the red one. The green action is also predictably followed, given the context of the action trace, by the blue action. Therefore, the chunk generator (lower right box) appends the blue action to the green one, extending the available action set with the green-blue chunk. If the planner selects the chunk, it jumps over the state marked by the grey node, without considering the actions available from that state (also marked in grey). This effective pruning results in a different value estimate for the green-blue chunk than for the primitive green action; in this example, the chunk will be preferred by the tree policy. b Performance on baseline (random) silhouettes decreased to similar degrees for all model variants under the node budget restriction (top). On test silhouettes where action sequences were reusable, the full MCTS-with-HABITS model performed best under the node budget restriction (bottom).
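The caption does not spell out the exact selection rule; one plausible reading is a PUCT-style combination of simulated win value and sequence-model likelihood, sketched below. This is a hypothetical illustration: chunk_prior, select_action, c_puct, and the node attributes (children, visits, value) are our illustrative names, chunks are represented as tuples of primitive actions (length one for primitives), and the SequenceModel from the sketch above is reused.

```python
import math

def chunk_prior(seq_model, trace, chunk):
    # Likelihood of a (possibly multi-step) chunk under the sequence
    # model: product of step-wise conditionals, each conditioned on the
    # root-to-leaf action trace extended by the chunk's earlier steps.
    p, ctx = 1.0, list(trace)
    for a in chunk:
        p *= seq_model.prob(ctx, a)
        ctx.append(a)
    return p

def select_action(node, seq_model, trace, c_puct=1.0):
    # PUCT-style tree policy (assumed, not confirmed by the caption):
    # prefer edges with a high mean simulated win value and, via the
    # exploration bonus, a high likelihood under the sequence model.
    # `node.children` maps chunks (tuples of primitive actions) to
    # child nodes carrying `visits` and `value` statistics.
    total = sum(child.visits for child in node.children.values())

    def score(chunk, child):
        bonus = (c_puct * chunk_prior(seq_model, trace, chunk)
                 * math.sqrt(total) / (1 + child.visits))
        return child.value + bonus

    return max(node.children.items(), key=lambda kv: score(*kv))[0]
```

Note that under this reading, the child node reached by a chunk sits at the chunk's endpoint, so its backed-up value reflects rollouts beyond the skipped intermediate state; this is how the green-blue chunk can come to be preferred over the primitive green action despite identical single-step win values.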