Despite recent advances in learning-based behavioral planning for autonomous systems, decision-making in multi-task missions remains a challenging problem. For instance, a mission might require a robot to explore an unknown environment, locate the goals, and navigate to them, even if there are obstacles along the way. Such behavioral planning problems are difficult to solve due to: a) sparse rewards, meaning a reward signal is available only once all the tasks in a mission have been satisfied, and b) limited training data (demonstrations) that may not cover all tasks encountered at run-time, e.g., demonstrations only from an environment where all doors were unlocked. As a consequence, state-of-the-art decision-making methods in such settings are limited to missions where the required tasks are well-represented in the training demonstrations and can be solved within a short (temporal) planning horizon. To overcome these limitations, we propose Adaptformer, a stochastic and adaptive planner that utilizes sequence models for sample-efficient exploration and exploitation. This framework relies on learning an energy-based heuristic, which is minimized over a sequence of high-level decisions. To generate successful action sequences for long-horizon missions, Adaptformer aims to achieve shorter sub-goals, which are in turn proposed through an intrinsic (learned) sub-goal curriculum. Through these two key components, Adaptformer allows for generalization to out-of-distribution tasks and environments, i.e., missions that were not part of the training data. Empirical results in multiple BabyAI environments demonstrate the effectiveness of our method. Notably, Adaptformer not only outperforms the state-of-the-art method (Chen et al. 2023) by ~15% on the multi-goal maze-reachability task, but also successfully adapts to multi-task missions that the state-of-the-art method could not complete, using only demonstrations from single-goal-reaching tasks for training.
Adaptformer, trained on offline data (A), incorporates a Goal Augmentation module that outputs a set of waypoints (B). Concurrently, the energy module is designed to assign lower energy to an optimal set of actions (C). Training alternates gradient updates between the generator and the discriminator (D), encouraging the policy to learn diverse representations. At inference, the system uses the learned stochastic policy to query the masked trajectory sequence (E), which is then refined through iterative energy minimization (F), framing path planning as an optimization problem.
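To make panel (D) concrete, below is a minimal, hypothetical sketch of such an alternating loop: a policy (generator) proposes actions and an energy model (discriminator) learns to assign low energy to demonstrated actions. The module names, shapes, and loss weights are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of alternating generator/discriminator updates (D).
# PolicyNet, EnergyNet, dimensions, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Generator: proposes action logits from a state embedding."""
    def __init__(self, state_dim=64, n_actions=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, s):
        return self.net(s)

class EnergyNet(nn.Module):
    """Discriminator: assigns a scalar energy to a (state, action) pair."""
    def __init__(self, state_dim=64, n_actions=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + n_actions, 128),
                                 nn.ReLU(), nn.Linear(128, 1))
    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1)).squeeze(-1)

policy, energy = PolicyNet(), EnergyNet()
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_E = torch.optim.Adam(energy.parameters(), lr=1e-4)

def train_step(states, expert_actions, n_actions=7):
    expert_onehot = F.one_hot(expert_actions, n_actions).float()

    # Discriminator (energy) update: demonstrated actions get low energy,
    # actions sampled from the current policy get high energy.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=policy(states)).sample()
    sampled_onehot = F.one_hot(sampled, n_actions).float()
    loss_E = energy(states, expert_onehot).mean() - energy(states, sampled_onehot).mean()
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # Generator (policy) update: prefer actions scored low by the energy model;
    # a small entropy bonus (assumed weight) keeps the policy stochastic.
    dist = torch.distributions.Categorical(logits=policy(states))
    a = dist.sample()
    a_onehot = F.one_hot(a, n_actions).float()
    loss_pi = (dist.log_prob(a) * energy(states, a_onehot).detach()).mean() \
              - 0.01 * dist.entropy().mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```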
Action plans are generated by iteratively concatenating the action sequences of the minimal-energy context trajectories. At test time, we extend the scope of the task to include multi-goal reaching problems, often in the presence of distractors, as well as other sequential sub-tasks that must be completed as part of the overall objective.
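As a rough illustration of panels (E) and (F), the sketch below queries the stochastic policy for candidate actions at each step, keeps the minimal-energy candidate, and grows the plan iteratively. The forward model `env_step_model`, the candidate count, and the horizon are assumptions for the example, and the trajectory-level selection described above is simplified to per-step selection.

```python
# Simplified sketch of inference via iterative energy minimization (E, F),
# reusing the `policy` and `energy` modules assumed above. `env_step_model`
# is a hypothetical forward model that rolls the state embedding forward;
# `start_state` is assumed to have shape (1, state_dim).
import torch
import torch.nn.functional as F

@torch.no_grad()
def plan(start_state, policy, energy, env_step_model,
         horizon=20, n_candidates=32, n_actions=7):
    state, plan_actions = start_state, []
    for _ in range(horizon):
        # Query the stochastic policy for several candidate actions (E).
        logits = policy(state.expand(n_candidates, -1))
        candidates = torch.distributions.Categorical(logits=logits).sample()
        onehot = F.one_hot(candidates, n_actions).float()
        # Keep the candidate with minimal energy (F) and extend the plan.
        scores = energy(state.expand(n_candidates, -1), onehot)
        best = candidates[scores.argmin()].item()
        plan_actions.append(best)
        state = env_step_model(state, best)  # roll the state forward
    return plan_actions
```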
Stochastic Policy. The stochastic policy captures the multi-modality in the data, as evidenced by three different successful paths while the start position, goal, and environment remain fixed.
In our experiments, we also observe that over longer iterations, shorter (near-optimal) paths emerge, as shown in the first video below.
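A small usage sketch of drawing such multi-modal plans from the stochastic policy; `plan` is the helper sketched above, and the temperature knob is an assumption added purely for illustration.

```python
# Illustrative only: sampling the stochastic policy several times for the same
# start state and goal yields distinct successful paths (multi-modality).
import torch

@torch.no_grad()
def sample_diverse_plans(start_state, policy, energy, env_step_model,
                         n_plans=3, temperature=1.5):
    def tempered(s):
        # Flatter logits -> more diverse samples from the same policy.
        return policy(s) / temperature
    return [plan(start_state, tempered, energy, env_step_model)
            for _ in range(n_plans)]
```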
No Entropy. Without the entropy lower bound, the adaptive skills remain unlearned and the agent gets stuck in a local region.
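One way to write such an entropy lower bound is as a hinge penalty on the mean policy entropy, as in the sketch below; the floor value and the penalty form are assumptions, and removing this term corresponds to the ablation above.

```python
# Assumed form of the entropy lower-bound term; H_MIN and the hinge penalty
# are illustrative choices, not the authors' exact formulation.
import torch

H_MIN = 0.5  # assumed entropy floor (nats)

def entropy_lower_bound_penalty(logits, h_min=H_MIN, weight=1.0):
    dist = torch.distributions.Categorical(logits=logits)
    entropy = dist.entropy().mean()
    # Penalize only when entropy falls below the floor, keeping the policy
    # stochastic enough to keep exploring instead of collapsing to one mode.
    return weight * torch.clamp(h_min - entropy, min=0.0)
```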
Zero-Shot Adaptation to Multi-Goal Environments. Trained on a single-goal reaching task in an open-door environment.
| Train Demonstration (single-goal reaching task) | LEAP (fails!) | Adaptformer | Adaptformer | Adaptformer |
Zero-Shot: Adaptation to door opening and navigation. Adaptformer learns to perform even from random demonstrations!
GotoSeq: Adaptation to obstacle unblocking. Adaptformer learns the adaptive skills required for task success.
| Train Demonstration | LEAP | Adaptformer | Adaptformer | Adaptformer |
Sub-goal conditioning.
Adaptformer, when conditioned on sub-goals, implicitly assigns minimal energy to the sub-goals (pick up key, open doors) required for task completion. States closer to the white region (low energy) are more likely to be transitioned into, indicating a higher probability of moving toward these preferred states. Conversely, LEAP does not pick up the sub-tasks associated with the task.
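The energy visualization above can be reproduced conceptually with a probe like the following: a conditional energy network is queried over a set of candidate states given a sub-goal embedding, and low-energy states mark the preferred transitions. The network definition and shapes are assumptions for illustration, not the trained Adaptformer model.

```python
# Hypothetical sub-goal-conditioned energy probe; names and shapes are assumed.
import torch
import torch.nn as nn

class ConditionalEnergy(nn.Module):
    """Scalar energy of a state given a sub-goal embedding."""
    def __init__(self, state_dim=64, goal_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + goal_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 1))
    def forward(self, states, subgoal):
        # states: (N, state_dim); subgoal: (1, goal_dim), broadcast over states.
        subgoal = subgoal.expand(states.shape[0], -1)
        return self.net(torch.cat([states, subgoal], dim=-1)).squeeze(-1)

@torch.no_grad()
def energy_heatmap(energy_net, grid_states, subgoal_embedding):
    """Lower values mark states the policy is more likely to transition into."""
    e = energy_net(grid_states, subgoal_embedding)
    return (e - e.min()) / (e.max() - e.min() + 1e-8)  # normalize for plotting
```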
| KeyCorridorS3R3 - "Get to purple ball" |
Additional Observations. The context length serves as a proxy for memory, retaining the states visited. Intuitively, a larger context length should benefit long-horizon planning, but the improvement was not notable, so we fix the context length at 20 for all experiments.
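For concreteness, the fixed context can be kept as a sliding window over the most recently visited states, as in the small sketch below; the buffer class is an illustrative assumption rather than part of the released code.

```python
# Sketch of a fixed-length context window: the sequence model is conditioned
# on (at most) the last 20 visited states, acting as a short memory.
from collections import deque

CONTEXT_LEN = 20  # value used in all experiments above

class ContextBuffer:
    def __init__(self, max_len=CONTEXT_LEN):
        self.buf = deque(maxlen=max_len)
    def push(self, state):
        self.buf.append(state)
    def context(self):
        # Returned in visitation order; fed to the sequence model as context.
        return list(self.buf)
```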