Hierarchical Planning with Latent World Models

New York University · Meta · Mila · Brown University
Key Contributions
  • Hierarchical MPC in a shared latent space, enabling direct subgoal transfer across levels without policies, skills, or task-specific rewards
  • Unlocks zero-shot non-greedy planning from images, achieving 70% success on real-robot pick-and-place from a single goal image (vs. 0% for flat VJEPA2-AC planner)
  • Model-agnostic planning abstraction, consistently improving diverse latent world models (VJEPA2-AC, DINO-WM, PLDM)
  • Better performance and efficiency on long-horizon tasks, with up to +44% success gains and up to 4x lower planning cost
Figure 1b

Left: Empirical coverage. Hierarchical planning improves success on non-greedy, long-horizon tasks across multiple latent world models. ΔSR denotes the absolute success-rate gain from hierarchy. Oracle subgoals use externally provided subgoal images; otherwise, planning relies only on the final goal. Right: Planning efficiency. Success rate versus normalized planning time for flat and hierarchical planners; hierarchical planning matches or exceeds success with ≈ 3× lower planning time. Highlighted HWM results prioritize planning efficiency rather than peak performance.

Figure 1

Hierarchical planning in latent space. A high-level planner optimizes macro-actions using a long-horizon world model to reach the goal; the first predicted latent state serves as a subgoal for a low-level planner, which optimizes primitive actions with a short-horizon world model. Solid borders denote ground-truth observations; others are decoder reconstructions shown for interpretability only.

Abstract

Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its potential to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a plug-in planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-and-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 3x less planning-time compute.

Hierarchical Planning

Figure 2

A high-level planner optimizes macro-actions using a long-horizon latent world model to reach the final goal embedding. The resulting first predicted latent state serves as a subgoal for low-level planning, where a short-horizon world model optimizes primitive actions to reach this subgoal.
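The two-level loop above can be sketched in a few lines. This is a toy illustration only: the linear dynamics `W`, the random-shooting optimizer, and all dimensions are stand-ins for the paper's learned predictors and planner, not the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, K = 2, 2, 5  # latent dim, primitive-action dim, primitives per macro-action

# Toy stand-in dynamics for the learned predictors (illustrative only):
# each primitive action nudges the latent state by a small linear step.
W = 0.2 * np.eye(A, D)

def low_level_step(z, a):
    """Short-horizon predictor: one primitive action advances the latent."""
    return z + a @ W

def high_level_step(z, macro):
    """Long-horizon predictor: one macro-action = K stacked primitives."""
    for a in macro.reshape(K, A):
        z = low_level_step(z, a)
    return z

def plan(step, z0, goal, act_dim, horizon, samples=256):
    """Random-shooting MPC: sample action sequences, keep the lowest-cost one."""
    best, best_cost = None, np.inf
    for _ in range(samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, act_dim))
        z = z0
        for a in seq:
            z = step(z, a)
        cost = np.linalg.norm(z - goal)
        if cost < best_cost:
            best, best_cost = seq, cost
    return best

z, goal = np.zeros(D), np.ones(D)
for _ in range(3):  # a few closed-loop MPC iterations
    # High level: optimize macro-actions toward the final goal; the first
    # predicted latent state becomes the subgoal for the low level.
    macro = plan(high_level_step, z, goal, K * A, horizon=4)[0]
    subgoal = high_level_step(z, macro)
    # Low level: optimize primitive actions toward that subgoal, then execute.
    for a in plan(low_level_step, z, subgoal, A, horizon=K):
        z = low_level_step(z, a)
print(f"distance to goal: {np.linalg.norm(goal):.2f} -> {np.linalg.norm(z - goal):.2f}")
```

Note how the high-level search is over only 4 macro-steps rather than 20 primitive steps, which is the source of the planning-cost reduction: the search space shrinks exponentially in the horizon.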

Training

Figure 3

Training latent world models at multiple timescales. We train two predictors that operate in a shared latent space but at different temporal resolutions. The low-level predictor models short-horizon dynamics conditioned on latent states and primitive actions, producing step-by-step predictions of future latents. The high-level predictor models long-horizon dynamics via latent macro-actions, which are obtained by encoding sequences of low-level actions via an action encoder. Both predictors operate causally and are trained with an $\ell_1$ latent prediction loss.
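The training setup in Figure 3 can be sketched with minimal stand-in modules. All architectures and dimensions below are assumptions for illustration (the paper's predictors are causal sequence models, not single-step linear layers); only the structure matches the description: a shared latent space, an action encoder that compresses K primitive actions into a latent macro-action, and an $\ell_1$ latent prediction loss at both timescales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, A, K, M = 32, 4, 8, 16  # latent dim, action dim, primitives per macro, macro dim
T = 24                     # trajectory length (a multiple of K)

# Illustrative stand-ins for the paper's components (architectures assumed).
low_pred = nn.Linear(D + A, D)   # short-horizon predictor
high_pred = nn.Linear(D + M, D)  # long-horizon predictor
act_enc = nn.Linear(K * A, M)    # action encoder: K primitive actions -> macro-action

z = torch.randn(T + 1, D)        # encoded observation sequence (shared latent space)
acts = torch.randn(T, A)         # primitive actions

# Low level: predict the next latent from the current latent and primitive action.
low_loss = F.l1_loss(low_pred(torch.cat([z[:-1], acts], -1)), z[1:])

# High level: predict the latent K steps ahead from a latent macro-action.
macros = act_enc(acts.reshape(T // K, K * A))
high_loss = F.l1_loss(high_pred(torch.cat([z[:-1:K], macros], -1)), z[K::K])

(low_loss + high_loss).backward()
```

Because both predictors regress targets in the same latent space, a high-level prediction can be handed directly to the low-level planner as a subgoal, with no policy or skill interface between the levels.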

Main Results

Hierarchical Planning with VJEPA2-AC: Pick-&-Place w/ Franka Arm

Franka result table

Zero-shot robotic manipulation with a dual-gripper Franka arm, including pick-and-place and drawer manipulation. The single-level planner (VJEPA2-AC) succeeds only when the task is explicitly decomposed into greedy subtasks via manually provided subgoals. In contrast, hierarchical planning enables end-to-end task execution using only a final goal image.

Hierarchical Planning with DINO-WM: Push T

Push T result Push T medium time

Left: Push-T Performance Across Task Horizons. In each trial, a random start-goal pair d timesteps apart is sampled from a validation trajectory. Right: Hierarchical planning achieves higher success while using up to 3x less test-time compute than flat planners.

Hierarchical Planning with PLDM: Maze Navigation

Maze result Maze medium time

Left: Diverse Maze Performance Across Task Horizons. Agents are evaluated zero-shot on navigation tasks in test maps with layouts unseen during training. Start and goal positions are randomly sampled at varying grid distances. Right: Hierarchical planning achieves higher success while using up to 3x less test-time compute than flat planners.
