Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its potential to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a plug-in planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-and-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 3x less planning-time compute.
A high-level planner optimizes macro actions using a long-horizon latent world model to reach the final goal embedding. The resulting first predicted latent state serves as a subgoal for low-level planning, where a short-horizon world model optimizes primitive actions to reach this subgoal.
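The two-level loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: the world models are toy additive functions (`low_step`, `high_step`), the optimizer is the cross-entropy method (the text does not say which optimizer is used), and every name and constant (`K`, `LOW_BOUND`, `cem_plan`, `hierarchical_action`) is hypothetical.

```python
import numpy as np

K = 5                       # primitive steps covered by one macro action (assumed)
LOW_BOUND = 0.1             # per-step primitive action limit (assumed)
HIGH_BOUND = K * LOW_BOUND  # a macro action spans up to K primitive steps


def low_step(z, a):
    """Toy short-horizon model: one primitive action shifts the latent."""
    return z + a


def high_step(z, m):
    """Toy long-horizon model: one macro action shifts the latent."""
    return z + m


def cem_plan(z0, z_target, step_fn, horizon, bound, rng,
             iters=5, pop=256, n_elite=32):
    """Cross-entropy method over action sequences; returns the mean plan."""
    dim = z0.shape[0]
    mean = np.zeros((horizon, dim))
    std = np.full((horizon, dim), bound)
    for _ in range(iters):
        # Sample candidate sequences and keep them inside the action bounds.
        noise = rng.standard_normal((pop, horizon, dim))
        actions = np.clip(mean + std * noise, -bound, bound)
        # Roll every candidate forward through the world model.
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            z = step_fn(z, actions[:, t])
        # Cost: L1 distance between the final predicted latent and the target.
        cost = np.abs(z - z_target).sum(axis=1)
        elite = actions[np.argsort(cost)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


def hierarchical_action(z, z_goal, rng, high_horizon=4):
    # High level: optimize macro actions toward the final goal embedding.
    macro_plan = cem_plan(z, z_goal, high_step, high_horizon, HIGH_BOUND, rng)
    # The first predicted latent state becomes the subgoal.
    subgoal = high_step(z, macro_plan[0])
    # Low level: optimize primitive actions toward that subgoal.
    primitive_plan = cem_plan(z, subgoal, low_step, K, LOW_BOUND, rng)
    return primitive_plan[0], subgoal


# Receding-horizon loop: execute one primitive action, then replan.
rng = np.random.default_rng(0)
z = np.zeros(2)
z_goal = np.array([1.0, -0.5])
init_dist = np.abs(z - z_goal).sum()
for _ in range(80):
    action, subgoal = hierarchical_action(z, z_goal, rng)
    z = low_step(z, action)  # stand-in for acting in the real environment
final_dist = np.abs(z - z_goal).sum()
```

In this sketch the low-level search only ever spans `K` steps, while the long-range reasoning is carried by a short sequence of macro actions. That is the intuition behind the reduced planning cost, though the toy dynamics say nothing about the actual numbers reported here.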
Training latent world models at multiple timescales. We train two predictors that operate in a shared latent space but at different temporal resolutions. The low-level predictor models short-horizon dynamics conditioned on latent states and primitive actions, producing step-by-step predictions of future latents. The high-level predictor models long-horizon dynamics via latent macro-actions, which are obtained by encoding sequences of low-level actions via an action encoder. Both predictors operate causally and are trained with an $\ell_1$ latent prediction loss.
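As a rough illustration of the two training objectives, the sketch below computes an $\ell_1$ latent prediction loss at both temporal resolutions for a single trajectory. The function names, the chunk length `k`, and the toy additive predictors are assumptions made for the example; the actual architectures and the exact way targets are formed are not specified in the text.

```python
import numpy as np


def l1_loss(pred, target):
    return np.abs(pred - target).mean()


def low_level_loss(latents, actions, predict_low):
    """latents: (T+1, D), actions: (T, A). One-step prediction of z_{t+1}."""
    pred = predict_low(latents[:-1], actions)
    return l1_loss(pred, latents[1:])


def high_level_loss(latents, actions, encode_actions, predict_high, k):
    """Predict z_{t+k} from z_t and a macro action encoding a_t .. a_{t+k-1}."""
    n = actions.shape[0] // k
    chunks = actions[: n * k].reshape(n, k, -1)  # (n, k, A) action chunks
    macro = encode_actions(chunks)               # (n, M) latent macro actions
    starts = latents[0 : n * k : k]              # z_0, z_k, z_2k, ...
    targets = latents[k : n * k + 1 : k]         # z_k, z_2k, z_3k, ...
    pred = predict_high(starts, macro)
    return l1_loss(pred, targets)


# Toy trajectory with additive dynamics z_{t+1} = z_t + a_t (assumed).
rng = np.random.default_rng(0)
T, D, k = 12, 3, 4
actions = rng.standard_normal((T, D))
z0 = rng.standard_normal(D)
latents = np.concatenate([z0[None, :], z0 + np.cumsum(actions, axis=0)])

# Predictors that match the toy dynamics exactly, so both losses vanish.
loss_low = low_level_loss(latents, actions, lambda z, a: z + a)
loss_high = high_level_loss(
    latents, actions,
    encode_actions=lambda c: c.sum(axis=1),  # hypothetical action encoder
    predict_high=lambda z, m: z + m,
    k=k,
)

# A predictor that ignores the action incurs a strictly positive loss.
loss_bad = low_level_loss(latents, actions, lambda z, a: z)
```

Both losses are computed in the same latent space; only the stride between input and target states and the form of the action input differ. In practice the lambdas would be replaced by learned networks and the losses minimized by gradient descent.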
Zero-shot robotic manipulation with a Franka dual-gripper, including pick-and-place and drawer manipulation. The single-level planner (VJEPA2-AC) succeeds only when the task is explicitly decomposed into greedy subtasks via manually provided subgoals. In contrast, hierarchical planning enables end-to-end task execution using only a final goal image.
Left: Push-T Performance Across Task Horizons. In each trial, a random start-goal pair $d$ timesteps apart is sampled from a validation trajectory. Right: Hierarchical planning achieves higher success while using up to 3x less test-time compute than flat planners.
Left: Diverse Maze Performance Across Task Horizons. Agents are evaluated zero-shot on navigation tasks in test maps with layouts unseen during training. Start and goal positions are randomly sampled at varying grid distances. Right: Hierarchical planning achieves higher success while using up to 3x less test-time compute than flat planners.