World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin1,2,* Rongxu Cui3,* Junjia Xu3,* Xiaojuan Jin1 Wenling Li3 Lue Fan1, Zhaoxiang Zhang1,2,

1Institute of Automation, Chinese Academy of Sciences (CASIA)  2Nanjing University   3Beihang University  

Contact: {linzefu2022, lue.fan}@ia.ac.cn

* Equal contribution. Corresponding authors.

Overview

Teaser figure summarizing the World Pilot method, benchmark gains, and real-robot tasks.
World Pilot steers a Vision-Language-Action (VLA) model with priors from a World-Action Model (WAM). VLA methods generate actions from a VLM’s encoding of the scene. World Pilot adds two WAM priors to the decision chain: Latent Steering routes a scene-evolution latent into the VLM hidden states, and Action Steering feeds a trajectory-level motion prior to the action generator. This gives the VLA an anticipated view of the scene and a motion hint alongside its semantic conditioning. World Pilot reaches state-of-the-art performance on LIBERO-Plus and on real-robot tasks.

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose.

Method

World Pilot framework diagram showing the VLM path, the World Action Model, and the two steering pathways.
World Pilot architecture. A semantic pathway encodes images and language with a VLM into hidden states. Two prior pathways from a World-Action Model enter the same decision chain, with Latent Steering routing a scene-evolution latent into the VLM hidden states and Action Steering compressing the anticipated trajectory into a prior token for the flow-matching action generator.
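The data flow in the architecture caption can be illustrated with a small NumPy sketch. It is not the authors' implementation. The page does not specify how the latent is routed or how the trajectory is compressed, so the additive gated projection in `latent_steering`, the mean-pooling in `action_prior_token`, all shapes, the random weights, and the closed-form `toy_velocity` field (standing in for a learned flow-matching network) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_steering(hidden, scene_latent, w_proj, gate=0.1):
    # Assumed mechanism: project the scene-evolution latent to the hidden
    # width and add it, scaled by a gate, to every VLM token.
    return hidden + gate * (scene_latent @ w_proj)

def action_prior_token(trajectory, w_proj):
    # Assumed compression: mean-pool the anticipated trajectory over the
    # horizon, then project it to a single token of width d_model.
    return trajectory.mean(axis=0) @ w_proj

def toy_velocity(x, t, context):
    # Hypothetical stand-in for the learned velocity network: the exact
    # velocity of a straight-line path from x toward a context-derived target.
    target = np.tanh(context.mean(axis=0)[: x.shape[1]])
    return (target - x) / (1.0 - t)

def sample_actions(context, velocity_fn, horizon, action_dim, steps=10):
    # Euler integration of the velocity field from noise (t=0) to actions (t=1).
    x = rng.standard_normal((horizon, action_dim))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt, context)
    return x

# Illustrative sizes and random tensors (all hypothetical).
num_tokens, d_model, d_latent, horizon, action_dim = 5, 16, 12, 8, 7
hidden = rng.standard_normal((num_tokens, d_model))        # VLM hidden states
scene_latent = rng.standard_normal(d_latent)               # WAM scene-evolution latent
wam_trajectory = rng.standard_normal((horizon, action_dim))  # WAM anticipated trajectory
w_latent = rng.standard_normal((d_latent, d_model)) * 0.1
w_action = rng.standard_normal((action_dim, d_model)) * 0.1

steered = latent_steering(hidden, scene_latent, w_latent)
prior_token = action_prior_token(wam_trajectory, w_action)
# The action generator conditions on the steered tokens plus the prior token.
context = np.vstack([steered, prior_token[None, :]])
actions = sample_actions(context, toy_velocity, horizon, action_dim)
print(actions.shape)  # (8, 7)
```

In this sketch the two priors enter at different points: the scene latent modifies what the generator reads from the VLM, while the prior token is appended as an extra conditioning element. Setting `gate=0.0` leaves the hidden states unchanged.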

Results

World Pilot improves zero-shot OOD robustness in both simulation and real-robot settings.

Simulation Results

Paper table: simulation results on LIBERO, LIBERO-Plus, and RoboCasa.

Real-Robot Setup

Real-robot evaluation setup and task scenes showing the platform, in-distribution scenes, and out-of-distribution scenes.
Real-robot evaluation setup and task scenes. Left: the robot platform. Middle: in-distribution scenes that match the training conditions. Right: out-of-distribution scenes under changes in geometry, deformable state, or pose.

Real-Robot Results

Real-robot success rates on four physical-transition tasks. Each task has one in-distribution (ID) setting that matches training and two out-of-distribution (OOD) variants that perturb geometry, deformable state, or pose; success is measured over 20 trials per setting. Parenthesized red values give the absolute drop from the corresponding ID setting.
Original paper table of real-robot success rates on four physical-transition tasks.
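The arithmetic behind the caption (a success rate over 20 trials per setting, and the absolute drop of each OOD variant from its ID setting) can be written out as a short sketch. The success counts below are hypothetical placeholders chosen for illustration and are not results from the paper. The function names are likewise invented here.

```python
def success_rate(successes, trials=20):
    """Success rate in percent over a fixed number of trials."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return 100.0 * successes / trials

def absolute_drop(id_rate, ood_rate):
    """Absolute drop, in percentage points, from the ID setting."""
    return id_rate - ood_rate

# Hypothetical counts for one task: NOT numbers reported by the authors.
counts = {"ID": 18, "OOD-1": 15, "OOD-2": 12}
rates = {name: success_rate(n) for name, n in counts.items()}

for name, rate in rates.items():
    if name == "ID":
        print(f"{name}: {rate:.0f}%")
    else:
        drop = absolute_drop(rates["ID"], rate)
        print(f"{name}: {rate:.0f}% (-{drop:.0f})")
```

With 20 trials per setting, each success or failure shifts the reported rate by 5 percentage points.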

Citation

@inproceedings{worldpilot2026,
  title = {World Pilot: Steering Vision-Language-Action Models with World-Action Priors},
  author = {Zefu Lin and Rongxu Cui and Junjia Xu and Xiaojuan Jin and Wenling Li and Lue Fan and Zhaoxiang Zhang},
  booktitle = {Coming Soon},
  year = {2026}
}