World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin¹,²,* Rongxu Cui³,* Junjia Xu³,* Xiaojuan Jin¹ Wenling Li³ Lue Fan¹,✉ Zhaoxiang Zhang¹,²,✉

¹Institute of Automation, Chinese Academy of Sciences (CASIA) ²Nanjing University ³Beihang University

Contact: {linzefu2022, lue.fan}@ia.ac.cn

* Equal contribution. ✉ Corresponding authors.


Demo

Add assets/WorldPilot-video-web.mp4 to preview the project video locally.

Overview

**Teaser figure: the World Pilot method, benchmark gains, and real-robot tasks.** World Pilot steers a Vision-Language-Action (VLA) model with priors from a World-Action Model (WAM). VLA methods generate actions from a vision-language model's (VLM's) encoding of the scene. World Pilot adds two WAM priors to the decision chain: Latent Steering routes a scene-evolution latent into the VLM hidden states, and Action Steering feeds a trajectory-level motion prior to the action generator. This gives the VLA an anticipated view of the scene and a motion hint alongside its semantic conditioning. World Pilot reaches state-of-the-art performance on LIBERO-Plus and on real-robot tasks.

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose.

Method

**Figure: World Pilot architecture.** The diagram shows the VLM path, the World-Action Model, and the two steering pathways. A semantic pathway encodes images and language with a VLM into hidden states. Two prior pathways from the World-Action Model enter the same decision chain: *Latent Steering* routes a scene-evolution latent into the VLM hidden states, and *Action Steering* compresses the anticipated trajectory into a prior token for the flow-matching action generator.
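As a rough illustration of how the two pathways could meet in one decision chain, the NumPy sketch below runs on random tensors. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names `latent_steering` and `action_steering`, the additive gated injection, the flatten-and-project compression, the `gate` value, and all dimensions are hypothetical. The source states only that a scene-evolution latent is routed into the VLM hidden states and that the anticipated trajectory is compressed into a prior token for the action generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_steering(hidden, scene_latent, w_proj, gate=0.1):
    """Inject a scene-evolution latent into VLM hidden states.

    Assumed form: project the latent to the model width and add it,
    scaled by a gate, to every token's hidden state.
    """
    steer = scene_latent @ w_proj      # shape (d_model,)
    return hidden + gate * steer       # broadcasts over tokens

def action_steering(trajectory, w_traj):
    """Compress an anticipated trajectory into one prior token.

    Assumed form: flatten the (horizon, action_dim) trajectory and
    apply a single linear projection to the model width.
    """
    return trajectory.reshape(-1) @ w_traj   # shape (d_model,)

# Hypothetical sizes chosen only to keep the example small.
n_tokens, d_model, d_latent = 6, 16, 8
horizon, action_dim = 10, 7

hidden = rng.normal(size=(n_tokens, d_model))        # stand-in for VLM hidden states
scene_latent = rng.normal(size=d_latent)             # stand-in for the WAM scene latent
trajectory = rng.normal(size=(horizon, action_dim))  # stand-in for the WAM trajectory

w_proj = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
w_traj = rng.normal(size=(horizon * action_dim, d_model)) / np.sqrt(horizon * action_dim)

steered = latent_steering(hidden, scene_latent, w_proj)
prior_token = action_steering(trajectory, w_traj)

# Conditioning sequence an action generator could consume:
# the steered hidden states followed by the trajectory prior token.
conditioning = np.concatenate([steered, prior_token[None, :]], axis=0)
print(conditioning.shape)
```

In this toy setup, setting `gate=0.0` returns the unsteered hidden states, which makes the contribution of the scene-evolution latent easy to ablate. The flow-matching action generator itself is left out, since the source does not describe its internals.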

Results

World Pilot improves zero-shot OOD robustness in both simulation and real-robot settings.

Simulation Results

Real-Robot Setup

**Figure: Real-robot evaluation setup and task scenes.** Left: the robot platform. Middle: in-distribution scenes that match the training conditions. Right: out-of-distribution scenes under changes in geometry, deformable state, or pose.

Real-Robot Results

**Table (from the original paper): Real-robot success rates on four physical-transition tasks.** Each task has one in-distribution (ID) setting that matches training and two out-of-distribution (OOD) variants that perturb geometry, deformable state, or pose; success is measured over 20 trials per setting. Parenthesized red values give the absolute drop from the corresponding ID setting.
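The arithmetic behind such a table can be sketched in a few lines of Python. The success counts below are made-up placeholders, not results from the paper, and the helper names `success_rate` and `drop_from_id` are hypothetical. Only the 20-trial protocol and the definition of the drop (ID rate minus OOD rate, in percentage points) come from the caption.

```python
def success_rate(successes, trials=20):
    """Return the success rate in percent for one setting."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return 100.0 * successes / trials

def drop_from_id(id_rate, ood_rate):
    """Absolute drop in percentage points relative to the ID setting."""
    return id_rate - ood_rate

# Placeholder counts for one task: illustrative only, NOT from the paper.
counts = {"ID": 18, "OOD-1": 15, "OOD-2": 12}

rates = {name: success_rate(n) for name, n in counts.items()}
drops = {
    name: drop_from_id(rates["ID"], rate)
    for name, rate in rates.items()
    if name != "ID"
}

for name, rate in rates.items():
    suffix = f" (-{drops[name]:.1f})" if name in drops else ""
    print(f"{name}: {rate:.1f}%{suffix}")
```

With 20 trials per setting, each trial changes the reported rate by 5 percentage points, so differences of a single step between methods should be read with that granularity in mind.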

Citation

@article{worldpilot2026,
  title={World Pilot: Steering Vision-Language-Action Models with World-Action Priors},
  author={Zefu Lin and Rongxu Cui and Junjia Xu and Xiaojuan Jin and Wenling Li and Lue Fan and Zhaoxiang Zhang},
  journal={arXiv preprint arXiv:2606.12403},
  year={2026}
}