Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns.
To resolve this mismatch, we propose Pose-VLA, a decoupled paradigm that splits VLA training into a pre-training phase, which extracts universal 3D spatial priors in a unified camera-centric space, and a post-training phase, which performs efficient embodiment alignment within a robot-specific action space.
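The "unified camera-centric space" means object poses are supervised in the camera frame rather than any robot's base frame, so the same labels transfer across embodiments. A minimal sketch of that change of frame, assuming the standard 4x4 homogeneous-transform convention (the variable names are illustrative, not the authors' code):

```python
import numpy as np

def to_camera_frame(T_world_cam: np.ndarray, T_world_obj: np.ndarray) -> np.ndarray:
    """Express an object pose in the camera frame: T_cam_obj = inv(T_world_cam) @ T_world_obj.

    Supervising poses in this frame removes any dependence on a particular
    robot's base frame, which is what makes the prior embodiment-agnostic.
    """
    return np.linalg.inv(T_world_cam) @ T_world_obj

# Toy example: camera translated 1 m along world x, object at the world origin.
T_world_cam = np.eye(4)
T_world_cam[0, 3] = 1.0
T_world_obj = np.eye(4)

T_cam_obj = to_camera_frame(T_world_cam, T_world_obj)
# The object sits at -1 m along the camera's x-axis.
```

Any 3D dataset with known camera extrinsics can be mapped into this shared frame, which is what allows pre-training on diverse non-robotic sources.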
Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task.
Video 1: Pose-VLA Execution. Real-world experiments showcasing robust manipulation across diverse tasks.
Figure 1: Pipeline of Pose-VLA. Pose-VLA decouples VLA training into: (1) Pre-training for extracting universal 3D spatial priors in a camera-centric space, and (2) Post-training for embodiment alignment. The VLM predicts a structured sequence $\mathcal{S} = (\tau_1, \dots, \tau_T)$ via next-token prediction, where each tuple $\tau_t = \{\mathbf{c}_t, \mathbf{b}_t, \mathbf{p}_t\}$ consists of a category $\mathbf{c}_t$, 2D box center $\mathbf{b}_t$, and camera-centric pose $\mathbf{p}_t$. To enhance spatial reasoning, auxiliary 3D geometry priors are integrated via additive fusion with RGB embeddings, analogous to positional encodings. This unified format enables seamless knowledge transfer from diverse 3D datasets to robotic domains, achieving robust alignment with minimal demonstrations.
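The additive fusion described in the caption can be sketched as follows: per-patch camera-frame coordinates (e.g. back-projected from depth) are linearly projected to the embedding dimension and added to the RGB patch embeddings, in the same way a positional encoding is added. All shapes and names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuse_rgb_with_3d_prior(rgb_embed: np.ndarray, xyz: np.ndarray,
                           w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Additively fuse a projected 3D geometry prior with RGB patch embeddings.

    rgb_embed: (num_patches, d) visual tokens from the VLM's vision encoder.
    xyz:       (num_patches, 3) camera-centric coordinates per patch
               (hypothetically back-projected from a depth map).
    w, b:      a learned linear projection mapping 3 -> d.
    """
    geom = xyz @ w + b          # (num_patches, d) geometry embedding
    return rgb_embed + geom     # additive fusion, analogous to positional encodings

# Toy usage: 4 patches, embedding dimension 8.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
xyz = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 8))
b = np.zeros(8)
fused = fuse_rgb_with_3d_prior(rgb, xyz, w, b)
```

Because the fusion is additive rather than concatenative, the token count and embedding width seen by the VLM are unchanged, so the pre-trained backbone can be reused without architectural surgery.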
Figure 2: Generalization of 3D Spatial Grounding. Pose-VLA exhibits robust generalization across various unseen settings, ranging from indoor tabletop layouts to complex robotic manipulation workspaces, providing more precise geometric localization than baseline methods.
| Manipulation Task | $\pi_0$ (Easy) | $\pi_0$ (Hard) | $\pi_{0.5}$ (Easy) | $\pi_{0.5}$ (Hard) | $\pi_0$-PaliGemma (Easy) | $\pi_0$-PaliGemma (Hard) | Pose-VLA w/o depth (Easy) | Pose-VLA w/o depth (Hard) |
|---|---|---|---|---|---|---|---|---|
| Adjust Bottle | 70% | 57% | 62% | 69% | 67% | 47% | 97% | 77% |
| Beat Block Hammer | 80% | 73% | 78% | 93% | 40% | 37% | 100% | 87% |
| Click Alarmclock | 83% | 73% | 92% | 89% | 80% | 57% | 83% | 93% |
| Dump Bin Bigbin | 100% | 90% | 89% | 97% | 73% | 77% | 97% | 97% |
| Grab Roller | 93% | 97% | 100% | 100% | 53% | 57% | 97% | 100% |
| Handover Block | 33% | 40% | 84% | 57% | 7% | 17% | 73% | 80% |
| Lift Pot | 63% | 73% | 100% | 85% | 37% | 30% | 100% | 97% |
| Move Pillbottle Pad | 73% | 67% | 85% | 61% | 10% | 20% | 90% | 87% |
| ... (50 tasks in total) ... | ||||||||
| Open Laptop | 73% | 70% | 88% | 96% | 67% | 70% | 93% | 93% |
| Pick Dual Bottles | 47% | 50% | 21% | 63% | 10% | 0% | 87% | 87% |
| Place A2b Left | 60% | 43% | 84% | 82% | 27% | 10% | 90% | 80% |
| Place Cans Plasticbox | 83% | 73% | 90% | 84% | 23% | 17% | 97% | 87% |
| Place Container Plate | 93% | 100% | 89% | 95% | 90% | 80% | 97% | 100% |
| Place Dual Shoes | 63% | 50% | 93% | 75% | 3% | 0% | 87% | 77% |
| Place Object Scale | 70% | 43% | 82% | 80% | 10% | 10% | 67% | 83% |
| Place Phone Stand | 70% | 57% | 83% | 81% | 30% | 20% | 87% | 87% |
| Place Shoe | 87% | 80% | 96% | 93% | 23% | 30% | 100% | 87% |
| Average (%) | 67.00 | 65.12 | 79.48 | 76.16 | 35.40 | 33.36 | **79.91** | **79.10** |
@misc{lin2026posevla,
title={Universal Pose Pretraining for Generalizable Vision-Language-Action Policies},
author={Haitao Lin and Hanyang Yu and Jingshun Huang and He Zhang and Yonggen Ling and Ping Tan and Xiangyang Xue and Yanwei Fu},
year={2026},
eprint={YOUR_ARXIV_ID},
archivePrefix={arXiv},
primaryClass={cs.RO}
}