Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns.
To resolve this mismatch, we propose Pose-VLA, a decoupled paradigm that splits VLA training into a pre-training phase, which extracts universal 3D spatial priors in a unified camera-centric space, and a post-training phase, which performs efficient embodiment alignment within a robot-specific action space.
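The "unified camera-centric space" means object poses are supervised in the camera frame rather than any robot's base frame, so the same labels transfer across embodiments. A minimal sketch of that change of frame, assuming the standard 4x4 homogeneous-transform convention (the variable names are illustrative, not the authors' code):

```python
import numpy as np

def to_camera_frame(T_world_cam: np.ndarray, T_world_obj: np.ndarray) -> np.ndarray:
    """Express an object pose in the camera frame: T_cam_obj = inv(T_world_cam) @ T_world_obj.

    Supervising poses in this frame removes any dependence on a particular
    robot's base frame, which is what makes the prior embodiment-agnostic.
    """
    return np.linalg.inv(T_world_cam) @ T_world_obj

# Toy example: camera translated 1 m along world x, object at the world origin.
T_world_cam = np.eye(4)
T_world_cam[0, 3] = 1.0
T_world_obj = np.eye(4)

T_cam_obj = to_camera_frame(T_world_cam, T_world_obj)
# The object sits at -1 m along the camera's x-axis.
```

Any 3D dataset with known camera extrinsics can be mapped into this shared frame, which is what allows pre-training on diverse non-robotic sources.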
Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task.
Video 1: Pose-VLA Execution. Real-world experiments showcasing robust manipulation across diverse tasks.
Figure 1: Pipeline of Pose-VLA. Pose-VLA decouples VLA training into: (1) Pre-training for extracting universal 3D spatial priors in a camera-centric space, and (2) Post-training for embodiment alignment. The VLM predicts a structured sequence $\mathcal{S} = (\tau_1, \dots, \tau_T)$ via next-token prediction, where each tuple $\tau_t = \{\mathbf{c}_t, \mathbf{b}_t, \mathbf{p}_t\}$ consists of a category $\mathbf{c}_t$, 2D box center $\mathbf{b}_t$, and camera-centric pose $\mathbf{p}_t$. To enhance spatial reasoning, auxiliary 3D geometry priors are integrated via additive fusion with RGB embeddings, analogous to positional encodings. This unified format enables seamless knowledge transfer from diverse 3D datasets to robotic domains, achieving robust alignment with minimal demonstrations.
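The additive fusion described in the caption can be sketched as follows: per-patch camera-frame coordinates (e.g. back-projected from depth) are linearly projected to the embedding dimension and added to the RGB patch embeddings, in the same way a positional encoding is added. All shapes and names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuse_rgb_with_3d_prior(rgb_embed: np.ndarray, xyz: np.ndarray,
                           w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Additively fuse a projected 3D geometry prior with RGB patch embeddings.

    rgb_embed: (num_patches, d) visual tokens from the VLM's vision encoder.
    xyz:       (num_patches, 3) camera-centric coordinates per patch
               (hypothetically back-projected from a depth map).
    w, b:      a learned linear projection mapping 3 -> d.
    """
    geom = xyz @ w + b          # (num_patches, d) geometry embedding
    return rgb_embed + geom     # additive fusion, analogous to positional encodings

# Toy usage: 4 patches, embedding dimension 8.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
xyz = rng.standard_normal((4, 3))
w = rng.standard_normal((3, 8))
b = np.zeros(8)
fused = fuse_rgb_with_3d_prior(rgb, xyz, w, b)
```

Because the fusion is additive rather than concatenative, the token count and embedding width seen by the VLM are unchanged, so the pre-trained backbone can be reused without architectural surgery.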
Figure 2: Generalization of 3D Spatial Grounding. Pose-VLA exhibits robust generalization across various unseen settings, ranging from indoor tabletop layouts to complex robotic manipulation workspaces, providing more precise geometric localization than baseline methods.
| Manipulation Task | $\pi_0$ (Easy) | $\pi_0$ (Hard) | $\pi_{0.5}$ (Easy) | $\pi_{0.5}$ (Hard) | $\pi_0$-PaliGemma (Easy) | $\pi_0$-PaliGemma (Hard) | Pose-VLA w/o depth (Easy) | Pose-VLA w/o depth (Hard) |
|---|---|---|---|---|---|---|---|---|
| Adjust Bottle | 70% | 57% | 62% | 69% | 67% | 47% | 97% | 77% |
| Beat Block Hammer | 80% | 73% | 78% | 93% | 40% | 37% | 100% | 87% |
| Click Alarmclock | 83% | 73% | 92% | 89% | 80% | 57% | 83% | 93% |
| Dump Bin Bigbin | 100% | 90% | 89% | 97% | 73% | 77% | 97% | 97% |
| Grab Roller | 93% | 97% | 100% | 100% | 53% | 57% | 97% | 100% |
| Handover Block | 33% | 40% | 84% | 57% | 7% | 17% | 73% | 80% |
| Lift Pot | 63% | 73% | 100% | 85% | 37% | 30% | 100% | 97% |
| Move Pillbottle Pad | 73% | 67% | 85% | 61% | 10% | 20% | 90% | 87% |
| ... (50 tasks in total) ... | ||||||||
| Open Laptop | 73% | 70% | 88% | 96% | 67% | 70% | 93% | 93% |
| Pick Dual Bottles | 47% | 50% | 21% | 63% | 10% | 0% | 87% | 87% |
| Place A2b Left | 60% | 43% | 84% | 82% | 27% | 10% | 90% | 80% |
| Place Cans Plasticbox | 83% | 73% | 90% | 84% | 23% | 17% | 97% | 87% |
| Place Container Plate | 93% | 100% | 89% | 95% | 90% | 80% | 97% | 100% |
| Place Dual Shoes | 63% | 50% | 93% | 75% | 3% | 0% | 87% | 77% |
| Place Object Scale | 70% | 43% | 82% | 80% | 10% | 10% | 67% | 83% |
| Place Phone Stand | 70% | 57% | 83% | 81% | 30% | 20% | 87% | 87% |
| Place Shoe | 87% | 80% | 96% | 93% | 23% | 30% | 100% | 87% |
| Average (%) | 67.00 | 65.12 | 79.48 | 76.16 | 35.40 | 33.36 | **79.91** | **79.10** |
@misc{lin2026posevla,
title={Universal Pose Pretraining for Generalizable Vision-Language-Action Policies},
author={Haitao Lin and Hanyang Yu and Jingshun Huang and He Zhang and Yonggen Ling and Ping Tan and Xiangyang Xue and Yanwei Fu},
year={2026},
eprint={YOUR_ARXIV_ID},
archivePrefix={arXiv},
primaryClass={cs.RO}
}