Universal Pose Pretraining for Generalizable
Vision-Language-Action Policies

1Tencent Robotics X   2HKUST   3Fudan University

* Equal contribution   † Work done during an internship at Tencent Robotics X

Abstract

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns.

To resolve this mismatch, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase, which extracts universal 3D spatial priors in a unified camera-centric space, and a post-training phase, which performs efficient embodiment alignment within a robot-specific action space.

Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task.

Video Demo

Video 1: Pose-VLA Execution. Real-world experiments showcasing robust manipulation across diverse tasks.

Method Overview

Pose-VLA Architecture Diagram

Figure 1: Pipeline of Pose-VLA. Pose-VLA decouples VLA training into: (1) Pre-training for extracting universal 3D spatial priors in a camera-centric space, and (2) Post-training for embodiment alignment. The VLM predicts a structured sequence $\mathcal{S} = (\tau_1, \dots, \tau_T)$ via next-token prediction, where each tuple $\tau_t = \{\mathbf{c}_t, \mathbf{b}_t, \mathbf{p}_t\}$ consists of a category $\mathbf{c}_t$, 2D box center $\mathbf{b}_t$, and camera-centric pose $\mathbf{p}_t$. To enhance spatial reasoning, auxiliary 3D geometry priors are integrated via additive fusion with RGB embeddings, analogous to positional encodings. This unified format enables seamless knowledge transfer from diverse 3D datasets to robotic domains, achieving robust alignment with minimal demonstrations.
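The two mechanisms in the caption above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: the binning granularity (`num_bins`), the 6-DoF pose normalization, and the function names are all illustrative assumptions. It shows (1) serializing one tuple $\tau_t = \{\mathbf{c}_t, \mathbf{b}_t, \mathbf{p}_t\}$ into discrete tokens for next-token prediction, and (2) the additive fusion of a 3D geometry prior with RGB patch embeddings, analogous to a positional encoding.

```python
import numpy as np

def serialize_tuple(category_id, box_center, pose, num_bins=256):
    """Discretize one tuple (c_t, b_t, p_t) into integer tokens.

    category_id: object category token c_t.
    box_center:  2D box center b_t as (u, v), normalized to [0, 1].
    pose:        camera-centric pose p_t; each value assumed
                 pre-normalized to [0, 1] (an assumption for this sketch).
    """
    quantize = lambda x: min(int(x * num_bins), num_bins - 1)
    tokens = [category_id]
    tokens += [quantize(v) for v in box_center]   # 2 tokens for b_t
    tokens += [quantize(v) for v in pose]         # one token per pose dim
    return tokens

def fuse_geometry(rgb_embed, geom_prior):
    """Additive fusion: the 3D geometry prior is simply added to the
    RGB patch embeddings, the same way a positional encoding would be."""
    assert rgb_embed.shape == geom_prior.shape
    return rgb_embed + geom_prior

# Example: one tuple with a 6-DoF pose -> 1 + 2 + 6 = 9 tokens.
tokens = serialize_tuple(category_id=3, box_center=(0.5, 0.25),
                         pose=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
fused = fuse_geometry(np.zeros((16, 64)), np.ones((16, 64)))
```

Because every prediction target lives in this shared camera-centric token format, 3D datasets from non-robotic domains can supervise the same output head, which is what enables the pre-training transfer described above.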

Experimental Results

Universal 3D Spatial Grounding

3D Spatial Grounding Visualization

Figure 2: Generalization of 3D Spatial Grounding. Pose-VLA exhibits robust generalization across various unseen settings, ranging from indoor tabletop layouts to complex robotic manipulation workspaces, providing more precise geometric localization than baseline methods.

Comparison on RoboTwin 2.0 Simulation

Table 1: Success rates (%) on RoboTwin 2.0 under the Easy and Hard settings (abridged; 50 tasks in total).

| Manipulation Task | $\pi_0$ (Easy / Hard) | $\pi_{0.5}$ (Easy / Hard) | $\pi_0$ (PaliGemma) (Easy / Hard) | Pose-VLA w/o depth (Easy / Hard) |
|---|---|---|---|---|
| Adjust Bottle | 70% / 57% | 62% / 69% | 67% / 47% | 97% / 77% |
| Beat Block Hammer | 80% / 73% | 78% / 93% | 40% / 37% | 100% / 87% |
| Click Alarmclock | 83% / 73% | 92% / 89% | 80% / 57% | 83% / 93% |
| Dump Bin Bigbin | 100% / 90% | 89% / 97% | 73% / 77% | 97% / 97% |
| Grab Roller | 93% / 97% | 100% / 100% | 53% / 57% | 97% / 100% |
| Handover Block | 33% / 40% | 84% / 57% | 7% / 17% | 73% / 80% |
| Lift Pot | 63% / 73% | 100% / 85% | 37% / 30% | 100% / 97% |
| Move Pillbottle Pad | 73% / 67% | 85% / 61% | 10% / 20% | 90% / 87% |
| ... (50 tasks in total) ... | | | | |
| Open Laptop | 73% / 70% | 88% / 96% | 67% / 70% | 93% / 93% |
| Pick Dual Bottles | 47% / 50% | 21% / 63% | 10% / 0% | 87% / 87% |
| Place A2b Left | 60% / 43% | 84% / 82% | 27% / 10% | 90% / 80% |
| Place Cans Plasticbox | 83% / 73% | 90% / 84% | 23% / 17% | 97% / 87% |
| Place Container Plate | 93% / 100% | 89% / 95% | 90% / 80% | 97% / 100% |
| Place Dual Shoes | 63% / 50% | 93% / 75% | 3% / 0% | 87% / 77% |
| Place Object Scale | 70% / 43% | 82% / 80% | 10% / 10% | 67% / 83% |
| Place Phone Stand | 70% / 57% | 83% / 81% | 30% / 20% | 87% / 87% |
| Place Shoe | 87% / 80% | 96% / 93% | 23% / 30% | 100% / 87% |
| Average (%) | 67.00 / 65.12 | 79.48 / 76.16 | 35.40 / 33.36 | 79.91 / 79.10 |

Citation

@misc{lin2026posevla,
    title={Universal Pose Pretraining for Generalizable Vision-Language-Action Policies},
    author={Haitao Lin and Hanyang Yu and Jingshun Huang and He Zhang and Yonggen Ling and Ping Tan and Xiangyang Xue and Yanwei Fu},
    year={2026},
    eprint={YOUR_ARXIV_ID},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}