NVIDIA
VLA-0 converts a VLM into a VLA by prompting it to predict actions as text
Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads.
Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored.
This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models.
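To make the idea of "actions as text" concrete, here is a minimal sketch of one possible encoding. The page does not specify the scheme, so everything here is an assumption for illustration: each action dimension is min-max scaled to an integer in [0, 1000], and the integers are joined with spaces. The names `encode_action`, `decode_action`, and `bins`, as well as the bounds, are hypothetical and not taken from the VLA-0 implementation.

```python
from typing import List, Sequence


def encode_action(action: Sequence[float], low: Sequence[float],
                  high: Sequence[float], bins: int = 1000) -> str:
    """Map a continuous action vector to a space-separated integer string.

    Assumption: per-dimension min-max scaling to integers in [0, bins].
    """
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)           # keep the value inside its bounds
        scaled = (clipped - lo) / (hi - lo)     # now in [0, 1]
        tokens.append(str(round(scaled * bins)))
    return " ".join(tokens)


def decode_action(text: str, low: Sequence[float],
                  high: Sequence[float], bins: int = 1000) -> List[float]:
    """Invert encode_action: parse the integers and rescale to the bounds."""
    values = [int(tok) for tok in text.split()]
    if len(values) != len(low):
        raise ValueError("unexpected number of action dimensions in text")
    return [lo + min(max(v, 0), bins) / bins * (hi - lo)
            for v, lo, hi in zip(values, low, high)]


if __name__ == "__main__":
    low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
    text = encode_action([0.0, -1.0, 1.0], low, high)
    print(text)
    print(decode_action(text, low, high))
```

Under this scheme the VLM only has to emit ordinary digit characters, so its vocabulary and architecture stay untouched. The round-trip error per dimension is at most half a bin, that is `(high - low) / (2 * bins)`.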
We categorize existing VLAs into three families. VLA-0 takes the simplest approach.
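Discrete-token VLAs ("Dis. Tok." in the table below). Examples: RT-2, OpenVLA
Generative action head VLAs ("Gen Head"). Examples: π₀, SmolVLA
Custom architecture VLAs ("Custom"). Examples: OpenVLA-OFT, π₀-FAST
VLA-0 ("Simple"): zero modification approach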
VLA-0 outperforms SmolVLA on real robot tasks using the SO-100 platform
Real-world demonstrations: block reorientation, apple pushing, banana and cupcake pick-and-place
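VLA-0 achieves the best performance among models without large-scale pretraining on the LIBERO benchmark. Scores are success rates (%) on each task suite; Avg. Rank is a model's mean rank across the four suites within its group (lower is better).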
| Model | Large-scale Pretrain | Type | Spatial | Object | Goal | Long | Average | Avg. Rank |
|---|---|---|---|---|---|---|---|---|
| Models without large-scale pretraining | | | | | | | | |
| Diffusion Policy | No | N/A | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | 6.5 |
| π₀-FAST (Paligemma) | No | Custom | 87.0 | 63.0 | 89.0 | 48.0 | 71.8 | 6.0 |
| SmolVLA (0.24B) | No | Gen Head | 87.0 | 93.0 | 88.0 | 63.0 | 82.8 | 5.3 |
| SmolVLA (2.25B) | No | Gen Head | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | 4.0 |
| OpenVLA-OFT | No | Custom | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 | 2.8 |
| π₀.₅-KI | No | Gen Head | 96.6 | 97.2 | 94.6 | 85.8 | 93.3 | 2.3 |
| VLA-0 (Ours) | No | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 1.0 |
| Models with large-scale pretraining (for reference) | | | | | | | | |
| Octo | Yes | Gen Head | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 8.8 |
| OpenVLA | Yes | Dis. Tok. | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 8.0 |
| π₀-FAST | Yes | Custom | 90.0 | 86.0 | 95.0 | 73.0 | 86.0 | 6.5 |
| MolmoAct | Yes | Dis. Tok. | 87.0 | 95.4 | 87.6 | 77.2 | 86.8 | 6.5 |
| GR00T-N1 | Yes | Gen Head | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4.5 |
| π₀ | Yes | Gen Head | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 3.3 |
| π₀.₅-KI | Yes | Gen Head | 98.0 | 97.8 | 95.6 | 85.8 | 94.3 | 3.0 |
| OpenVLA-OFT | Yes | Custom | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 1.5 |
| VLA-0 (Ours) | No | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 2.8 |
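
The Average and Avg. Rank columns can be recomputed from the four per-suite scores. The sketch below does this for the group without large-scale pretraining. It assumes that ties share the better rank (1 plus the number of strictly better models), an assumption that happens to reproduce the tabulated ranks for this group; the page itself does not state the tie rule. The names `NO_PRETRAIN_SCORES`, `mean_score`, and `average_ranks` are illustrative, and model names are written in ASCII (`pi0` for π₀).

```python
from typing import Dict, List

# Per-suite scores (Spatial, Object, Goal, Long) copied from the table,
# group "Models without large-scale pretraining".
NO_PRETRAIN_SCORES: Dict[str, List[float]] = {
    "Diffusion Policy":     [78.3, 92.5, 68.3, 50.5],
    "pi0-FAST (Paligemma)": [87.0, 63.0, 89.0, 48.0],
    "SmolVLA (0.24B)":      [87.0, 93.0, 88.0, 63.0],
    "SmolVLA (2.25B)":      [93.0, 94.0, 91.0, 77.0],
    "OpenVLA-OFT":          [94.3, 95.2, 91.7, 86.5],
    "pi0.5-KI":             [96.6, 97.2, 94.6, 85.8],
    "VLA-0 (Ours)":         [97.0, 97.8, 96.2, 87.6],
}


def mean_score(per_suite: List[float]) -> float:
    """Unweighted mean over the task suites (the 'Average' column)."""
    return sum(per_suite) / len(per_suite)


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean per-suite rank within one group (the 'Avg. Rank' column).

    Assumed tie rule: rank = 1 + number of models with a strictly higher score.
    """
    names = list(scores)
    n_suites = len(scores[names[0]])
    totals = {name: 0 for name in names}
    for s in range(n_suites):
        for name in names:
            better = sum(scores[other][s] > scores[name][s] for other in names)
            totals[name] += 1 + better
    return {name: totals[name] / n_suites for name in names}


if __name__ == "__main__":
    ranks = average_ranks(NO_PRETRAIN_SCORES)
    for name, per_suite in NO_PRETRAIN_SCORES.items():
        print(f"{name:22s} avg={mean_score(per_suite):.2f} rank={ranks[name]:.2f}")
```

The table reports these values rounded to one decimal place, so an unrounded rank of 5.25 appears as 5.3. Ranks are relative to the group, which is why VLA-0 has the same scores but a different Avg. Rank in the two halves of the table.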
If you find VLA-0 useful in your research, please consider citing:
@article{goyal2025vla0,
title={VLA-0: Building State-of-the-Art VLAs with Zero Modification},
author={Goyal, Ankit and Hadfield, Hugo and Yang, Xuning and Blukis, Valts and Ramos, Fabio},
journal={arXiv preprint arXiv:2510.13054},
year={2025}
}