VLA-0

Building State-of-the-Art VLAs with Zero Modification

Ankit Goyal Hugo Hadfield Xuning Yang Valts Blukis Fabio Ramos

NVIDIA

VLA-0 converts a VLM into a VLA by prompting it to predict actions as text

Achieves state-of-the-art results without any architectural changes

Summary

Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads.

Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored.

This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models.
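
To make the recipe concrete, the sketch below shows what such an inference loop could look like, assuming a generic image-plus-text-to-text VLM call. The prompt wording, action bounds, chunk size, and the helper names `query_vlm`, `build_prompt`, and `parse_actions` are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of a VLA-0-style inference loop: the VLM is prompted to emit
# the next action chunk as plain integers, which are then parsed and mapped
# back to continuous robot actions. All constants here are illustrative.
import re

import numpy as np

RESOLUTION = 1000                        # bins per action dimension (free to choose)
ACTION_DIM = 7                           # e.g. 6-DoF end-effector delta + gripper
ACTION_LOW = np.full(ACTION_DIM, -1.0)   # assumed per-dimension action bounds
ACTION_HIGH = np.full(ACTION_DIM, 1.0)


def build_prompt(instruction: str, chunk: int = 8) -> str:
    """Ask for actions as ordinary text: no new tokens, no action head."""
    return (
        f"Task: {instruction}\n"
        f"Predict the next {chunk} robot actions. Output {chunk} lines, "
        f"each containing {ACTION_DIM} integers in [0, {RESOLUTION - 1}]."
    )


def parse_actions(text: str) -> np.ndarray:
    """Parse the generated integer text back into continuous actions."""
    rows = [[int(tok) for tok in re.findall(r"\d+", line)] for line in text.splitlines()]
    rows = [r[:ACTION_DIM] for r in rows if len(r) >= ACTION_DIM]
    bins = np.clip(np.asarray(rows, dtype=np.float64), 0, RESOLUTION - 1)
    # Bin index -> continuous value inside the action bounds.
    return ACTION_LOW + bins / (RESOLUTION - 1) * (ACTION_HIGH - ACTION_LOW)


def query_vlm(image, prompt: str) -> str:
    """Placeholder for any off-the-shelf image+text -> text generation call."""
    raise NotImplementedError


def predict_actions(image, instruction: str) -> np.ndarray:
    return parse_actions(query_vlm(image, build_prompt(instruction)))
```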

Key Findings

  • ✓ Outperforms all methods trained on the same robotic data on the LIBERO benchmark
  • ✓ Outperforms methods with large-scale pretraining (π₀, π₀.₅-KI, GR00T-N1, MolmoAct)
  • ✓ Outperforms SmolVLA in real-world tasks despite no large-scale pretraining
  • ✓ Requires no architectural changes to the base VLM

How VLA-0 Compares

We categorize existing VLAs into three families. VLA-0 takes the simplest approach.

🔢 Discrete Token VLAs

Examples: RT-2, OpenVLA

  • Discretize actions into bins
  • Assign tokens from VLM vocabulary
  • ⚠️ Limited action resolution
  • ⚠️ Compromises language understanding

🧠 Generative Action Head VLAs

Examples: π₀, SmolVLA

  • Attach action generation head
  • Use diffusion or flow matching
  • ⚠️ Requires new neural network
  • ⚠️ May degrade VLM capabilities

⚙️ Custom Architecture VLAs

Examples: OpenVLA-OFT, π-FAST

  • Specialized modifications
  • Custom tokenizers
  • ⚠️ Significant changes
  • ⚠️ Complex training pipelines

VLA-0 (Ours)

Zero modification approach

  • ✓ Actions as text (integers)
  • ✓ No vocabulary changes
  • ✓ No architectural changes
  • ✓ Arbitrary action resolution (see the sketch below)
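
As a rough illustration of the last bullet, the snippet below shows one way the integer-text action representation could work: continuous actions are quantized into `resolution` bins rendered as plain digits, and the same text is later rescaled back. The bounds, resolution, and function names are assumptions for this example; raising `resolution` refines the action grid without touching the VLM's vocabulary or architecture, because the integers are ordinary text.

```python
# Sketch of an action <-> integer-text mapping (illustrative bounds and resolution).
import numpy as np


def actions_to_text(actions, low, high, resolution=1000):
    """Quantize continuous actions into integer bins and render them as text."""
    bins = np.round((np.atleast_2d(actions) - low) / (high - low) * (resolution - 1))
    bins = np.clip(bins, 0, resolution - 1).astype(int)
    return "\n".join(" ".join(str(b) for b in row) for row in bins)


def text_to_actions(text, low, high, resolution=1000):
    """Invert the mapping: parse the integers and rescale to the action bounds."""
    rows = [[int(t) for t in line.split()] for line in text.strip().splitlines()]
    return low + np.asarray(rows, dtype=np.float64) / (resolution - 1) * (high - low)


low, high = np.full(3, -1.0), np.full(3, 1.0)
action = np.array([0.1234, -0.5678, 0.9])
text = actions_to_text(action, low, high)       # "561 216 949"
recovered = text_to_actions(text, low, high)    # close to the original action
print(text, recovered)
```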

Results

Real-World Performance

VLA-0 outperforms SmolVLA on real robot tasks using the SO-100 platform

Real-world demonstrations: block reorientation, apple pushing, banana and cupcake pick-and-place

  • 4 tasks evaluated
  • +12.5% vs. SmolVLA
  • 4 Hz inference speed
  • 100 demos per task
Key Insight: VLA-0 achieves these results without any large-scale SO-100 pretraining, while SmolVLA was specifically pretrained on this dataset.

Performance on LIBERO Benchmark

VLA-0 achieves the best performance among models without large-scale pretraining

Success rates (%) on the LIBERO Spatial, Object, Goal, and Long suites; Average is the mean across suites and lower Avg. Rank is better.

| Model | Type | Spatial | Object | Goal | Long | Average | Avg. Rank |
|---|---|---|---|---|---|---|---|
| Models without large-scale pretraining | | | | | | | |
| Diffusion Policy | N/A | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 | 6.5 |
| π₀-FAST (Paligemma) | Custom | 87.0 | 63.0 | 89.0 | 48.0 | 71.8 | 6.0 |
| SmolVLA (0.24B) | Gen Head | 87.0 | 93.0 | 88.0 | 63.0 | 82.8 | 5.3 |
| SmolVLA (2.25B) | Gen Head | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 | 4.0 |
| OpenVLA-OFT | Custom | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 | 2.8 |
| π₀.₅-KI | Gen Head | 96.6 | 97.2 | 94.6 | 85.8 | 93.3 | 2.3 |
| VLA-0 (Ours) | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 1.0 |
| Models with large-scale pretraining (for reference) | | | | | | | |
| Octo | Gen Head | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 | 8.8 |
| OpenVLA | Dis. Tok. | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 8.0 |
| π₀-FAST | Custom | 90.0 | 86.0 | 95.0 | 73.0 | 86.0 | 6.5 |
| MolmoAct | Dis. Tok. | 87.0 | 95.4 | 87.6 | 77.2 | 86.8 | 6.5 |
| GR00T-N1 | Gen Head | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4.5 |
| π₀ | Gen Head | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 3.3 |
| π₀.₅-KI | Gen Head | 98.0 | 97.8 | 95.6 | 85.8 | 94.3 | 3.0 |
| OpenVLA-OFT | Custom | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 1.5 |
| VLA-0 (Ours) | Simple | 97.0 | 97.8 | 96.2 | 87.6 | 94.7 | 2.8 |

  • 94.7% average success rate: best among models without large-scale pretraining
  • #1 average rank among non-pretrained VLAs
  • #2.8 average rank overall, including models with large-scale pretraining

Citation

If you find VLA-0 useful in your research, please consider citing:

@article{goyal2025vla0,
  title={VLA-0: Building State-of-the-Art VLAs with Zero Modification},
  author={Goyal, Ankit and Hadfield, Hugo and Yang, Xuning and Blukis, Valts and Ramos, Fabio},
  journal={arXiv preprint arXiv:2510.13054},
  year={2025}
}