- [2025.05] Paper released. Check out our arXiv preprint.
- [2025.05] FineVLA-Data and RoboFine-Bench are coming soon.
- [2025.05] Code and models will be released soon. Stay tuned!
Vision-Language-Action (VLA) models are moving beyond task-level robot control toward policies that can be steered by human instructions. However, existing robot datasets pair trajectories with coarse goal-level language, leaving execution details—active arm, target object, approach direction, contact region, motion path, and final configuration—unspecified. This limits both policy learning and robotic video understanding, and the field still lacks scalable infrastructure, dedicated benchmarks, and training recipes for action-aligned fine-grained supervision.
We introduce FineVLA, a fully open-source framework that closes the loop between fine-grained data construction, robotic video understanding, scalable annotation, and steerable VLA policy learning. FineVLA includes: (1) FineVLA-Tool and FineVLA-Data, which together unify 972,247 trajectories from 10 datasets and yield 47,159 human-verified fine-grained trajectories; (2) RoboFine-Bench, a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) RoboFine-VLM, a robotics-specialized VLM annotator; and (4) FineVLA-Policy, a steerable policy trained with controlled mixtures of fine-grained and raw goal-level instructions.
FineVLA-Tool converts large-scale heterogeneous robot demonstrations into fine-grained, action-aligned instruction data through four stages: (1) aggregate 972,247 trajectories from 10 open-source datasets into a unified format; (2) canonicalize action representations and remove corrupted trajectories; (3) apply DTW-based clustering to select representative demonstrations; and (4) annotate with a ten-dimensional fine-grained schema followed by human verification.
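To make stage (3) more concrete, the following is a minimal sketch of DTW-based representative selection. It is not the FineVLA-Tool implementation: the function names `dtw_distance` and `select_medoid`, the Euclidean step cost, and the choice of a medoid as the "representative" are all assumptions made for illustration. The sketch also assumes trajectories have already been grouped (for example, by task), since the clustering procedure itself is not specified here.

```python
import math
from typing import List, Sequence

# A trajectory is a sequence of action vectors (T steps x D dims).
Trajectory = Sequence[Sequence[float]]


def dtw_distance(a: Trajectory, b: Trajectory) -> float:
    """Classic dynamic time warping with a Euclidean per-step cost (assumed)."""
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost between a[:i] and b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance a only
                cost[i][j - 1],      # advance b only
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def select_medoid(trajectories: List[Trajectory]) -> int:
    """Index of the trajectory with the smallest total DTW distance to the rest."""
    totals = [
        sum(dtw_distance(t, other) for other in trajectories)
        for t in trajectories
    ]
    return min(range(len(trajectories)), key=totals.__getitem__)


# Hypothetical 1-D action sequences: two similar motions and one outlier.
group = [
    [[0.0], [1.0], [2.0]],
    [[0.0], [0.0], [1.0], [2.0]],
    [[5.0], [6.0], [7.0]],
]
representative = select_medoid(group)
```

DTW is a natural fit here because demonstrations of the same motion can differ in speed and length; the pairwise computation is quadratic in both group size and trajectory length, so a large-scale pipeline would likely need subsampling or a faster approximation.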
FineVLA-Data is a human-verified corpus of 47,159 fine-grained action-aligned trajectories drawn from 10 source datasets, with a 10.4× increase in instruction information density.
RoboFine-Bench evaluates fine-grained robotic video understanding through complementary VQA and captioning tracks, containing 500 videos, 10,816 atomic facts across 10 dimensions, and 1,030 VQA questions.
On RoboFine-Bench, RoboFine-VLM achieves 71.0% VQA accuracy and an 83.6% captioning score, outperforming GPT-5.4 and Gemini 3.1 Pro. In RoboTwin simulation, the optimal FG:Raw = 1:1 setting reaches 86.8%/82.5% on AlohaMix-OFT Easy/Hard (+15.0/+11.1 over Raw-only). In real-world dual-arm manipulation, this setting scores 62.7/100 versus 49.9/100 for Raw-only and reduces instruction violations from 34% to 12%.
Key experimental results. Fine-grained and raw instructions are complementary, with the optimal FG:Raw ratio around 1:1 across simulation and real-world settings.
FineVLA-Tool transforms coarse goal-level instructions into detailed, step-by-step process descriptions verified by human annotators.
| Trajectories | Source Datasets | Total Steps | Annotation Dimensions |
|---|---|---|---|
| 47,159 | 10 | 220,606 | 10 |
We compare RoboFine-VLM (fine-tuned on FineVLA-Data) against leading VLMs on fine-grained robotic video captioning. The samples show how different models describe the same manipulation trajectory under the hard setting (vision-only, no task instruction).
| Model | Caption Score | Consistency | Coverage | Anti-Hallucination |
|---|---|---|---|---|
RoboFine-Bench evaluates fine-grained robotic video understanding through two tracks: Caption Evaluation (measuring consistency, coverage, and anti-hallucination against atomic facts) and VQA (discriminative question-answering across 10 dimensions). Ground-truth annotations are decomposed into atomic facts and converted into evaluation questions.
| Videos | Atomic Facts | VQA Pairs | Dimensions |
|---|---|---|---|
| 500 | 10,816 | 1,030 | 10 |
The VQA questions test fine-grained understanding across the benchmark's 10 capability dimensions.
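As a rough illustration of how a VQA track of this kind can be scored, the sketch below computes accuracy per dimension and overall. It is not the official RoboFine-Bench scorer: the record layout, the case-insensitive exact-match rule, the function name `vqa_accuracy`, and the dimension labels in the example are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (dimension, predicted_answer, gold_answer); layout is assumed.
Record = Tuple[str, str, str]


def vqa_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Return accuracy per dimension plus an 'overall' entry."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for dimension, predicted, gold in records:
        total[dimension] += 1
        # Assumed matching rule: exact match after trimming and upper-casing.
        if predicted.strip().upper() == gold.strip().upper():
            correct[dimension] += 1
    scores = {dim: correct[dim] / total[dim] for dim in total}
    n_questions = sum(total.values())
    if n_questions:
        scores["overall"] = sum(correct.values()) / n_questions
    return scores


# Hypothetical predictions for three questions in two illustrative dimensions.
example = [
    ("active_arm", "A", "A"),
    ("active_arm", "B", "A"),
    ("target_object", "c", "C"),
]
scores = vqa_accuracy(example)
```

Reporting per-dimension accuracy alongside the overall number makes it easier to see which kinds of execution detail a model misses, which a single aggregate would hide.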
FineVLA-Policy is trained with controlled mixtures of fine-grained and raw goal-level instructions. Mixed supervision (FG:Raw = 1:1) achieves the best results, indicating that fine-grained and raw instructions are complementary.
Simulation Demo
Real-world demo coming soon
| Setting | FG:Raw Ratio | Avg. Score (/100) | Instruction Violations |
|---|---|---|---|
| Raw-only | 0:1 | 49.9 | 34% |
| FG-only | 1:0 | 55.3 | 18% |
| Mixed (Best) | 1:1 | 62.7 | 12% |
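The FG:Raw ratios in the table can be read as a data-mixing knob. The sketch below shows one simple way such a mixture could be built, by assigning each episode either its fine-grained or its raw instruction. This is an assumption for illustration only: the actual FineVLA-Policy recipe may mix per batch, duplicate episodes, or use a different scheme, and the names `mix_instructions`, `fine_grained`, and `raw` are hypothetical.

```python
import random
from typing import Dict, List


def mix_instructions(
    episodes: List[Dict[str, str]],
    fg_parts: int = 1,
    raw_parts: int = 1,
    seed: int = 0,
) -> List[str]:
    """Pick one instruction per episode so that FG:Raw is about fg_parts:raw_parts."""
    if fg_parts < 0 or raw_parts < 0 or fg_parts + raw_parts == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    rng = random.Random(seed)  # seeded for a reproducible split
    order = list(range(len(episodes)))
    rng.shuffle(order)
    # The first n_fg shuffled episodes receive the fine-grained instruction.
    n_fg = round(len(episodes) * fg_parts / (fg_parts + raw_parts))
    fg_ids = set(order[:n_fg])
    return [
        ep["fine_grained"] if i in fg_ids else ep["raw"]
        for i, ep in enumerate(episodes)
    ]


# Hypothetical episodes with both instruction styles attached.
episodes = [
    {"raw": f"raw-{i}", "fine_grained": f"fg-{i}"} for i in range(10)
]
mixed = mix_instructions(episodes, fg_parts=1, raw_parts=1)
raw_only = mix_instructions(episodes, fg_parts=0, raw_parts=1)
fg_only = mix_instructions(episodes, fg_parts=1, raw_parts=0)
```

With `fg_parts=0` or `raw_parts=0` the same function reproduces the Raw-only and FG-only settings, so all three rows of the table correspond to one code path with different ratios.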
@article{hu2025finevla,
title={FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies},
author={Hu, Xintong and Huang, Xuhong and Zhang, Jinyu and Yao, Yutong and Sun, Yuchong and Wang, Qiuyue and Li, Mingsheng and Xie, Sicheng and Liu, Yitao and Chen, Junhao and Chen, Yixuan and Zheng, Yingming and Bai, Shuai and Yu, Tao},
journal={arXiv preprint},
year={2025}
}
We thank all contributors and annotators who participated in the construction and verification of FineVLA-Data and RoboFine-Bench.