FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Abstract

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding.

We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions.

Our policy experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level task success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points. Second, fine-grained and raw instructions are complementary: performance follows a consistent inverted-U trend, peaking around FG:Raw = 1:2 to 1:1. The strongest mixed setting reaches 86.8%/82.5% (Easy/Hard) in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation, compared with 49.9 for Raw-only. Third, fine-grained supervision directly improves steerable control: in real-world evaluation, the largest gains over Raw-only appear on pose (+23), color (+18), and approach direction (+18), all factors on which goal-level instructions provide no guidance.

FineVLA-Tool

FineVLA-Tool converts large-scale heterogeneous robot datasets into fine-grained, action-aligned instruction supervision. Its design addresses three practical bottlenecks in open robot data: (1) inconsistent action/state formats across datasets, (2) heavy redundancy among demonstrations of the same task, and (3) sparse, task-level instruction annotations.

Starting from 972,247 trajectories across 10 source datasets, FineVLA-Tool operates in four stages: Stage 1 converts raw trajectories into a unified LeRobot-style format and filters out invalid recordings; Stage 2 canonicalizes action and state representations across embodiments and removes corrupted or inconsistent trajectories via an action-state consistency quality gate; Stage 3 applies DTW-based similarity computation and hierarchical clustering to identify representative trajectories, reducing redundancy while preserving diverse manipulation strategies; Stage 4 decomposes selected trajectories into step-level descriptions and annotates them with a ten-dimensional fine-grained schema, followed by human verification.
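To make Stage 3 concrete, the following is a minimal sketch of DTW-based similarity followed by agglomerative clustering and medoid selection. It is not the FineVLA-Tool implementation: the source does not state which signal DTW runs over, which linkage is used, or how the cut is chosen, so the toy 2-D trajectories, the average linkage, the `threshold` parameter, and all function names here are assumptions made for illustration.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping between two trajectories.

    Each trajectory is a list of points (tuples of floats) of equal dimension.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def cluster_trajectories(trajs, threshold):
    """Average-linkage agglomerative clustering on pairwise DTW distances.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold` (an assumed stopping rule, not taken from the source).
    """
    dist = [[dtw_distance(p, q) for q in trajs] for p in trajs]
    clusters = [[i] for i in range(len(trajs))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, dist

def pick_representatives(clusters, dist):
    """Return the medoid of each cluster (smallest total distance to its members)."""
    return [min(c, key=lambda i: sum(dist[i][j] for j in c)) for c in clusters]

# Toy data: trajectories 0 and 1 follow the same path at different speeds,
# trajectory 2 moves in a different direction.
trajs = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
]
clusters, dist = cluster_trajectories(trajs, threshold=1.0)
reps = pick_representatives(clusters, dist)
print(clusters, reps)
```

In this toy case the two same-path demonstrations collapse into one cluster while the differently directed one stays separate, which mirrors the stated goal of reducing redundancy while keeping diverse strategies. The quadratic pairwise DTW would need batching or pruning at the scale of hundreds of thousands of trajectories.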

The resulting FineVLA-Data is a human-verified corpus of 47,159 representative trajectories with process-level supervision. The average instruction length increases 10.4× (from 9.3 to 96.8 words), covering 47 unique action verbs across all sources.

Pipeline of FineVLA-Tool. Stage 1: raw trajectories from 10 open-source robot datasets are converted into a unified LeRobot-style format and filtered to remove invalid videos. Stage 2: action and state representations are canonicalized across embodiments, and an action-state consistency quality gate removes corrupted or inconsistent trajectories. Stage 3: DTW-based similarity computation and clustering identify representative trajectories, reducing redundancy while preserving diverse manipulation strategies. Stage 4: selected trajectories are decomposed into step-level descriptions and annotated with a ten-dimensional fine-grained schema, followed by human verification.

FineVLA-Data

FineVLA-Tool transforms coarse goal-level instructions into detailed, step-by-step process descriptions verified by human annotators.

Trajectories: 47,159 | Source datasets: 10 | Total steps: 220,606 | Dimensions: 10

FineVLA-Data statistics. Density is the ratio of average fine-grained instruction length to average coarse instruction length, in words. Fine-grained annotations raise it by 6.1× to 47.1× depending on the source, and by 10.4× overall.

Source	Trajectories	Steps	Avg. Words (Coarse)	Avg. Words (FG)	Density ↑
BridgeData-V2	4,958	21,554	10.1	61.7	6.1×
BC-Z	1,513	5,313	5.2	51.2	9.8×
RT-1	5,232	22,023	6.8	61.4	9.1×
Galaxea	2,834	18,484	4.7	219.9	47.1×
RoboMIND-V1	4,605	20,341	8.6	72.8	8.5×
RoboMIND-V2	7,119	39,166	6.6	98.8	14.9×
RoboCOIN	8,513	43,926	16.1	122.6	7.6×
RH20T	1,387	5,560	7.9	92.1	11.7×
RDT	1,275	8,437	16.9	114.0	6.7×
DROID	9,723	35,802	8.0	90.9	11.3×
Total	47,159	220,606	9.3	96.8	10.4×
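The Density column can be re-derived from the two word-count columns. The short sketch below does so, using values copied from the table; the function name `density` is ours. Recomputing from the rounded averages shown here can differ in the last digit from the published column (for example RT-1 and Galaxea), presumably because the published ratios were computed before rounding.

```python
# (source, avg. coarse words, avg. fine-grained words), copied from the table above
rows = [
    ("BridgeData-V2", 10.1, 61.7),
    ("BC-Z", 5.2, 51.2),
    ("RT-1", 6.8, 61.4),
    ("Galaxea", 4.7, 219.9),
    ("RoboMIND-V1", 8.6, 72.8),
    ("RoboMIND-V2", 6.6, 98.8),
    ("RoboCOIN", 16.1, 122.6),
    ("RH20T", 7.9, 92.1),
    ("RDT", 16.9, 114.0),
    ("DROID", 8.0, 90.9),
    ("Total", 9.3, 96.8),
]

def density(coarse_words, fg_words):
    """Instruction density gain: fine-grained length divided by coarse length."""
    return fg_words / coarse_words

for name, coarse, fg in rows:
    print(f"{name}: {density(coarse, fg):.1f}x")
```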

RoboFine-VLM

RoboFine-VLM is fine-tuned from Qwen3.5-397B on FineVLA-Data. It serves as a scalable annotator for new trajectories, achieving 68.2% VQA accuracy, 83.2% captioning score on the easy setting, and 82.2% captioning score on the hard setting, outperforming GPT-5.4 and Gemini 3.1 Pro.

Caption Quality Comparison

We compare RoboFine-VLM (fine-tuned on FineVLA-Data) against leading VLMs on fine-grained robotic video captioning under the hard-mode setting, where models generate captions without task instruction input.

RoboFine-Bench

RoboFine-Bench evaluates fine-grained robotic video understanding with 500 videos, 11,631 human-reviewed atomic facts, and 1,030 VQA questions spanning all ten fine-grained dimensions. The benchmark includes complementary VQA and caption tracks, with all trajectories held out from both VLM fine-tuning and policy training.

Benchmark Leaderboard

Rank	Model	Overall	AA	TO	IC	AS	C&A	T&O	BM	OI	FC	F&R

Benchmark Explorer

RoboFine-Bench evaluates fine-grained robotic video understanding through two tracks: Caption Evaluation (measuring consistency, coverage, and anti-hallucination against atomic facts) and VQA (discriminative question answering across 10 dimensions). Ground-truth annotations are decomposed into atomic facts and converted into evaluation questions.
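The page names the three caption criteria but does not give their formulas. The sketch below shows one plausible way such scores could be computed from per-caption judgements against atomic facts. Every field name and every formula in it is an assumption for illustration, not the benchmark's scoring code.

```python
from dataclasses import dataclass

@dataclass
class CaptionJudgement:
    # Hypothetical judge output for one caption; field names are not from the benchmark.
    facts_total: int         # ground-truth atomic facts for the video
    facts_supported: int     # facts the caption states correctly
    facts_contradicted: int  # facts the caption states incorrectly
    claims_total: int        # atomic claims extracted from the caption
    claims_unsupported: int  # claims with no backing in the ground truth

def caption_scores(j):
    """Assumed definitions: each score lies in [0, 1], higher is better."""
    mentioned = j.facts_supported + j.facts_contradicted
    coverage = j.facts_supported / j.facts_total if j.facts_total else 0.0
    consistency = j.facts_supported / mentioned if mentioned else 0.0
    anti_hallucination = (
        (1.0 - j.claims_unsupported / j.claims_total) if j.claims_total else 0.0
    )
    return {
        "coverage": coverage,
        "consistency": consistency,
        "anti_hallucination": anti_hallucination,
    }

# Made-up counts, purely for illustration.
example = CaptionJudgement(
    facts_total=20, facts_supported=14, facts_contradicted=2,
    claims_total=18, claims_unsupported=2,
)
scores = caption_scores(example)
print(scores)
```

Separating coverage from consistency keeps a short but accurate caption distinguishable from a long caption that mentions many facts and gets some of them wrong.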

Videos: 500 | Atomic facts: 11,631 | VQA pairs: 1,030 | Dimensions: 10

Experiments & Results

Key findings.

Fine-grained supervision does not sacrifice goal-level task success. FG-only consistently outperforms Raw-only across all three RoboTwin settings, with gains from +1.4 to +8.1 points.
Raw and fine-grained instructions are complementary. Performance follows a clear inverted-U as FG ratio increases, with the best results around FG:Raw = 1:2 to 1:1. The strongest mixed setting reaches 86.8%/82.5% on AlohaMix-OFT Easy/Hard and 62.7/100 in real-world evaluation.
Fine-grained supervision is a scalable supervision axis. On the same RDT data, the OFT-vs-GR00T gap shrinks from 6.4/6.6 points under Raw-only to just 0.8/0.5 under FG-only (Easy/Hard). Its gains also grow with data scale: FG-only improves over Raw-only by only +1.4/+2.0 on RDT-OFT, but by +6.5/+4.7 on AlohaMix-OFT, showing that dense action-aligned language reduces supervision bottlenecks rather than only helping a single setup.
Fine-grained language improves factor-level steerability, especially on attributes underspecified by goal-level instructions: under FG:Raw = 1:1, pose improves from 24 to 47, color from 22 to 40, and approach from 60 to 78 relative to Raw-only.

Key experimental results. Fine-grained and raw instructions are complementary, with the optimal FG:Raw ratio around 1:2 to 1:1 across simulation and real-world settings.

FineVLA-Policy: Steerable Control Results

The results below unpack the same conclusion across both simulation and real-world control: fine-grained language improves execution compliance without hurting goal completion, and the best behavior emerges when fine-grained annotations augment rather than replace raw goal-level instructions. In practice, FG:Raw = 1:1 delivers the strongest overall trade-off between task success and steerable factor grounding.
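As a rough illustration of what an FG:Raw mixture means at training time, the sketch below draws one instruction per training sample with probability set by the ratio. This assumes per-sample mixing; the source only says the policy is trained with controlled mixtures, so the mixing could equally be done at the dataset level. The function names, the dictionary keys, and the example strings are hypothetical.

```python
import random

def fg_probability(fg_parts, raw_parts):
    """Convert an FG:Raw ratio such as 1:2 into P(fine-grained instruction)."""
    return fg_parts / (fg_parts + raw_parts)

def sample_instruction(episode, fg_parts, raw_parts, rng):
    """Pick the instruction string paired with this training sample."""
    if rng.random() < fg_probability(fg_parts, raw_parts):
        return episode["fine_grained"]
    return episode["raw"]

# Illustrative episode; both strings are made up for this example.
episode = {
    "raw": "put the block in the bowl",
    "fine_grained": "use the left arm to approach the red block from above, "
                    "grasp it at the center, and place it in the bowl",
}

rng = random.Random(0)
draws = [sample_instruction(episode, 1, 1, rng) for _ in range(1000)]
fg_share = draws.count(episode["fine_grained"]) / len(draws)
print(f"FG share at 1:1 over 1000 draws: {fg_share:.2f}")
```

Under this reading, Raw-only corresponds to the ratio 0:1 and FG-only to 1:0, so the same sampler covers every row of the tables that follow.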

Simulation Success Rate Results

FG:Raw	RDT-OFT (Easy)	RDT-OFT (Hard)	RDT-GR00T (Easy)	RDT-GR00T (Hard)	AlohaMix-OFT (Easy)	AlohaMix-OFT (Hard)
Raw-only	61.5	60.0	55.1	53.4	71.8	71.4
FG:Raw = 1:4	68.2	66.5	58.2	55.7	75.3	74.3
FG:Raw = 1:2	74.1	72.1	61.7	60.9	82.8	78.6
FG:Raw = 1:1	73.9	72.4	69.4	68.2	86.8	82.5
FG:Raw = 2:1	70.4	68.3	65.9	63.1	80.9	79.3
FG:Raw = 4:1	68.6	67.5	64.9	63.2	79.5	78.5
FG-only	62.9	62.0	62.1	61.5	78.3	76.1
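The headline numbers in the key findings can be checked directly against this table. The sketch below recomputes the FG-only gain over Raw-only, the best ratio per column, and the OFT-vs-GR00T gap on RDT data; the values are copied from the table and the variable names are ours.

```python
columns = [
    "RDT-OFT Easy", "RDT-OFT Hard",
    "RDT-GR00T Easy", "RDT-GR00T Hard",
    "AlohaMix-OFT Easy", "AlohaMix-OFT Hard",
]
# Success rates copied from the simulation table, one list per FG:Raw setting.
sim = {
    "Raw-only": [61.5, 60.0, 55.1, 53.4, 71.8, 71.4],
    "1:4":      [68.2, 66.5, 58.2, 55.7, 75.3, 74.3],
    "1:2":      [74.1, 72.1, 61.7, 60.9, 82.8, 78.6],
    "1:1":      [73.9, 72.4, 69.4, 68.2, 86.8, 82.5],
    "2:1":      [70.4, 68.3, 65.9, 63.1, 80.9, 79.3],
    "4:1":      [68.6, 67.5, 64.9, 63.2, 79.5, 78.5],
    "FG-only":  [62.9, 62.0, 62.1, 61.5, 78.3, 76.1],
}

# Finding 1: FG-only minus Raw-only, per column.
fg_gain = [round(f - r, 1) for f, r in zip(sim["FG-only"], sim["Raw-only"])]

# Finding 2: which ratio gives the highest success rate in each column.
best_ratio = {
    col: max(sim, key=lambda k: sim[k][i]) for i, col in enumerate(columns)
}

# Finding 3: OFT-vs-GR00T gap on RDT data (Easy, Hard) under each extreme.
def oft_groot_gap(setting):
    row = sim[setting]
    return [round(row[0] - row[2], 1), round(row[1] - row[3], 1)]

print(fg_gain)
print(best_ratio)
print(oft_groot_gap("Raw-only"), oft_groot_gap("FG-only"))
```

The peak sits at 1:2 for RDT-OFT Easy and at 1:1 for the other five columns, which is the inverted-U pattern described in the findings.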

Simulation Comparison: FG:Raw=1:1 vs Raw-only

Same initial conditions, same episode. Green = success, Red = failure.

Real-World Steerability Results

Columns Clean Table through Arm are in-distribution (ID) tasks; L→R† is the out-of-distribution (OOD) probe.
Supervision	Clean Table	Stack Block	Color	Pose	Approach	Rotate	Arm	L→R† (OOD)	Average (ID)	Average (All)
Raw-only	72	35	22	24	60	76	60	0	49.9	43.6
FG:Raw = 1:4	76	36	28	32	65	79	61	0	53.9	47.1
FG:Raw = 1:2	79	39	36	48	76	87	63	5	61.1	54.1
FG:Raw = 1:1	84	40	40	47	78	86	64	10	62.7	56.1
FG:Raw = 2:1	80	38	34	42	72	83	62	5	58.7	52.0
FG:Raw = 4:1	74	37	31	43	72	83	62	5	57.4	50.9
FG-only	70	35	25	41	70	80	60	0	54.4	47.6

† L→R: use the left hand to place into the right bowl, an unseen actor-target combination (OOD compositional probe).
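The two average columns follow from the per-task scores: the ID average is the mean over the seven in-distribution tasks, and the overall average also includes the OOD probe. The sketch below recomputes both, along with the per-factor gains of FG:Raw = 1:1 over Raw-only; scores are copied from the table and the names are ours.

```python
tasks = ["Clean Table", "Stack Block", "Color", "Pose",
         "Approach", "Rotate", "Arm", "L->R (OOD)"]
# Per-task scores copied from the real-world table (last entry is the OOD probe).
real = {
    "Raw-only": [72, 35, 22, 24, 60, 76, 60, 0],
    "1:4":      [76, 36, 28, 32, 65, 79, 61, 0],
    "1:2":      [79, 39, 36, 48, 76, 87, 63, 5],
    "1:1":      [84, 40, 40, 47, 78, 86, 64, 10],
    "2:1":      [80, 38, 34, 42, 72, 83, 62, 5],
    "4:1":      [74, 37, 31, 43, 72, 83, 62, 5],
    "FG-only":  [70, 35, 25, 41, 70, 80, 60, 0],
}

def averages(scores):
    """Return (ID average over the first 7 tasks, average over all 8 tasks)."""
    id_scores = scores[:7]
    return round(sum(id_scores) / len(id_scores), 1), round(sum(scores) / len(scores), 1)

avg = {setting: averages(scores) for setting, scores in real.items()}

# Per-factor gain of the 1:1 mixture over Raw-only.
gain_1to1 = {t: a - b for t, a, b in zip(tasks, real["1:1"], real["Raw-only"])}

for setting, (id_avg, all_avg) in avg.items():
    print(f"{setting}: ID {id_avg}, All {all_avg}")
print(gain_1to1)
```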

Real-World Task Demonstrations

Representative task demonstrations highlighting the fine-grained distinctions in our real-world evaluation. Each pair shares the same scene but varies in a single attribute (direction, hand, object state, color, or rotation).

BibTeX

@article{hu2026finevla,
  title={FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies},
  author={Hu, Xintong and Huang, Xuhong and Zhang, Jinyu and Yao, Yutong and Sun, Yuchong and Wang, Qiuyue and Li, Mingsheng and Xie, Sicheng and Liu, Yitao and Chen, Junhao and others},
  journal={arXiv preprint arXiv:2605.27284},
  year={2026}
}

Acknowledgements

We thank all contributors and annotators who participated in the construction and verification of FineVLA-Data and RoboFine-Bench. We also thank Haoqi Yuan, Delin Chen, Jinhui Ye, Zicheng Gong, Zhewei Kang, and Haoyuan Wu for their helpful discussions.