FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

¹XLANG Lab, The University of Hong Kong; ²Qwen Team, Alibaba Inc.
*Equal Contribution, Corresponding Author
FineVLA Overview

FineVLA builds a closed loop for action-instruction alignment, connecting fine-grained data construction, robotic video understanding, scalable annotation, and steerable VLA policy learning.

Abstract

Vision-Language-Action (VLA) models are moving beyond task-level robot control toward policies that can be steered by human instructions. However, existing robot datasets pair trajectories with coarse goal-level language, leaving execution details—active arm, target object, approach direction, contact region, motion path, and final configuration—unspecified. This limits both policy learning and robotic video understanding, and the field still lacks scalable infrastructure, dedicated benchmarks, and training recipes for action-aligned fine-grained supervision.

We introduce FineVLA, a fully open-source framework that closes the loop between fine-grained data construction, robotic video understanding, scalable annotation, and steerable VLA policy learning. FineVLA includes: (1) FineVLA-Tool and FineVLA-Data, which unify 972,247 trajectories from 10 datasets and construct 47,159 human-verified fine-grained trajectories; (2) RoboFine-Bench, a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) RoboFine-VLM, a robotics-specialized VLM annotator; and (4) FineVLA-Policy, a steerable policy trained with controlled mixtures of fine-grained and raw goal-level instructions.

News

  • [2025.05] Paper released. Check out our arXiv preprint.
  • [2025.05] FineVLA-Data and RoboFine-Bench are coming soon.
  • [2025.05] Code and models will be released soon. Stay tuned!

Method

FineVLA-Tool converts large-scale heterogeneous robot demonstrations into fine-grained, action-aligned instruction data through four stages: (1) aggregate 972,247 trajectories from 10 open-source datasets into a unified format; (2) canonicalize action representations and remove corrupted trajectories; (3) apply DTW-based clustering to select representative demonstrations; and (4) annotate with a ten-dimensional fine-grained schema followed by human verification.
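Stage (3) can be illustrated with a small sketch. The page does not specify the clustering algorithm, the trajectory features, or the selection rule, so everything below is an assumption: trajectories are treated as sequences of action vectors, dynamic time warping (DTW) uses a Euclidean step cost, and the "representative" of a group is taken to be its medoid (the member with the smallest summed DTW distance to the others). The names `dtw_distance` and `select_medoid` are hypothetical and are not part of FineVLA-Tool.

```python
import math
from typing import List, Sequence

# A trajectory is T steps, each a fixed-length action vector (assumed layout).
Trajectory = Sequence[Sequence[float]]


def dtw_distance(a: Trajectory, b: Trajectory) -> float:
    """Classic O(len(a) * len(b)) DTW with Euclidean per-step cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def select_medoid(trajectories: List[Trajectory]) -> int:
    """Index of the trajectory with the smallest summed DTW distance to the rest."""
    k = len(trajectories)
    totals = [0.0] * k
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(trajectories[i], trajectories[j])
            totals[i] += d
            totals[j] += d
    return min(range(k), key=totals.__getitem__)


# Toy 1-D example: `slow` is a time-stretched copy of `fast`, `other` differs.
fast = [(0.0,), (1.0,), (2.0,)]
slow = [(0.0,), (1.0,), (1.0,), (2.0,)]
other = [(5.0,), (5.0,), (5.0,)]

print(dtw_distance(fast, slow))            # time-stretching costs nothing under DTW
print(select_medoid([fast, slow, other]))  # index of the most central trajectory
```

DTW is a natural fit here because demonstrations of the same behavior often differ in speed; a fixed-length Euclidean comparison would penalize that, while DTW aligns the sequences first.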

FineVLA-Tool Pipeline

FineVLA-Data & RoboFine-Bench

FineVLA-Data is a human-verified corpus of 47,159 fine-grained action-aligned trajectories drawn from 10 source datasets, with a 10.4× increase in instruction information density.

RoboFine-Bench evaluates fine-grained robotic video understanding through complementary VQA and captioning tracks, containing 500 videos, 10,816 atomic facts across 10 dimensions, and 1,030 VQA questions.

Experiments & Results

On RoboFine-Bench, RoboFine-VLM achieves 71.0% VQA accuracy and 83.6% captioning score, outperforming GPT-5.4 and Gemini 3.1 Pro. In RoboTwin simulation, the optimal FG:Raw = 1:1 setting reaches 86.8%/82.5% on AlohaMix-OFT Easy/Hard (+15.0/+11.1 over Raw-only). In real-world dual-arm manipulation, this setting achieves 62.7/100 versus 49.9 for Raw-only and reduces instruction violations from 34% to 12%.
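The arithmetic behind these figures can be checked with a short sketch. It assumes the reported "+15.0/+11.1" gains are absolute percentage points, which is the usual reading but is not stated explicitly; the implied Raw-only simulation scores and the relative reduction in violations are derived values, not numbers quoted on this page.

```python
# Reported simulation results for FG:Raw = 1:1 on AlohaMix-OFT (Easy, Hard).
mixed_easy, mixed_hard = 86.8, 82.5
gain_easy, gain_hard = 15.0, 11.1  # assumed to be absolute percentage points

# Implied Raw-only simulation scores under that assumption.
raw_easy = round(mixed_easy - gain_easy, 1)
raw_hard = round(mixed_hard - gain_hard, 1)

# Real-world dual-arm results (score out of 100, violation rate in %).
mixed_score, raw_score = 62.7, 49.9
raw_violations, mixed_violations = 34, 12

score_gain = round(mixed_score - raw_score, 1)
violation_drop_points = raw_violations - mixed_violations
violation_drop_relative = round(100 * violation_drop_points / raw_violations, 1)

print(f"Implied Raw-only sim scores: {raw_easy} / {raw_hard}")
print(f"Real-world score gain: +{score_gain} points")
print(f"Violations: -{violation_drop_points} points "
      f"({violation_drop_relative}% relative reduction)")
```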

Experimental Results

Key experimental results. Fine-grained and raw instructions are complementary, with the optimal FG:Raw ratio around 1:1 across simulation and real-world settings.

Interactive Demos

FineVLA-Data: Fine-Grained Annotation Results

FineVLA-Tool transforms coarse goal-level instructions into detailed, step-by-step process descriptions verified by human annotators. Each example expands a short task instruction into fine-grained action steps.

  • Trajectories: 47,159
  • Source datasets: 10
  • Total steps: 220,606
  • Annotation dimensions: 10

[Interactive example viewer: Original Annotation alongside Human-Reviewed Fine-Grained Steps]

RoboFine-VLM: Caption Quality Comparison

We compare RoboFine-VLM (fine-tuned on FineVLA-Data) against leading VLMs on fine-grained robotic video captioning. Each sample shows how different models describe the same manipulation trajectory under the hard setting (vision-only, no task instruction).

Captioning Scores on RoboFine-Bench (Hard Setting, %)

Model | Caption Score | Consistency | Coverage | Anti-Hallucination

RoboFine-Bench: Benchmark Explorer

RoboFine-Bench evaluates fine-grained robotic video understanding through two tracks: Caption Evaluation (measuring consistency, coverage, and anti-hallucination against atomic facts) and VQA (discriminative question-answering across 10 dimensions). In each example, ground-truth annotations are decomposed into atomic facts and converted into evaluation questions.

  • Videos: 500
  • Atomic facts: 10,816
  • VQA pairs: 1,030
  • Dimensions: 10

[Interactive explorer: each example shows its task, the capability dimensions covered, a per-dimension breakdown of ground-truth atomic facts, and VQA questions that test fine-grained understanding across capability dimensions.]
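As a rough illustration of how VQA results might be broken down by capability dimension, the sketch below computes per-dimension and overall accuracy. It is a sketch under stated assumptions, not the benchmark's evaluation code: it assumes each question has a single gold answer label (for example, a multiple-choice letter), and the record layout, the dimension names `active_arm` and `approach_direction`, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (dimension, predicted_answer, gold_answer); layout is an assumption.
Record = Tuple[str, str, str]


def _is_correct(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match on the answer label."""
    return pred.strip().upper() == gold.strip().upper()


def per_dimension_accuracy(records: List[Record]) -> Dict[str, float]:
    """Accuracy for each capability dimension that appears in `records`."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for dim, pred, gold in records:
        total[dim] += 1
        correct[dim] += int(_is_correct(pred, gold))
    return {dim: correct[dim] / total[dim] for dim in total}


def overall_accuracy(records: List[Record]) -> float:
    """Micro-averaged accuracy over all questions."""
    if not records:
        raise ValueError("no records to score")
    return sum(_is_correct(p, g) for _, p, g in records) / len(records)


# Hypothetical predictions for three questions.
records = [
    ("active_arm", "A", "A"),
    ("active_arm", "B", "A"),
    ("approach_direction", "c", "C"),
]
print(per_dimension_accuracy(records))
print(overall_accuracy(records))
```

Reporting both views is useful because a single overall number can hide a dimension on which a model is consistently weak.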

FineVLA-Policy: Steerable Control Results

FineVLA-Policy is trained with controlled mixtures of fine-grained and raw goal-level instructions. Mixed supervision (FG:Raw = 1:1) achieves the best results, demonstrating that fine-grained and raw instructions are complementary.
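One simple way to realize a controlled FG:Raw mixture is to choose, for each training sample, which instruction variant the policy sees. The sketch below does this with per-sample random selection. The page does not say whether mixing is done per sample, per batch, or by building separate dataset splits, so this is only one plausible reading; the field names `fine_grained` and `raw`, the example instructions, and the function `mix_instructions` are hypothetical.

```python
import random
from typing import Dict, List


def mix_instructions(
    samples: List[Dict[str, str]],
    fg_weight: float = 1.0,
    raw_weight: float = 1.0,
    seed: int = 0,
) -> List[str]:
    """Pick the fine-grained or raw instruction per sample at an FG:Raw ratio."""
    if fg_weight < 0 or raw_weight < 0 or fg_weight + raw_weight == 0:
        raise ValueError("weights must be non-negative and not both zero")
    rng = random.Random(seed)  # seeded for reproducible mixtures
    p_fg = fg_weight / (fg_weight + raw_weight)
    return [
        s["fine_grained"] if rng.random() < p_fg else s["raw"]
        for s in samples
    ]


# Hypothetical sample with both instruction variants attached.
sample = {
    "raw": "put the cup on the shelf",
    "fine_grained": "with the left arm, grasp the cup by its handle from the side, "
                    "lift it, and place it upright on the top shelf",
}
dataset = [sample] * 1000

mixed = mix_instructions(dataset, fg_weight=1, raw_weight=1)   # FG:Raw = 1:1
raw_only = mix_instructions(dataset, fg_weight=0, raw_weight=1)  # FG:Raw = 0:1
fg_only = mix_instructions(dataset, fg_weight=1, raw_weight=0)   # FG:Raw = 1:0
print(sum(text == sample["fine_grained"] for text in mixed), "of", len(mixed))
```

The 0:1 and 1:0 settings fall out of the same function as the degenerate cases, which keeps the three configurations in the results table directly comparable.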

Simulation Demo

Real-world demo coming soon

Real-World Steerability Results

Setting      | FG:Raw Ratio | Avg. Score (/100) | Instruction Violations
Raw-only     | 0:1          | 49.9              | 34%
FG-only      | 1:0          | 55.3              | 18%
Mixed (Best) | 1:1          | 62.7              | 12%

BibTeX

@article{hu2025finevla,
  title={FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies},
  author={Hu, Xintong and Huang, Xuhong and Zhang, Jinyu and Yao, Yutong and Sun, Yuchong and Wang, Qiuyue and Li, Mingsheng and Xie, Sicheng and Liu, Yitao and Chen, Junhao and Chen, Yixuan and Zheng, Yingming and Bai, Shuai and Yu, Tao},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgements

We thank all contributors and annotators who participated in the construction and verification of FineVLA-Data and RoboFine-Bench.