BridgeVLA

Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

🏆 COLOSSEUM Challenge Champion @ CVPR 2025 GRAIL Workshop

Peiyan Li,1,2,3,† Yixiang Chen,1,3 Hongtao Wu,2,†,* Xiao Ma,2,† Xiangnan Wu,1
Yan Huang,1,3,4 Liang Wang,1,3 Tao Kong,2 Tieniu Tan,1,3,5,*
†Project Lead, *Corresponding Author
1CASIA, 2Bytedance Seed, 3UCAS, 4FiveAges, 5NJU

TL;DR:

We propose a 3D VLA framework that aligns the input and output within a shared 2D space during both pre-training and fine-tuning, enabling strong data efficiency and impressive performance in both basic and generalization settings.

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, existing methods predominantly process 2D inputs, ignoring valuable 3D information. While some recent studies suggest injecting 3D signals into the VLM for action prediction, they overlook the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs onto multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps for object grounding before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms the state-of-the-art baselines across multiple benchmarks. In RLBench, it attains a substantially higher success rate (88.2% vs. 81.4%). In COLOSSEUM, it demonstrates significantly better performance (64.0% vs. 56.7%) in challenging generalization settings. In GemBench, it is the only method that achieves a 50% average success rate across all four evaluation settings. In real-robot experiments, BridgeVLA outperforms the state-of-the-art baseline method by 32% on average, and is able to generalize robustly in multiple out-of-distribution settings, including visual disturbance and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency.

Figure 1: Overview of BridgeVLA

Method

As illustrated in Fig. 2, BridgeVLA employs a dual-phase training recipe.

2D Heatmap Pre-training

We train BridgeVLA on 2D object detection datasets. The model takes as input an image and a language description of the target object, and outputs a 2D heatmap that highlights the image regions corresponding to that object.
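
As an illustration of how such a heatmap head can be supervised, the minimal sketch below renders a normalized Gaussian target centered on the annotated object and minimizes the cross-entropy between the predicted pixel distribution and that target. The Gaussian construction, the sigma value, and the exact loss form are assumptions made for illustration rather than the precise BridgeVLA recipe.

import torch
import torch.nn.functional as F

def gaussian_target_heatmap(h, w, cx, cy, sigma=4.0):
    # Ground-truth heatmap: a normalized 2D Gaussian centered on the annotated object (cx, cy).
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # distribution over pixels

def heatmap_loss(pred_logits, target):
    # Cross-entropy between the predicted per-pixel distribution and the Gaussian target.
    # pred_logits: (B, H, W) raw scores from the heatmap head; target: (B, H, W) normalized heatmaps.
    b, h, w = pred_logits.shape
    log_probs = F.log_softmax(pred_logits.view(b, -1), dim=-1)
    return -(target.view(b, -1) * log_probs).sum(dim=-1).mean()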

3D Action Fine-tuning

The model takes as input three orthographic projection images of the 3D point cloud and a language instruction. It outputs three 2D heatmaps that highlight the position of the end-effector in the next keyframe across the three views. For the remaining action components, an MLP aggregates the image feature tokens to predict the rotation, gripper action, and collision flag of the next keyframe.
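
To make the back-projection step concrete, the minimal sketch below fuses the three predicted heatmaps into a 3D end-effector position by projecting candidate 3D points into each orthographic view and selecting the candidate whose projections accumulate the highest scores, in the spirit of RVT-style view aggregation. The workspace bounds, heatmap resolution, and view-axis convention are illustrative assumptions and may differ from BridgeVLA's actual rendering pipeline.

import numpy as np

WORKSPACE_MIN = np.array([-0.3, -0.5, 0.75])   # hypothetical workspace bounds (meters)
WORKSPACE_MAX = np.array([0.7, 0.5, 1.55])
RES = 224                                       # heatmap resolution
VIEW_AXES = [(0, 1), (0, 2), (1, 2)]            # top (x,y), front (x,z), side (y,z)

def to_pixels(points, axes):
    # Project 3D points to pixel coordinates in one orthographic view.
    p = (points - WORKSPACE_MIN) / (WORKSPACE_MAX - WORKSPACE_MIN)  # normalize to [0, 1]
    uv = (p[:, list(axes)] * (RES - 1)).round().astype(int)
    return np.clip(uv, 0, RES - 1)

def locate_from_heatmaps(heatmaps, grid_points):
    # heatmaps:    list of three (RES, RES) arrays predicted by the model.
    # grid_points: (N, 3) candidate 3D positions (e.g., a dense grid or the scene point cloud).
    score = np.zeros(len(grid_points))
    for hm, axes in zip(heatmaps, VIEW_AXES):
        uv = to_pixels(grid_points, axes)
        score += np.log(hm[uv[:, 1], uv[:, 0]] + 1e-9)  # accumulate log-scores across views
    return grid_points[score.argmax()]                  # best-supported 3D position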


Figure 2: Network architecture of BridgeVLA

Simulation Experiments

RLBench

To test our model's ability to handle complex manipulation tasks, we evaluate BridgeVLA on RLBench, a benchmark implemented in CoppeliaSim with a Franka Panda robot equipped with a parallel-jaw gripper. We select 18 tasks from RLBench, each provided with 100 expert demonstrations; every demonstration is paired with a language instruction and multiple keyframes. Models are evaluated via binary success rates over 25 trials per task, with a maximum of 25 action steps per trial. The results are shown in Tab. 1.
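
The loop below is a minimal sketch of this evaluation protocol (binary success, 25 trials per task, at most 25 action steps per trial); env and policy are hypothetical stand-ins for an RLBench task environment and the trained model, not the benchmark's actual API.

def evaluate_task(env, policy, n_trials=25, max_steps=25):
    successes = 0
    for _ in range(n_trials):
        obs = env.reset()
        success = False
        for _ in range(max_steps):
            action = policy.predict(obs)           # next-keyframe action from the model
            obs, done, success = env.step(action)  # hypothetical step interface
            if done:
                break
        successes += int(success)
    return 100.0 * successes / n_trials            # per-task success rate (%)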


Table 1: Results on RLBench.

RLBench Task Demonstrations

Stack 4 Rose Blocks

Put Chocolate Jello in Cupboard

Place 3 Cups on Cup Holder

Close Blue Jar

Sweep Dirt to Dustpan

Screw in Lime Light Bulb

Put Money in Safe

Take Steak off Grill

Stack Cups on Navy Cup

COLOSSEUM

To systematically evaluate the generalization capabilities of BridgeVLA, we further evaluate on the COLOSSEUM benchmark, an extension of RLBench. The model is trained on data from the original RLBench benchmark but evaluated in environments spanning 12 axes of perturbation. These perturbations, which are unseen during training, include changes in object texture, color, and size, as well as backgrounds, lighting, distractors, and camera poses. Specifically, our evaluation proceeds in three steps: 1) train the model on 20 tasks with the original, perturbation-free RLBench data (100 trajectories per task); 2) evaluate each task over 25 trials per perturbation; 3) compute the average success rate over all evaluated tasks for every perturbation. Besides the 12 types of perturbations, we also evaluate on the basic variations from the original RLBench (denoted as RLBench in Tab. 2) and a more challenging setting that combines all 12 types of perturbations (denoted as All Perturbations in Tab. 2).
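
The averaging in step 3 amounts to the short computation below, assuming a hypothetical results[perturbation][task] mapping that stores the 25 binary trial outcomes collected in step 2.

def per_perturbation_success(results):
    # Average the per-task success rates for each perturbation type.
    summary = {}
    for perturbation, per_task in results.items():
        task_rates = [100.0 * sum(trials) / len(trials) for trials in per_task.values()]
        summary[perturbation] = sum(task_rates) / len(task_rates)
    return summary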


Table 2: Results on COLOSSEUM.

COLOSSEUM Task Demonstrations

Scoop with Spatula

Insert onto Square Peg

Close Laptop Lid

Move Hanger

Basketball in Hoop

Reach and Drag

Straighten Rope

Turn Oven On

Hockey Hit

GemBench

To more comprehensively assess the generalization ability of our method, we evaluate BridgeVLA on GemBench, a hierarchical benchmark built on the RLBench simulator. GemBench is designed to rigorously test 3D manipulation policies across a wide range of scenarios, covering 60 tasks and 123 variations organized into four progressively challenging levels, ranging from simple novel placements to complex multi-step tasks involving novel object shapes and articulations. BridgeVLA demonstrates outstanding performance, achieving the highest average success rate across all evaluation levels. Notably, it sets new state-of-the-art results in both the L2 (novel rigid objects) and L3 (novel articulated objects) settings, where generalization to unseen object shapes and part combinations is particularly challenging. We show the results in Tab. 3.


Table 3: Results on GemBench.

GemBench Task Demonstrations

Stack Blocks

Stack Cups

Put Money in Safe

Close Laptop Lid

Close Microwave

Close Grill

Push Button

Take Shoe Out of Box

Toilet Seat Up

Real-World Experiments

We conduct real-robot experiments to further verify our model's performance. A total of 13 manipulation tasks are evaluated, covering a spectrum from simple pick-and-place to long-horizon behaviors such as opening a drawer and placing objects inside.

Evaluation Settings

To systematically evaluate the robustness and generalization of BridgeVLA, we design seven distinct settings:

Basic: Tasks are evaluated under the same environmental conditions as training. This setting serves as a sanity check to ensure the model performs well under familiar conditions.

Distractor: Visually similar but irrelevant objects are added to the workspace. These distractors share shape or color characteristics with target objects, testing the model's ability to distinguish targets amid ambiguity.

Lighting: The robot operates under significantly different illumination, such as turning off overhead lights, to test robustness to lighting changes that affect the appearance of the scene.

Background: The visual background is altered by changing tablecloths (three variants in total), assessing the model's invariance to background textures and colors.

Height: All manipulable objects are placed on a raised surface (a drawer 9.5 cm above the base level), requiring the model to adapt its control to new object elevations.

Combination: Novel combinations of known objects and skills are introduced. While both the objects (e.g., a red block and green plate) and the manipulation skill (e.g., place A in B) are present in the training data, the pairing is new (e.g., "place the red block in the green plate"), challenging the model to generalize to unseen instructions.

Category: The model is asked to manipulate entirely unseen object categories during training. Seven such objects are introduced to evaluate whether pretrained visual-linguistic knowledge enables effective zero-shot generalization.

Results

We compare the full BridgeVLA model against RVT-2, the strongest baseline in simulation, and against an ablated variant of BridgeVLA trained without our proposed 2D-heatmap pre-training. The results are shown in Fig. 4. BridgeVLA outperforms both baselines in six of the seven settings and maintains high robustness under visual disturbances (Distractor, Background).

Furthermore, it achieves a 96.8% success rate in the Basic setting even with only 3 training trajectories per task, demonstrating remarkable data efficiency. The results also validate the effectiveness of our 2D-heatmap pre-training in connecting language and vision for compositional task generalization.


Figure 4: Real-robot setup and results.

Background

Place red block in blue plate

Put RedBull can on top shelf

Put zebra in upper drawer

Lighting

Put RedBull can on bottom shelf

Press sanitizer

Put giraffe in lower drawer

Distractor

Place orange block in green plate

Put giraffe in lower drawer

Press sanitizer

Combination

Put orange block in lower drawer

Put RedBull can in green plate

Place yellow block in purple plate

Height

Press sanitizer

Put soda can on bottom shelf

Place red block in blue plate

Category

Put peach on bottom shelf

Place bottle in blue plate

Put panda in lower drawer

Failure Cases

Although our method outperforms baseline methods in the Category setting, its absolute success rate is not high. We show some failure cases below.

Place bread in green plate

Put apple on top shelf

Put peach on bottom shelf

We believe this relatively low performance is not due to BridgeVLA forgetting the knowledge gained from pre-training, as it still predicts heatmaps accurately when provided with samples from the pre-training dataset after fine-tuning (see Fig. 5). Instead, we hypothesize that the reduced performance stems from two factors: 1) the images in the pre-training dataset are captured from third-person views, which differ significantly from the projection images in our robot data; 2) the pre-training task focuses solely on object localization, whereas manipulation involves predicting keypoints that do not correspond to an object. Exploring how to fully utilize such preserved pre-training knowledge for manipulation is an interesting direction for future research.


Figure 5: Prediction on pre-training data after fine-tuning.

Citation


@misc{li2025bridgevla,
    title={BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models},
    author={Peiyan Li and Yixiang Chen and Hongtao Wu and Xiao Ma and Xiangnan Wu and Yan Huang and Liang Wang and Tao Kong and Tieniu Tan},
    year={2025},
    eprint={2506.07961},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}