We design SOLE-R1 to perform video-native temporal reasoning for goal-conditioned tasks. Given a natural-language goal and a video stream of observations, the model produces (i) a per-timestep, multi-frame CoT explanation, grounded in visual evidence, that describes what has changed since the last timestep and what remains to be done, and (ii) a dense scalar progress estimate used as a reward signal for online RL.
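The per-timestep interface can be sketched as follows. This is a minimal illustration of the reward-query loop, not the paper's API: the names `ProgressOutput`, `query_progress`, and `dense_reward` are hypothetical, and the stub model and progress-difference shaping are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProgressOutput:
    cot: str         # per-timestep explanation: what changed, what remains
    progress: float  # dense scalar progress estimate in [0, 1]

def query_progress(goal: str, frames: list) -> ProgressOutput:
    # Stub standing in for a SOLE-R1 forward pass over the frame history.
    done = min(len(frames) / 10.0, 1.0)
    return ProgressOutput(
        cot=f"{len(frames)} frames observed toward goal: '{goal}'",
        progress=done,
    )

def dense_reward(goal: str, frames: list, prev_progress: float):
    # One common shaping choice (assumed here): reward the *change* in
    # predicted progress, so the episode return telescopes to total progress.
    out = query_progress(goal, frames)
    return out.progress - prev_progress, out.progress
```

Rewarding the progress delta rather than the raw level is one way to avoid paying the agent repeatedly for standing still; the paper's exact shaping may differ.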
To elicit robust reasoning, we build the training data in two stages: (1) foundational reasoning over space (single-image + depth) and time (multi-image/video), and (2) robot-video spatiotemporal reasoning specialized for dense progress estimation.
We generate over one million CoT reasoning examples from more than 40,000 real-world and simulated videos.
We also carefully curate a diverse collection of general spatial and multi-frame temporal reasoning data to serve as a foundational layer of our training mixture.
Together, this training induces video-native reasoning that explicitly integrates both spatial and temporal structure (Figure 2).
We evaluate whether SOLE-R1 can serve as the sole supervision signal for learning manipulation skills from scratch via online RL.
We run experiments across four simulation benchmark suites (RoboSuite, ManiSkill, Meta-World, and LIBERO) and in a real-world tabletop manipulation setting with a Franka arm.
Across all settings, we evaluate a total of 41 tasks, spanning pick-and-place, articulation, button/lever/knob interactions, and mobile manipulation.
All hyperparameters and details are provided in Appendix G. Video demos are available at https://sole-r1.github.io/.
We use a SERL implementation of DrQv2 as the learning algorithm. The policy observes two RGB streams (a wrist camera and an external/shoulder camera) along with robot proprioception.
Actions are end-effector delta motions and a gripper open/close command.
We do not use any additional privileged state, depth, object poses, or task-specific sensors.
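The observation and action interface above can be summarized in a short sketch. All shapes and bounds here are illustrative assumptions (the paper's exact resolutions, proprioception layout, and action scaling are not specified in this section); only the structure, two RGB streams plus proprioception in, a 7-D end-effector delta plus gripper command out, follows the text.

```python
import numpy as np

# Illustrative observation spec (shapes are assumptions, not the paper's):
OBS_SPEC = {
    "wrist_rgb":    (128, 128, 3),  # wrist-mounted camera
    "external_rgb": (128, 128, 3),  # external/shoulder camera
    "proprio":      (8,),           # e.g., EE pose + gripper width (assumed)
}

def clip_action(a: np.ndarray, max_delta: float = 0.05) -> np.ndarray:
    # Action: 6-DoF end-effector delta motion plus a gripper open/close
    # command in the last dimension; bounds are illustrative.
    assert a.shape == (7,)
    ee = np.clip(a[:6], -max_delta, max_delta)    # bounded delta motion
    grip = np.array([1.0 if a[6] > 0 else -1.0])  # binarized open/close
    return np.concatenate([ee, grip])
```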
Unlike prior work that (i) learns from ground-truth rewards and/or (ii) tunes reward models or policies on task demonstrations, we evaluate in a fully zero-shot online RL setting, with no ground-truth rewards and no tuning on task demonstrations.
SOLE-R1 achieves at least 50% success on 24 tasks, substantially outperforming all baselines (Figure 3). The strongest baselines, GPT-5 and Gemini, reach 50% success on only 7 and 5 tasks, respectively. The non-reasoning models achieve near-zero success on most tasks, with the exception of ReWiND in Meta-World, which achieves higher success because it is trained on hundreds of Meta-World demonstrations.
SOLE-R1 generalizes to unseen tasks and environments.
SOLE-R1 succeeds on tasks that differ significantly from the task types seen during training, such as sliding a puck into a net, opening and closing windows, and manipulating unseen levers and handles in novel ways based on the natural-language task specification.
This suggests that SOLE-R1 does not merely memorize task templates, but instead learns reusable spatiotemporal progress primitives (e.g., establishing contact, aligning a grasp, changing articulation state, placing/settling objects) that transfer to unseen tasks.
SOLE-R1 generalizes to unseen embodiments and camera viewpoints.
SOLE-R1 solves tasks with the Franka, along with embodiments not seen during training, including the Sawyer robot in Meta-World, the WidowX AI and Fetch Mobile Manipulator in ManiSkill, and a modified Franka with different gripper fingers and a different wrist-camera angle in the real world.
We also observe SOLE-R1 solving tasks under camera views that were not used during training. This indicates that SOLE-R1's reward predictions are not narrowly tied to a particular kinematic chain or gripper appearance, but instead track goal-relevant object-state changes across morphologies and camera placements.
We analyze failures at two levels: (i) task-level reward pathologies (is the reward exploitable, miscalibrated, or too weak to drive learning?) and (ii) frame-level reasoning errors (what perceptual/temporal evidence is missed when progress is incorrect).
We use the perceived-vs-true success plot (Figure 4) to separate failures into two types: reward-hacking (high perceived, low true) versus signal-limited (low perceived, low true).
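The two-axis taxonomy amounts to a simple decision rule on each task's perceived and true success rates. A minimal sketch, with an illustrative threshold of 0.5 (the function name and threshold are assumptions, not the paper's):

```python
def classify_failure(perceived: float, true: float, thresh: float = 0.5) -> str:
    # Compare the rewarder's perceived success rate against ground truth.
    if true >= thresh:
        return "success"
    # Task failed: did the rewarder notice?
    return "reward-hacking" if perceived >= thresh else "signal-limited"
```

Reward-hacking cases (high perceived, low true) indicate an exploitable reward model; signal-limited cases (low perceived, low true) indicate a reward that is honest but too weak to drive learning.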
General-purpose VLM rewarders (GPT-5 and Gemini) predominantly fail via reward hacking: online RL discovers behaviors that elicit inflated progress predictions without completing the task.
We show an example of reward hacking on the cube-picking task in Figure 4 and an extended set of examples in Figure ??. SOLE-R1 failures more often fall into the signal-limited type: the model typically recognizes non-success but can still produce rewards that are too flat or noisy to bootstrap exploration within the episode budget (as shown by the correlation analyses between predicted and true rewards in Appendix C).
Qualitative review of rollouts highlights three recurring SOLE-R1 error modes: (1) temporal under-detection of brief events (contact, latch release, button actuation, insertion “click”), especially when they occur between reward-query steps or under occlusion; (2) ambiguous object state in clutter or partial views (uncertain grasp, insertion, or seating), where conservative progress estimates reduce hacking but weaken stepping-stone reinforcement; and (3) occasional goal-consistent appearance shortcuts (e.g., proximity or alignment scored as partial progress), which typically saturate at moderate progress rather than full completion.
We find that our data-synthesis and training recipe follows a scaling law driven by the diversity of training tasks (Figure 6). We train variants of SOLE-R1 with an increasing number of task types included in the training-data synthesis (details in Appendix M). Figure 6 plots the number of downstream tasks that achieve different success thresholds as a function of training-task diversity.
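The plotted quantity is a simple aggregation over per-task success rates. A minimal sketch (the function name and thresholds are illustrative, not the paper's):

```python
def tasks_above_threshold(success_rates, thresholds=(0.25, 0.5, 0.75)):
    # success_rates: final per-task success rates for one model variant,
    # i.e., one level of training-task diversity.
    return {t: sum(s >= t for s in success_rates) for t in thresholds}
```

Computing this per variant and plotting it against the number of training task types yields one curve per threshold.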