We design SOLE-R1 to perform video-native temporal reasoning for goal-conditioned tasks. Given a natural-language goal and a video stream of observations, the model produces (i) a per-timestep, multi-frame CoT explanation, grounded in visual evidence, that describes what has changed since the last timestep and what remains to be done, and (ii) a dense scalar progress estimate used as a reward signal for online RL.
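The per-timestep interface can be sketched as follows. This is a minimal illustration of the reward-query loop, not the paper's API: the names `ProgressOutput`, `query_progress`, and `dense_reward` are hypothetical, and the stub model and progress-difference shaping are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProgressOutput:
    cot: str         # per-timestep explanation: what changed, what remains
    progress: float  # dense scalar progress estimate in [0, 1]

def query_progress(goal: str, frames: list) -> ProgressOutput:
    # Stub standing in for a SOLE-R1 forward pass over the frame history.
    done = min(len(frames) / 10.0, 1.0)
    return ProgressOutput(
        cot=f"{len(frames)} frames observed toward goal: '{goal}'",
        progress=done,
    )

def dense_reward(goal: str, frames: list, prev_progress: float):
    # One common shaping choice (assumed here): reward the *change* in
    # predicted progress, so the episode return telescopes to total progress.
    out = query_progress(goal, frames)
    return out.progress - prev_progress, out.progress
```

Rewarding the progress delta rather than the raw level is one way to avoid paying the agent repeatedly for standing still; the paper's exact shaping may differ.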
To elicit robust reasoning, we build the training data in two stages: (1) foundational reasoning over space (single-image + depth) and time (multi-image/video), and (2) robot-video spatiotemporal reasoning specialized for dense progress estimation.
We generate over one million CoT reasoning examples from more than 40,000 real-world and simulated videos.
We also carefully curate a diverse collection of general spatial and multi-frame temporal reasoning data to serve as a foundational layer of our training mixture.
Together, this training induces video-native reasoning that explicitly integrates both spatial and temporal structure (Figure 2).
We evaluate whether SOLE-R1 can serve as the sole supervision signal for learning manipulation skills from scratch via online RL.
We run experiments across four simulation benchmark suites (RoboSuite, ManiSkill, Meta-World, and LIBERO) and in a real-world tabletop manipulation setting with a Franka arm.
Across all settings, we evaluate a total of 41 tasks, spanning pick-and-place, articulation, button/lever/knob interactions, and mobile manipulation.
All hyperparameters and details are provided in Appendix G. Video demos are available at https://sole-r1.github.io/.
We use a SERL implementation of DrQv2 as the learning algorithm. The policy observes two RGB streams (a wrist camera and an external/shoulder camera) along with robot proprioception.
Actions are end-effector delta motions and a gripper open/close command.
We do not use any additional privileged state, depth, object poses, or task-specific sensors.
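The observation and action interface above can be summarized in a short sketch. All shapes and bounds here are illustrative assumptions (the paper's exact resolutions, proprioception layout, and action scaling are not specified in this section); only the structure, two RGB streams plus proprioception in, a 7-D end-effector delta plus gripper command out, follows the text.

```python
import numpy as np

# Illustrative observation spec (shapes are assumptions, not the paper's):
OBS_SPEC = {
    "wrist_rgb":    (128, 128, 3),  # wrist-mounted camera
    "external_rgb": (128, 128, 3),  # external/shoulder camera
    "proprio":      (8,),           # e.g., EE pose + gripper width (assumed)
}

def clip_action(a: np.ndarray, max_delta: float = 0.05) -> np.ndarray:
    # Action: 6-DoF end-effector delta motion plus a gripper open/close
    # command in the last dimension; bounds are illustrative.
    assert a.shape == (7,)
    ee = np.clip(a[:6], -max_delta, max_delta)    # bounded delta motion
    grip = np.array([1.0 if a[6] > 0 else -1.0])  # binarized open/close
    return np.concatenate([ee, grip])
```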
Unlike prior work that (i) learns from ground-truth rewards and/or (ii) tunes reward models or policies on task demonstrations, we evaluate in a fully zero-shot online RL setting, with no ground-truth rewards and no tuning on task demonstrations.
SOLE-R1 achieves at least 50% success on 24 tasks, substantially outperforming all baselines (Figure 3). The strongest baselines, GPT-5 and Gemini, reach 50% success on only 7 and 5 tasks, respectively. The non-reasoning models achieve near-zero success on most tasks, with the exception of ReWiND in Meta-World, which achieves higher success because it is trained on hundreds of Meta-World demonstrations.
SOLE-R1 generalizes to unseen tasks and environments.
SOLE-R1 succeeds on tasks that differ significantly from the task types seen during training, such as sliding a puck into a net, opening and closing windows, and manipulating unseen levers and handles in novel ways based on the natural-language task specification.
This suggests that SOLE-R1 does not merely memorize task templates, but instead learns reusable spatiotemporal progress primitives (e.g., establishing contact, aligning a grasp, changing articulation state, placing/settling objects) that transfer to unseen tasks.
SOLE-R1 generalizes to unseen embodiments and camera viewpoints.
SOLE-R1 solves tasks with the Franka, along with embodiments not seen during training, including the Sawyer robot in Meta-World, the WidowX AI and Fetch Mobile Manipulator in ManiSkill, and a modified Franka with different gripper fingers and a different wrist-camera angle in the real world.
We also observe SOLE-R1 solving tasks under camera views that were not used during training. This indicates that SOLE-R1's reward predictions are not narrowly tied to a particular kinematic chain or gripper appearance, but instead track goal-relevant object-state changes across morphologies and camera placements.
We analyze failures at two levels: (i) task-level reward pathologies (is the reward exploitable, miscalibrated, or too weak to drive learning?) and (ii) frame-level reasoning errors (what perceptual/temporal evidence is missed when progress is incorrect).
We use the perceived-vs-true success plot (Figure 4) to separate failures into two types: reward-hacking (high perceived, low true) versus signal-limited (low perceived, low true).
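The two-axis taxonomy amounts to a simple decision rule on each task's perceived and true success rates. A minimal sketch, with an illustrative threshold of 0.5 (the function name and threshold are assumptions, not the paper's):

```python
def classify_failure(perceived: float, true: float, thresh: float = 0.5) -> str:
    # Compare the rewarder's perceived success rate against ground truth.
    if true >= thresh:
        return "success"
    # Task failed: did the rewarder notice?
    return "reward-hacking" if perceived >= thresh else "signal-limited"
```

Reward-hacking cases (high perceived, low true) indicate an exploitable reward model; signal-limited cases (low perceived, low true) indicate a reward that is honest but too weak to drive learning.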
General-purpose VLM rewarders (GPT-5 and Gemini) predominantly fail via reward hacking: online RL discovers behaviors that elicit inflated progress predictions without completing the task.
We show an example of reward hacking on the cube-picking task in Figure 4 and an extended set of examples in Figure ??. SOLE-R1 failures more often fall into the signal-limited type: the model typically recognizes non-success but can still produce rewards that are too flat or noisy to bootstrap exploration within the episode budget (as shown by the correlation analyses between predicted and true rewards in Appendix C).
Qualitative review of rollouts highlights three recurring SOLE-R1 error modes: (1) temporal under-detection of brief events (contact, latch release, button actuation, insertion “click”), especially when they occur between reward-query steps or under occlusion; (2) ambiguous object state in clutter or partial views (uncertain grasp, insertion, or seating), where conservative progress estimates reduce hacking but weaken stepping-stone reinforcement; and (3) occasional goal-consistent appearance shortcuts (e.g., proximity or alignment scored as partial progress), which typically saturate at moderate progress rather than full completion.
We find that our data-synthesis and training recipe follows a scaling law driven by the diversity of training tasks (Figure 6). We train variants of SOLE-R1 with an increasing number of task types included in the training-data synthesis (details in Appendix M). Figure 6 plots the number of downstream tasks that achieve different success thresholds as a function of training-task diversity.
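The plotted quantity is a simple aggregation over per-task success rates. A minimal sketch (the function name and thresholds are illustrative, not the paper's):

```python
def tasks_above_threshold(success_rates, thresholds=(0.25, 0.5, 0.75)):
    # success_rates: final per-task success rates for one model variant,
    # i.e., one level of training-task diversity.
    return {t: sum(s >= t for s in success_rates) for t in thresholds}
```

Computing this per variant and plotting it against the number of training task types yields one curve per threshold.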