Why Annotation Matters

A robot teleoperation session produces a trajectory: a time-indexed sequence of joint states, camera frames, and end-effector poses. This is data in the physical sense, but it is not yet training data in the sense that most policy learning algorithms require. Algorithms like ACT, diffusion policy, and behavior cloning need to know which episodes succeeded (to train on) and which failed (to exclude or weight down). Foundation model fine-tuning requires language instructions that describe what the robot was doing. Phase-conditioned policies require segment boundaries. Contact-learning algorithms require timestamps of first contact, stable grasp, and release events.

Annotation is the process of adding this structured information. It's time-consuming, requires human judgment, and is easy to do wrong in ways that degrade policy quality. A dataset with inconsistent success labels will produce a policy that has learned to imitate some failures. A dataset with vague language instructions will produce a policy that cannot generalize to instructions at inference time.

The Five Annotation Types

  • 1. Task Success / Failure: Binary pass/fail is the minimum. For training purposes, binary is often sufficient, but a 0–100 partial credit score adds information: a grasp that succeeded but was slow and awkward is different from one that was fast and clean. Partial credit scores of 0–30 (clear failure), 31–69 (partial/marginal), 70–100 (success with quality gradient) provide better signal for curriculum learning and data weighting.
  • 2. Language Instruction: "Pick up the red cup" is better than "pick up the object." "Grasp the cylindrical red plastic cup from its body, not the rim, and place it upright on the white tray" is better still for fine-grained tasks. The instruction should specify the task-relevant object attributes (color, shape, material where relevant), the desired grasp strategy if specific, and the goal state. Avoid object-name-only instructions ("pick up the cup") for datasets intended for foundation model fine-tuning — these are too low-information.
  • 3. Segment Labels: Dividing each episode into phases (reach, grasp, transport, place) enables phase-conditioned policies and more targeted data analysis. Minimum four phases for standard pick-place tasks. Assembly tasks may require 8–12 segments. Segment boundaries should be marked at the video frame level, not just the action index.
  • 4. Contact Event Timestamps: For contact-learning tasks (insertion, assembly, surface following), precise timestamps for first contact (gripper touches object), stable grasp (contact force stabilizes above threshold), and release (gripper opens, object free) are essential for learning contact-conditioned behaviors. Manual annotation error on contact timestamps should be <5 frames (167ms at 30fps). Automated detection from F/T sensor data is preferable where available.
  • 5. Quality Scores: Beyond success/failure, per-episode quality metrics inform data weighting during training. Smoothness score (inverse of mean jerk over episode), success confidence (annotator's certainty about the success label), and difficulty estimate (for curriculum learning prioritization) are the three most useful.
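The smoothness score named above (inverse of mean jerk) can be computed directly from joint-state logs. A minimal sketch, assuming positions sampled at a fixed rate; the function name and the 1/(1 + mean jerk) squashing are illustrative choices, not a standard:

```python
import numpy as np

def smoothness_score(joint_positions, dt):
    """Smoothness quality score from mean jerk (third derivative of position).

    Args:
        joint_positions: (T, J) array of joint angles sampled every dt seconds.
        dt: sample period in seconds.
    Returns:
        Scalar in (0, 1]; higher means smoother motion.
    """
    # Third-order finite difference approximates jerk per joint.
    jerk = np.diff(joint_positions, n=3, axis=0) / dt**3
    mean_jerk = np.abs(jerk).mean()
    # Squash to (0, 1] so a zero-jerk trajectory scores exactly 1.0.
    return 1.0 / (1.0 + mean_jerk)
```

A linear (constant-velocity) trajectory has zero jerk and scores 1.0; noisy teleoperation scores lower, which is the gradient the data-weighting step needs.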

Annotation Methods Comparison

| Method | Accuracy | Cost | Throughput | Best For |
|---|---|---|---|---|
| Expert labeler (domain knowledge) | Highest (95%+) | High ($40–60/hr) | 20–40 episodes/hr | Ground truth, gold standard, contact events |
| Trained crowdsource (MT, Scale) | Medium (80–90%) | Medium ($5–15/episode) | 100–500/hr at scale | Success/failure, language, segment labels |
| Naive crowdsource (MTurk open) | Lower (70–80%) | Low ($1–5/episode) | High | Simple success/failure on clear tasks only |
| Trained classifier (CNN/LSTM) | Medium-high (88–93%) | Very low (compute only) | Thousands/hr | Success/failure at scale, auto-annotation |
| Active learning loop | High (improves with data) | Decreasing per label | High after warmup | Large datasets with expert budget constraint |

Inter-Annotator Agreement

Cohen's kappa is the standard measure of inter-annotator agreement, adjusted for chance agreement. For robot annotation:

  • Kappa > 0.8 (strong agreement): The task definition and annotation protocol are clear enough that annotators apply them consistently. This is the target for production annotation pipelines. Most simple success/failure tasks reach this level with proper training.
  • Kappa 0.6–0.8 (moderate agreement): Requires attention. Force reconciliation meetings where annotators discuss disagreed cases and update the protocol until the definition of edge cases is explicit. Do not ship a dataset with kappa in this range without reconciliation.
  • Kappa < 0.6 (poor agreement): The task definition is ambiguous. Stop annotation, redesign the protocol with clearer success criteria, and re-annotate from scratch. Training on data annotated with kappa < 0.6 produces policies with inconsistent behavior that is extremely difficult to debug.
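Cohen's kappa is simple enough to compute inline when checking a batch. A self-contained sketch for two annotators' categorical labels (scikit-learn's `cohen_kappa_score` does the same thing if that dependency is available):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement adjusted for chance agreement.

    Undefined (division by zero) when both annotators use a single
    identical label for every item.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(counts_a) | set(counts_b)) / n**2
    return (observed - expected) / (1 - expected)
```

Compute kappa per batch, not once per dataset, so a drifting annotator or an ambiguous new task variant shows up early.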

Automated Annotation

Two automated annotation approaches are production-ready:

  • Success Classifier CNN: Fine-tuned ResNet-50 on the final 10 frames of each episode, binary success/failure output. SVRC's internal classifier achieves 92% accuracy on held-out test sets across standard manipulation tasks. Requires 100+ labeled examples per task to train reliably. Use for large datasets after human-labeled training set is established.
  • Segment Detector (HMM on joint velocity + F/T): Hidden Markov Model trained on joint velocity profiles and F/T readings to detect phase boundaries. Works without any visual processing, making it fast and robust to camera issues. Achieves approximately 85% segment boundary accuracy within ±3 frames on standard pick-place tasks.

Annotation Complexity by Task Type

| Task Type | Annotation Types Needed | Time per Episode | Difficulty |
|---|---|---|---|
| Simple pick-place | Success/fail + language | 30-60 sec | Low |
| Multi-step assembly | Success/fail + segments (8-12) + contact events + language | 3-5 min | High |
| Precision insertion | Success/fail + contact events + quality score + F/T annotation | 2-4 min | High |
| Drawer/cabinet opening | Success/fail + segments (4-6) + language | 1-2 min | Medium |
| Cloth/deformable manipulation | Partial credit (0-100) + segments + quality score + language | 3-6 min | Very high |
| Bimanual coordination | Success/fail + per-arm segments + synchronization labels + language | 4-8 min | Very high |

The cost multiplier from simple pick-place to bimanual annotation is roughly 8x. Teams that budget annotation time based on their simplest tasks and then move to more complex tasks are consistently surprised by the real cost. Plan annotation resources per task type, not per episode count.
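Per-task-type budgeting is a few lines of arithmetic. A sketch using rough midpoints of the per-episode times in the table above; the dictionary keys and minute values are illustrative and should be replaced with your own measured rates:

```python
# Assumed annotation minutes per episode (midpoints of the table above).
MINUTES_PER_EPISODE = {
    'simple_pick_place': 0.75,
    'multi_step_assembly': 4.0,
    'precision_insertion': 3.0,
    'drawer_cabinet': 1.5,
    'deformable': 4.5,
    'bimanual': 6.0,
}

def annotation_hours(plan):
    """Total annotation hours for a plan of {task_type: episode_count}."""
    return sum(MINUTES_PER_EPISODE[t] * n for t, n in plan.items()) / 60.0
```

For example, 500 pick-place episodes plus 500 bimanual episodes is not 1000 "episodes of annotation"; it is roughly 6 hours for the first half and 50 for the second.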

Timestamp Alignment Challenges

Robot demonstration data comes from multiple asynchronous sources: cameras at 30-60 fps, joint state readings at 100-500 Hz, F/T sensors at 500-1000 Hz, and gripper state at variable rates. Aligning these streams to a common timeline is a prerequisite for meaningful annotation. A contact event timestamp is useless if the camera frame it corresponds to is offset by 50ms from the actual contact because the camera and joint state clocks drifted.

The standard approach is hardware-triggered synchronization: a master clock generates a trigger pulse that simultaneously triggers camera frame capture and timestamps the joint state and F/T sensor readings. Without hardware sync, software-based timestamps have typical jitter of 5-20ms on a standard Linux system and 1-5ms on a real-time (PREEMPT_RT) kernel. For annotation purposes, software sync is adequate for success/failure and language labels but insufficient for contact event timestamps on tasks with tight timing requirements (insertion, snap-fit assembly).

Common alignment pitfalls to watch for: USB cameras that buffer frames internally (adding 30-100ms of unknown latency), network cameras (IP cameras) that add variable encoding delay, and ROS2 topic timestamps that reflect publication time rather than actual measurement time. Always verify alignment by recording a known physical event (dropping an object onto a surface with an F/T sensor) and checking that the visual contact frame and the F/T spike align within your required tolerance.
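The drop-test verification in the previous paragraph can be automated. A sketch that locates the F/T spike and compares it to the timestamp of the frame where contact is visible; the function name and 50ms default tolerance are our choices:

```python
import numpy as np

def check_alignment(ft_force, ft_timestamps, visual_contact_time, tolerance=0.05):
    """Verify camera/F-T alignment using a known impact event (drop test).

    Args:
        ft_force: (N,) force magnitude samples from the F/T sensor.
        ft_timestamps: (N,) timestamps in seconds (same clock as the camera).
        visual_contact_time: timestamp of the frame where contact is visible.
        tolerance: max acceptable offset in seconds.
    Returns:
        (offset_seconds, within_tolerance)
    """
    # The impact shows up as the largest sample-to-sample force jump.
    spike_idx = int(np.argmax(np.abs(np.diff(ft_force))))
    offset = ft_timestamps[spike_idx] - visual_contact_time
    return offset, abs(offset) <= tolerance
```

Run this once per session, not once per rig: clock drift and USB buffering can change between reboots.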

Annotating Failure Cases: Partial Successes and Near-Misses

Binary success/failure annotation is the minimum, but it discards critical information about how and why episodes failed. A policy trained only on binary-filtered data treats all failures as equally uninformative, when in fact there is a spectrum from "robot moved in the wrong direction" (low information) to "robot grasped the object but dropped it during transport" (high information -- the grasp strategy was correct, the transport phase needs work).

A practical partial-success taxonomy for manipulation tasks:

  • Score 0-20 (complete failure): Robot did not approach the object or approached entirely wrong region. Exclude from training data.
  • Score 21-40 (approach correct, grasp failed): Robot reached the correct region but failed to establish a stable grasp. Useful for training approach policies; exclude from full-task training.
  • Score 41-60 (grasp succeeded, task failed): Robot grasped the object but failed during transport, placement, or a subsequent phase. Include with reduced weight (0.3-0.5x) for training, or use for phase-specific training.
  • Score 61-80 (task mostly complete, imprecise): Robot completed the task but with poor quality -- object placed in roughly the right area but not precisely, or task completed slowly with jerky motions. Include at reduced weight (0.7x) or use for quality-aware training.
  • Score 81-100 (success with quality gradient): Task completed successfully. Higher scores indicate smoother, faster, more precise execution. Include at full weight.

Training with partial-success scores using weighted sampling consistently outperforms binary filtering. In SVRC's experiments, weighted training on a dataset with 30% partial failures produced policies that were 8-12% more robust than policies trained only on the top-scored 70% of episodes.
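The taxonomy above maps directly to per-episode sample weights for weighted training. A sketch; the boundary weights (0.4, 0.7) are illustrative picks from the ranges given in the list:

```python
def sample_weight(score):
    """Map a partial-success score (0-100) to a training sample weight,
    following the five-band taxonomy above. Weights are illustrative."""
    if score <= 20:
        return 0.0   # complete failure: exclude
    if score <= 40:
        return 0.0   # approach-only: exclude from full-task training
    if score <= 60:
        return 0.4   # grasp ok, task failed: reduced weight (0.3-0.5x band)
    if score <= 80:
        return 0.7   # mostly complete but imprecise
    return 1.0       # success: full weight
```

These weights feed a weighted sampler (e.g. per-episode sampling probability proportional to weight) rather than a hard include/exclude filter.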

Quality Gates: When to Reject Annotations

Not all annotations are usable. Define explicit quality gates that trigger re-annotation or episode exclusion:

  • Reject if: the annotator viewed less than 80% of the episode video before labeling (detectable via annotation tool analytics).
  • Reject if: annotation time is below 60% of the expected minimum (speed-running annotations are unreliable -- a 30-second episode reviewed in 5 seconds was not properly evaluated).
  • Reject if: success/failure label disagrees with automated classifier output AND the annotator did not flag the episode as borderline (indicates inattention, not genuine disagreement).
  • Reject if: language instruction does not match the visible task in the video (generic copy-paste labels are surprisingly common with crowdsource annotation).
  • Reject if: segment boundary annotations are missing for any phase (incomplete annotations are worse than no annotations because they create training artifacts).
  • Flag for review if: two annotators disagree on success/failure. Both annotations go to a reconciliation queue where a senior annotator adjudicates and the annotation protocol is updated if the disagreement reveals an ambiguity.

At SVRC, approximately 8-12% of initial annotations fail quality gates and are re-annotated. This rejection rate is normal and healthy -- it indicates the quality gates are functioning. A 0% rejection rate means the gates are too permissive.
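The reject-if rules above are mechanical enough to run automatically before human QA. A sketch covering four of them; the `ann` record schema (key names) is an assumption, not a standard format:

```python
def passes_quality_gates(ann, expected_min_seconds):
    """Return (ok, reasons) for one annotation record.

    `ann` is an assumed dict with keys: 'view_fraction',
    'annotation_seconds', 'label', 'classifier_label',
    'flagged_borderline', 'instruction'.
    """
    reasons = []
    if ann['view_fraction'] < 0.8:
        reasons.append('viewed <80% of episode video')
    if ann['annotation_seconds'] < 0.6 * expected_min_seconds:
        reasons.append('annotation faster than 60% of expected minimum')
    if ann['label'] != ann['classifier_label'] and not ann['flagged_borderline']:
        reasons.append('disagrees with classifier without borderline flag')
    if not ann['instruction'].strip():
        reasons.append('missing language instruction')
    return len(reasons) == 0, reasons
```

Returning the reasons list, rather than a bare boolean, makes the 8-12% rejection stream auditable: a spike in one reason usually points to one annotator or one protocol gap.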

Annotation Tools Comparison

| Tool | Type | Robot-Specific Features | Cost |
|---|---|---|---|
| Label Studio | Open-source | Video timeline annotation; custom label schemas; Python SDK for automation | Free / Enterprise |
| CVAT | Open-source | Frame-level annotation; interpolation; multi-track timeline | Free |
| Scale AI | Managed service | Trained annotator workforce; quality management; custom ontologies | $5-15/episode |
| Supervisely | Cloud platform | Video annotation; neural network-assisted labeling; team management | Free tier / paid |
| SVRC Platform | Integrated | Robot-native: synchronized multi-camera + joint state + F/T playback; auto-classifier pre-labels; kappa tracking | Included with data services |

For robot-specific annotation, general-purpose tools (Label Studio, CVAT) require significant customization to handle synchronized multi-modal data. The key missing feature in general tools is synchronized playback of video, joint state plots, and F/T sensor data on a common timeline. SVRC's annotation interface was built specifically for this use case.

ROS2 Timestamp Synchronization: Practical Implementation

For teams using ROS2 as their data collection middleware, here is the recommended approach to achieve sub-5ms timestamp alignment across all sensor modalities.

# sync_recorder.py -- ROS2 message_filters for synchronized recording
import rclpy
from rclpy.node import Node
from message_filters import ApproximateTimeSynchronizer, Subscriber
from sensor_msgs.msg import Image, JointState
from geometry_msgs.msg import WrenchStamped
import numpy as np
import h5py  # episode buffers are flushed to HDF5 on save (writer not shown)

class SyncRecorder(Node):
    def __init__(self):
        super().__init__('sync_recorder')
        # Subscribe to all sensor topics
        self.cam_sub = Subscriber(self, Image, '/camera/color/image_raw')
        self.joint_sub = Subscriber(self, JointState, '/joint_states')
        self.ft_sub = Subscriber(self, WrenchStamped, '/ft_sensor/wrench')

        # ApproximateTimeSynchronizer: 50ms slop tolerance
        self.sync = ApproximateTimeSynchronizer(
            [self.cam_sub, self.joint_sub, self.ft_sub],
            queue_size=10,
            slop=0.05  # 50ms max allowed timestamp difference
        )
        self.sync.registerCallback(self.synced_callback)
        self.episode_data = []

    def synced_callback(self, img_msg, joint_msg, ft_msg):
        """Called only when all three messages have near-matching timestamps."""
        timestamp = img_msg.header.stamp.sec + img_msg.header.stamp.nanosec * 1e-9
        self.episode_data.append({
            'timestamp': timestamp,
            # Use the message's own dimensions instead of hard-coding 480x640
            'image': np.frombuffer(img_msg.data, dtype=np.uint8).reshape(
                img_msg.height, img_msg.width, -1),
            'joint_positions': np.array(joint_msg.position),
            'joint_velocities': np.array(joint_msg.velocity),
            'wrench': np.array([
                ft_msg.wrench.force.x, ft_msg.wrench.force.y, ft_msg.wrench.force.z,
                ft_msg.wrench.torque.x, ft_msg.wrench.torque.y, ft_msg.wrench.torque.z
            ]),
        })

def main():
    rclpy.init()
    node = SyncRecorder()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Key implementation notes:

  • Use ApproximateTimeSynchronizer with a 50ms slop for software-synced systems. For hardware-triggered cameras, reduce slop to 10ms.
  • Always use the message header timestamp, not the node clock (`self.get_clock().now()`), which reflects processing time, not measurement time.
  • Log the actual timestamp differences between synchronized messages. If the mean exceeds 20ms, investigate whether USB bandwidth or CPU load is causing delays.
  • For annotation purposes, store timestamps in the HDF5 file alongside the data so that annotators can verify alignment during review.
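The timestamp-difference logging in note (3) is a small post-hoc check. A sketch, assuming you kept the header timestamps of every accepted tuple per stream (the function name and dict keys are ours):

```python
import numpy as np

def sync_offset_stats(cam_ts, joint_ts, ft_ts):
    """Per-tuple timestamp spread across synchronized messages, in seconds.

    Each argument is a (N,) array of header timestamps for the N tuples
    the synchronizer accepted.
    """
    stamps = np.stack([cam_ts, joint_ts, ft_ts], axis=1)  # (N, 3)
    # Spread = worst pairwise offset within each synchronized tuple.
    spread = stamps.max(axis=1) - stamps.min(axis=1)
    return {'mean': float(spread.mean()), 'max': float(spread.max())}
```

A mean spread creeping toward the slop value means the synchronizer is barely matching messages and real offsets may be worse than they look.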

Expanded Quality Gates: Production Annotation Pipeline

Beyond the basic quality gates, production annotation pipelines should implement these additional checks.

| Quality Gate | Check Method | Threshold | Action if Failed |
|---|---|---|---|
| Timestamp alignment | Max camera-joint offset per episode | < 50ms (software) / < 10ms (hardware) | Discard episode; debug sync pipeline |
| Frame drops | Count gaps > 2x expected frame interval | < 3% of frames dropped | Interpolate if < 3 consecutive; discard episode if > 3 consecutive |
| Joint limit violation | Monitor joint positions against hard limits | Any joint within 2 deg of a hard limit | Flag for review; include if task succeeded, exclude if a safety stop was triggered |
| Episode duration outlier | Compare duration to the task mean | Duration > 3 sigma from task mean | Flag for review; watch the video, which may reveal operator hesitation or an unusual strategy |
| Language label uniqueness | Detect identical labels on visually different episodes | > 20% duplicate labels in a batch | Re-annotate batch; suspect copy-paste |
| Gripper state consistency | Verify gripper open at start, closed during transport | Deviation from the expected phase sequence | Flag episodes with an anomalous gripper sequence |
| F/T spike detection | Check peak force against the task maximum | Force exceeds 2x task maximum | Flag for review; may indicate collision or unsafe contact, exclude from training |
| Camera image quality | Laplacian variance (blur detection) | Variance < 100 (blurry image) | Discard blurry episodes; re-focus camera |

Implementing these quality gates as automated checks in your data pipeline catches 70-80% of data quality issues before human review. The remaining 20-30% require human judgment and are handled through the standard annotation QA process. SVRC runs all these checks automatically on every collected episode before annotation begins.
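The blur gate from the table is one of the simplest to implement. A sketch using SciPy's Laplacian filter; the threshold of 100 matches the table but should be calibrated per camera and scene, since Laplacian variance depends on image content as well as focus:

```python
import numpy as np
from scipy import ndimage

def is_in_focus(gray_image, threshold=100.0):
    """Blur gate: variance of the Laplacian response.

    Focused images have strong edges, so the Laplacian response has high
    variance; blurry or flat images score low.
    """
    lap = ndimage.laplace(gray_image.astype(np.float64))
    return float(lap.var()) >= threshold
```

OpenCV users can substitute `cv2.Laplacian(img, cv2.CV_64F).var()`; the gate logic is identical.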

Automated Contact Event Detection from F/T Data

Contact event timestamps are among the most time-consuming annotations to produce manually. For hardware setups that include F/T sensors, automated detection can achieve frame-level accuracy with minimal human oversight.

# contact_detector.py -- Automated contact event detection from F/T data
import numpy as np
from scipy.signal import savgol_filter

class ContactEventDetector:
    """Detect first contact, stable grasp, and release from F/T readings."""

    def __init__(self, force_threshold=2.0, stable_window=10, release_threshold=0.5):
        self.force_threshold = force_threshold    # Newtons
        self.stable_window = stable_window        # frames
        self.release_threshold = release_threshold  # Newtons

    def detect_events(self, ft_data, timestamps):
        """
        Args:
            ft_data: (N, 6) array of [fx, fy, fz, tx, ty, tz]
            timestamps: (N,) array of timestamps in seconds
        Returns:
            dict with 'first_contact', 'stable_grasp', 'release' timestamps
        """
        force_magnitude = np.linalg.norm(ft_data[:, :3], axis=1)
        # Smooth to remove sensor noise
        force_smooth = savgol_filter(force_magnitude, window_length=7, polyorder=2)

        events = {}
        # First contact: force exceeds threshold for the first time
        contact_mask = force_smooth > self.force_threshold
        if contact_mask.any():
            idx = np.argmax(contact_mask)
            events['first_contact'] = timestamps[idx]

            # Stable grasp: force stays above threshold for stable_window frames
            for i in range(idx, len(force_smooth) - self.stable_window):
                if all(force_smooth[i:i+self.stable_window] > self.force_threshold):
                    events['stable_grasp'] = timestamps[i]
                    break

        # Release: force drops below release_threshold after stable grasp
        if 'stable_grasp' in events:
            grasp_idx = np.searchsorted(timestamps, events['stable_grasp'])
            post_grasp = force_smooth[grasp_idx:]
            release_mask = post_grasp < self.release_threshold
            if release_mask.any():
                rel_idx = grasp_idx + np.argmax(release_mask)
                events['release'] = timestamps[rel_idx]

        return events

This detector achieves 90%+ accuracy on standard pick-place tasks with the default parameters. For insertion tasks, reduce force_threshold to 0.5N and increase stable_window to 20 frames. For high-force tasks (e.g., snap-fit assembly with 20-50N peak forces), increase the threshold proportionally. Always validate automated detection against 50+ manually annotated episodes before trusting it for production annotation.

Annotation Pipeline Architecture

A production annotation pipeline for robot data should follow this architecture, ordered from collection to training-ready output:

  1. Collection layer: Synchronized recording of all modalities (camera, joint state, F/T, gripper) into episode HDF5 files with hardware timestamps.
  2. Automated pre-processing: Run automated quality gates (timestamp alignment, frame drop detection, blur detection) and automated classifiers (success/failure CNN, contact event detector). This layer flags or excludes ~15-20% of episodes before any human reviews them.
  3. Tier 1 annotation (automated + spot-check): For simple tasks with high automated classifier accuracy (>90%), accept auto-labels with 10% human spot-check. Disagreements between auto-label and human go to Tier 2.
  4. Tier 2 annotation (human primary): For complex tasks, borderline cases, and language annotation. Each episode reviewed by two independent annotators. Inter-annotator agreement (kappa) computed per batch.
  5. Reconciliation: Episodes where annotators disagree are reviewed by a senior annotator. Protocol is updated if disagreement reveals ambiguity. Reconciled labels are final.
  6. Export: Annotated episodes exported in training-ready format (LeRobot HDF5 or RLDS) with annotation metadata (annotator ID, annotation time, confidence, kappa score for that batch).

Failure analysis annotations: For failed demonstrations, add a structured failure cause label from a pre-defined taxonomy: (a) perception failure (object not detected or localized incorrectly), (b) grasp failure (object slipped or was not acquired), (c) transport failure (object dropped during movement), (d) placement failure (object placed in wrong location or orientation), (e) timeout (task not completed within time limit). This failure taxonomy enables automated analysis of collection quality and targeted data recollection. For example, if 40% of failures are grasp failures, the operator needs retraining on grasp technique rather than more demonstrations of the full task.
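The targeted-recollection example above (40% grasp failures implying operator retraining) is a simple aggregation over the taxonomy. A sketch; the cause strings and the 40% dominance cutoff are illustrative:

```python
from collections import Counter

# Shortened cause names for the five-way taxonomy described above.
FAILURE_CAUSES = ('perception', 'grasp', 'transport', 'placement', 'timeout')

def failure_report(failure_labels, dominant_fraction=0.4):
    """Summarize failure causes and flag dominant modes for recollection.

    `failure_labels` is a list of cause strings, one per failed episode.
    Returns (fractions per cause, list of causes at/above the cutoff).
    """
    counts = Counter(failure_labels)
    total = sum(counts.values())
    fractions = {c: counts.get(c, 0) / total for c in FAILURE_CAUSES}
    dominant = [c for c, f in fractions.items() if f >= dominant_fraction]
    return fractions, dominant
```

A dominant cause triggers a process fix (retrain the operator, fix the perception stack) rather than blind recollection of more full-task episodes.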

This tiered approach reduces human annotation cost by 60-70% compared to full human annotation while maintaining quality. SVRC runs this pipeline for all data collection engagements, with the automated layers handling routine annotation and human annotators focusing on the cases that require judgment.

Language Annotation Best Practices

Language annotations are critical for VLA fine-tuning and language-conditioned policy training. The quality of your language labels directly impacts the policy's ability to follow instructions at deployment time. Follow these guidelines:

  • Specify the object with at least two attributes. "Pick up the cup" is insufficient. "Pick up the red plastic cup" is minimum. "Pick up the tall red plastic cup from the left side of the table" is ideal for multi-object scenes.
  • Include spatial references when relevant. "Place on the tray" is ambiguous. "Place on the center of the white tray" or "Place on the tray, left of the bowl" provides spatial grounding that helps the policy learn spatial reasoning.
  • Vary language naturally. Do not use the same template for every episode. Alternate between "pick up," "grasp," "grab," and "take" for the same action. Use both "put down" and "place." This variation teaches the policy to handle natural language diversity at inference time.
  • Label at the task level, not the step level, for most tasks. "Pick up the red cup and place it on the tray" is a single task instruction for a pick-and-place episode. Step-level labels ("reach for the cup," "close gripper," "lift," "move right," "open gripper") are needed only for segment-conditioned policies.
  • Use template validation. Define a set of valid object names, colors, and spatial references for your task. Check all annotations against this vocabulary. Misspellings and inconsistent naming (using "mug" in some annotations and "cup" in others for the same object) create unnecessary noise for the language encoder.
  • Include negative instructions. For multi-object scenes, include instructions that specify which object NOT to pick ("pick up the red cup, not the blue one"). Training with negative instructions improves the policy's ability to disambiguate between similar objects, providing 8-12% improvement on multi-object scene success rates compared to positive-only instructions.
  • Annotate after collection, not during. Some teams ask operators to speak task instructions during teleoperation. This splits operator attention and degrades demonstration quality. Instead, annotate language labels in a separate pass after collection. A dedicated annotator watching episode videos can write better instructions (more specific, more consistent) than an operator multi-tasking during teleoperation.
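The template-validation bullet above is easy to automate as a batch check. A minimal sketch; the vocabulary structure (sets of `objects` and `colors` per task) is an assumed convention, and real pipelines would also normalize synonyms like "mug"/"cup":

```python
def validate_instruction(instruction, vocab):
    """Check a language label against a per-task vocabulary.

    `vocab` is an assumed dict with 'objects' and 'colors' sets.
    Returns a list of problems (empty if the label passes).
    """
    words = set(instruction.lower().replace(',', ' ').split())
    problems = []
    if not words & vocab['objects']:
        problems.append('no known object name')
    if not words & vocab['colors']:
        problems.append('no object attribute (color)')
    return problems
```

Run it on every batch before export; it catches the copy-paste and misspelling failures that kappa metrics on success labels never see.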

Annotation Cost Breakdown by Task Complexity

Annotation cost varies dramatically with task complexity and the level of annotation detail required. Understanding these costs is critical for budgeting data collection projects.

| Annotation Type | Time per Episode | Cost per Episode | Automation Potential | Required For |
|---|---|---|---|---|
| Binary success/failure | 5-10 seconds | $0.10-0.25 | 90%+ (CNN classifier) | All training pipelines |
| Language instruction label | 15-30 seconds | $0.25-0.75 | 50-70% (template + VLM) | VLA fine-tuning, language-conditioned policies |
| Phase segmentation (4 phases) | 30-60 seconds | $0.50-1.50 | 60-80% (contact detector + heuristics) | Segment-conditioned training, curriculum learning |
| Partial credit scoring (0-100) | 30-90 seconds | $0.75-2.00 | 20-40% (requires judgment) | Weighted BC training, reward shaping |
| Object 6-DOF pose per frame | 5-15 min | $5-20 | 70-85% (FoundationPose + refinement) | Object-centric policy training, grasp analysis |
| Full semantic scene graph | 10-30 min | $10-40 | 30-50% (VLM + manual review) | Scene understanding, task planning research |
For most imitation learning projects, binary success/failure labels plus language instruction labels are the minimum viable annotation set. This costs $0.35-1.00 per episode at scale, with 60-80% of the annotation automated. Phase segmentation adds value for curriculum learning approaches but doubles the annotation cost. Full 6-DOF pose annotation is only justified for specific research needs -- for standard policy training, it provides minimal benefit over the cheaper annotation types.

Scaling Annotation with VLMs: Using GPT-4V and Gemini for Robot Data

Vision-language models (VLMs) are increasingly useful for automating annotation tasks that previously required human judgment. Here is how SVRC integrates VLMs into the annotation pipeline.

Language instruction generation. Send the first and last frame of an episode to GPT-4V or Gemini with the prompt: "Describe the manipulation task shown in these before/after images. Be specific about the object (color, material, shape) and the action performed. Use 10-15 words." The VLM generates natural language instructions with 85-90% accuracy on standard pick-place tasks. A human reviewer corrects the remaining 10-15% in 5-10 seconds per episode (checking is faster than writing from scratch).

Success/failure verification. Send the final frame to the VLM with the prompt: "The robot was asked to [task instruction]. Based on the final scene, was the task completed successfully? Answer yes or no with a confidence score." VLMs achieve 88-93% accuracy on binary success classification for pick-place and stacking tasks, dropping to 75-82% for insertion and assembly tasks where success is visually subtle.

Failure mode categorization. For failed episodes, the VLM can categorize the failure type: "Did the robot (a) fail to grasp the object, (b) drop the object during transport, (c) place the object in the wrong location, or (d) other?" This categorization feeds into the failure analysis that guides targeted data recollection. Accuracy: 80-85% for the four standard categories.

Cost impact: VLM-assisted annotation reduces the per-episode human annotation time by 40-60% for language labels and by 70-80% for binary labels. At current API pricing (GPT-4V at ~$0.01 per image pair, Gemini at ~$0.005), the VLM inference cost is negligible compared to the human time saved. The net effect is a 30-50% reduction in total annotation cost per episode.
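Keeping VLM prompts as code, rather than strings scattered through the pipeline, makes the correction-rate tracking easier. A sketch of the success-verification prompt described above; adapt the return value to whatever message format your VLM API expects:

```python
def build_success_prompt(task_instruction):
    """Assemble the success-verification prompt described in the text.

    The wording mirrors the article; the function itself is illustrative.
    """
    return (
        f"The robot was asked to: {task_instruction}. "
        "Based on the final scene, was the task completed successfully? "
        "Answer yes or no with a confidence score."
    )
```

Versioning these prompt builders alongside the annotation protocol lets you attribute a change in VLM accuracy to a prompt edit rather than a data shift.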

Quality Assurance Metrics: What to Track and What to Target

Annotation quality must be quantified and tracked over time. These are the metrics SVRC monitors for every data collection engagement, with target thresholds based on our experience across 50+ projects.

  • Inter-annotator agreement (Cohen's kappa). For binary labels: target kappa > 0.90 (near-perfect agreement). For language labels: compute semantic similarity between annotators' descriptions using sentence-BERT embeddings; target cosine similarity > 0.85. For phase boundaries: target agreement within +/- 3 frames (100ms at 30fps).
  • Annotation throughput. Track episodes annotated per hour per annotator. Throughput below 80% of the expected rate (based on annotation type complexity) indicates either task ambiguity requiring protocol clarification or annotator fatigue requiring a break.
  • Correction rate. Track the fraction of VLM-generated labels corrected by human reviewers. A correction rate above 20% indicates the VLM prompt needs refinement or the task is too complex for automated annotation. Below 10% suggests human review could be sampled rather than exhaustive.
  • Downstream training impact. The ultimate quality metric: train a policy on the annotated data and measure success rate. If adding more annotations (e.g., phase boundaries) does not improve the policy by at least 3%, the additional annotation cost is not justified for that task.

SVRC Annotation Pipeline

SVRC's data collection service includes annotation as standard. All collected episodes receive: binary success/failure label (automated + human review for borderline cases), language instruction label (protocol-defined per task), four-phase segment boundaries (reach/grasp/transport/place), and contact event timestamps where F/T sensors are present. Additional annotation types (partial credit scores, expanded segment sets, quality scores) are available as add-ons.

All annotation is accompanied by inter-annotator agreement metrics (kappa scores per annotation type) and documented reconciliation records for kappa < 0.8 cases. See our data services page for full annotation specifications.

Annotation Tool Requirements for Robot Data

Off-the-shelf image annotation tools (Labelbox, CVAT, Label Studio) are designed for single-image classification and bounding box tasks. Robot episode annotation has specific requirements that most general tools do not support out of the box.

  • Multi-modal synchronization. The annotation tool must display synchronized video from multiple cameras alongside proprioceptive time series (joint angles, F/T readings) and action signals. Annotators need to see the full context -- a gripper closing event is visible in the wrist camera, proprioceptive data, and sometimes the overhead camera simultaneously.
  • Temporal annotation. Unlike image annotation which labels a single frame, robot annotation requires marking temporal boundaries (phase start/end, contact events) by scrubbing through a video timeline. The tool must support frame-accurate temporal marking at 30-50 fps resolution.
  • Episode-level metadata. Each episode needs structured metadata: success/failure, task instruction, quality score, operator ID, collection conditions. The tool should support customizable metadata schemas that change per task.
  • Batch review workflow. The reviewer should be able to sort episodes by automated quality scores, flag disagreements, and bulk-approve batches that pass automated checks. Without batch workflow support, human review becomes the bottleneck at scale (500+ episodes).

SVRC's annotation platform is purpose-built for robot data and supports all four requirements. For teams building their own annotation tooling, Label Studio with custom frontend extensions is the most flexible open-source starting point -- budget 2-4 weeks of frontend development to add the multi-modal synchronization and temporal marking features.

Related Reading