Why Single-Modality Sensing Falls Short
Vision-only manipulation fails at contact. Once the robot's fingers are on an object, they often occlude the contact zone from the camera. Visual policies that work well during approach break down during the precision contact phase of tasks — insertion, stable grasping, surface following — because the information they need is now behind the robot's own hand.
Force-only control (impedance control, force servo) lacks spatial information. It can detect when contact is made and regulate contact force, but cannot tell you whether you are touching the right surface, whether the grasp pose is stable, or whether the object has moved.
Tactile sensing alone provides rich contact geometry information but has limited spatial range — you have to already be at the contact point. It is the highest-bandwidth, shortest-range modality.
The combination of all three covers each modality's failure modes and produces policies that are substantially more robust than any single modality alone.
Sensor Modality Comparison
| Modality | Range | Bandwidth | Spatial Resolution | Failure Mode | Cost |
|---|---|---|---|---|---|
| RGB camera | 0.3-5m | 30-200 Hz | Sub-pixel (0.1-0.5mm) | Lighting, occlusion, specular | $100-1,200 |
| Depth (structured light) | 0.1-3m | 30-90 Hz | 1-5mm depth error | Transparent, outdoor, multi-cam interference | $200-600 |
| Wrist 6-axis F/T | Contact only | 1,000-7,000 Hz | N/A (single point) | No spatial info, measures sum of contacts | $1,500-8,000 |
| Tactile array | Contact only | 100-500 Hz | 1-3mm per taxel | Limited area, wear, calibration drift | $300-5,000 |
| IMU (wrist-mounted) | Local motion | 200-1,000 Hz | N/A | Drift, bias, limited to acceleration/rotation | $5-50 |
| Audio (contact microphone) | 0-2m | 16-48 kHz | N/A | Background noise, non-contact states | $10-100 |
Hardware Recommendations by Modality
Vision (RGB + Depth)
Third-person overhead: Intel RealSense D435i ($250). The standard for tabletop manipulation research. 640x480 at 30fps depth + RGB. See our camera setup guide for mount positions and calibration.
Close-range wrist camera: Intel RealSense D405 ($300) for depth at 5-50cm range, or FLIR Blackfly S BFS-U3-16S2C ($450) for global-shutter RGB at up to 226fps. The D405 is optimized for close-range depth — critical for hand-eye coordination during the approach phase.
When to add stereo: If you need depth at range (1-5m) and structured-light interference is a concern (multiple cameras or outdoor), use a stereo camera like the ZED 2 ($450). Stereo depth works in sunlight and does not interfere with other depth cameras.
Force/Torque
Budget option: Robotiq FT 300 ($1,500-3,000). 100 Hz, +/-0.1N precision. Sufficient for general manipulation, grasping force control, and collision detection. Integrates directly with Universal Robots arms.
Research option: ATI Mini45 ($5,000-8,000). 7 kHz, +/-0.05N precision. The gold standard for precision insertion and assembly tasks. Required for tasks with sub-millimeter tolerances. See our contact forces guide for control modes.
Software-only option: Joint torque estimation from the robot's dynamics model. Free, always available, but limited to +/-0.5Nm precision. Useful for collision detection and coarse contact sensing. Available on Franka (built-in joint torque sensors), UR (external torque estimation), and most modern arms.
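The residual approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: compare measured joint torques against the dynamics model's prediction and flag contact when the residual exceeds the assumed ~0.5 Nm precision floor.

```python
import numpy as np

def external_torque(tau_measured, tau_model, threshold_nm=0.5):
    """Estimate external joint torque as the residual between measured
    torques and the dynamics model's prediction (gravity, Coriolis,
    inertial terms). Residuals above the threshold are treated as
    contact or collision. Threshold is illustrative."""
    residual = np.asarray(tau_measured, float) - np.asarray(tau_model, float)
    in_contact = bool((np.abs(residual) > threshold_nm).any())
    return residual, in_contact

# Free motion: measured torques match the model within noise.
_, contact = external_torque([1.2, -0.4, 0.1], [1.15, -0.38, 0.12])
assert not contact

# A ~2 Nm residual on joint 1 indicates unexpected contact.
_, contact = external_torque([3.2, -0.4, 0.1], [1.15, -0.38, 0.12])
assert contact
```

This coarse signal is enough for collision stops, but not for the fine force regulation that a wrist F/T sensor provides.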
Tactile
Optical tactile (GelSight Mini): $300-600. Provides a high-resolution contact image (640x480 of the contact patch) at 30-60 Hz. Best for: texture recognition, object identification from touch, and fine-grained slip detection. Limitation: bulky form factor makes it difficult to integrate into dexterous hands.
Capacitive array (XELA uSkin): $1,500-5,000. Provides 3-axis force per taxel at 100-500 Hz. Best for: slip detection, contact localization, and real-time force distribution monitoring. Better form factor for dexterous hand integration.
Paxini tactile sensor: $500-1,500. Distributed normal + shear force at 1mm spatial resolution, 200 Hz. SVRC stocks Paxini sensors — see our hardware catalog. Good balance of resolution, speed, and form factor for gripper integration.
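Slip detection with a 3-axis tactile array typically reduces to a friction-cone check per taxel: when the shear-to-normal force ratio approaches the friction coefficient, the contact is about to slip. A minimal sketch, with an assumed friction coefficient of 0.8 and an illustrative safety margin:

```python
import numpy as np

def slip_risk(forces_xyz, mu=0.8, margin=0.9):
    """Per-taxel Coulomb friction-cone check on 3-axis tactile readings.

    forces_xyz: (N, 3) array of per-taxel (shear_x, shear_y, normal) forces.
    Returns a boolean mask of taxels whose shear-to-normal ratio exceeds
    margin * mu, i.e. taxels near the friction limit."""
    f = np.asarray(forces_xyz, dtype=float)
    shear = np.linalg.norm(f[:, :2], axis=1)
    normal = np.maximum(f[:, 2], 1e-6)  # avoid division by zero
    return (shear / normal) > margin * mu

taxels = np.array([
    [0.1, 0.0, 1.0],   # firm grip: ratio 0.1
    [0.7, 0.4, 1.0],   # ratio ~0.81, at the mu=0.8 limit
])
risk = slip_risk(taxels)
# → [False, True]
```

A controller would respond to a True taxel by increasing grip force or re-grasping before macroscopic slip occurs.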
IMU
A wrist-mounted IMU (BNO055 or ICM-42688-P, $5-50) provides high-frequency acceleration and orientation data. Useful for: detecting impact events during manipulation, measuring vibration signatures during tool use, and providing high-frequency motion feedback when camera frame rates are insufficient. The IMU is the cheapest sensor to add and the easiest to integrate — it requires no calibration relative to other sensors and communicates over I2C or SPI.
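Impact detection from the IMU stream can be as simple as thresholding the acceleration magnitude. A sketch under illustrative assumptions (1 kHz sample rate, 3 g threshold — not tuned for any particular IMU):

```python
import numpy as np

def detect_impacts(accel_mps2, rate_hz=1000.0, threshold_g=3.0):
    """Return the times (in seconds) of samples whose acceleration
    magnitude exceeds threshold_g. Thresholds are illustrative; a real
    system would also high-pass filter to reject slow arm motion."""
    mag = np.linalg.norm(np.asarray(accel_mps2, dtype=float), axis=1)
    hits = np.flatnonzero(mag > threshold_g * 9.81)
    return hits / rate_hz

# Quiet motion near 1 g, with a 5 g spike at sample 500 (t = 0.5 s).
accel = np.tile([0.0, 0.0, 9.81], (1000, 1))
accel[500] = [0.0, 0.0, 5 * 9.81]
times = detect_impacts(accel)
# → [0.5]
```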
Audio
Contact microphones ($10-100) mounted on the gripper or wrist capture acoustic signatures of contact events. Recent work (Gandhi and Pinto, 2020; Clarke et al., 2023) has shown that audio provides useful contact information: distinguishing material types (metal vs. plastic vs. wood) from tap sounds, detecting the moment of stable grasp closure, and identifying assembly success from click sounds. Audio is the most underused modality in current manipulation systems — easy to add, lightweight, and surprisingly informative.
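A first-pass contact-event detector on the microphone stream is just short-time energy thresholding. A minimal sketch with illustrative parameters (a production system would adapt the threshold to the background-noise floor):

```python
import numpy as np

def contact_events(audio, sr=16000, frame=256, threshold=0.05):
    """Return onset times (seconds) of frames whose RMS energy exceeds
    a fixed threshold. Frame size and threshold are illustrative."""
    n = len(audio) // frame
    frames = np.asarray(audio[: n * frame], dtype=float).reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return np.flatnonzero(rms > threshold) * frame / sr

# Silence with a short click ~0.5 s in.
sig = np.zeros(16000)
sig[8000:8064] = 0.5
events = contact_events(sig)  # one event near t = 0.5 s
```

Material classification from tap sounds requires spectral features rather than raw energy, but the event-detection stage above is usually the first component either way.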
Sensor Fusion Approaches
Early fusion: Concatenate all sensor feature vectors into a single input to the policy network. Simple to implement. Requires all sensors to be present at both training and deployment time — if any sensor fails or is not available, the policy cannot operate. Appropriate when sensor availability is guaranteed and you want the simplest possible implementation.
Late fusion: Train independent encoders for each modality, then combine at a bottleneck layer with attention or concatenation before the action head. More robust to missing sensors — you can mask out a modality's contribution during deployment if needed. Recommended for production systems where sensor failure is a real possibility.
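The masking mechanism that makes late fusion robust to sensor dropout can be sketched with stand-in encoders (in practice these would be small neural networks; dimensions and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Stand-in for a learned per-modality encoder."""
    return np.tanh(x @ w)

# Random projection weights standing in for trained encoders:
# vision (512-d), F/T wrench (6-d), tactile (48-d), all to 32-d features.
w_vis, w_ft, w_tac = (rng.normal(size=(d, 32)) for d in (512, 6, 48))

def fuse(vis, ft, tac, mask=(1, 1, 1)):
    """Late fusion: encode each modality independently, zero out missing
    modalities via the mask, then concatenate for the action head.
    Masking at this bottleneck is what lets the policy keep running
    when a sensor fails."""
    feats = [encode(vis, w_vis), encode(ft, w_ft), encode(tac, w_tac)]
    return np.concatenate([m * f for m, f in zip(mask, feats)])

vis, ft, tac = rng.normal(size=512), rng.normal(size=6), rng.normal(size=48)
z_full = fuse(vis, ft, tac)
z_no_ft = fuse(vis, ft, tac, mask=(1, 0, 1))   # F/T sensor offline
assert z_full.shape == (96,)
assert np.all(z_no_ft[32:64] == 0)             # masked modality contributes nothing
```

Training with random modality dropout (randomly zeroing masks) teaches the action head not to rely on any single modality being present.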
Cross-attention transformer fusion: Treat each sensor's output as a sequence of tokens and use transformer cross-attention to let modalities attend to each other. This is the architecture used in recent foundation models (OpenVLA uses visual tokens + language tokens with cross-attention). More expressive than late fusion but requires more data to train effectively.
Hierarchical fusion: Use different modalities at different stages of the task. Vision for approach planning (long range), force for contact detection and compliance (medium range), tactile for fine manipulation (contact). This is not fusion in the neural network sense — it is a state machine that switches between modality-specific controllers based on the task phase. Simpler to implement and debug than learned fusion, and often outperforms learned approaches when the task phases are clearly separable.
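The state machine version of hierarchical fusion is straightforward to implement. A sketch with hypothetical phase names and illustrative thresholds (1 N contact force, 2 mm hand-off to tactile servoing):

```python
def select_controller(phase, dist_to_goal_mm, ft_normal_n):
    """Hierarchical fusion as a phase machine: vision for approach,
    force compliance once contact is sensed, tactile servoing for the
    final fine-manipulation phase. Returns (next_phase, controller).
    Thresholds are illustrative placeholders."""
    if phase == "approach":
        if ft_normal_n > 1.0:           # contact detected by F/T sensor
            return "contact", "force_compliance"
        return "approach", "visual_servo"
    if phase == "contact":
        if dist_to_goal_mm < 2.0:       # close enough for tactile servoing
            return "fine", "tactile_servo"
        return "contact", "force_compliance"
    return "fine", "tactile_servo"

phase = "approach"
phase, ctrl = select_controller(phase, 50.0, 0.0)   # free space: visual_servo
phase, ctrl = select_controller(phase, 10.0, 3.5)   # contact: force_compliance
phase, ctrl = select_controller(phase, 1.5, 3.5)    # near goal: tactile_servo
```

The debuggability advantage is concrete: each transition is an explicit, loggable condition rather than an opaque learned weighting.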
ROS2 Topic Synchronization
The most common implementation challenge in multimodal systems is time synchronization. Sensors run at different frequencies (camera at 30 Hz, F/T at 1 kHz, tactile at 200 Hz) and have different latencies. ROS2 provides message_filters for approximate time synchronization.
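A minimal wiring sketch using message_filters' ApproximateTimeSynchronizer. Topic names and message types here are placeholder assumptions, and the node requires a running ROS2 installation:

```python
import rclpy
from rclpy.node import Node
import message_filters
from sensor_msgs.msg import Image
from geometry_msgs.msg import WrenchStamped

class MultimodalSync(Node):
    def __init__(self):
        super().__init__('multimodal_sync')
        img = message_filters.Subscriber(self, Image, '/camera/color/image_raw')
        ft = message_filters.Subscriber(self, WrenchStamped, '/ft_sensor/wrench')
        # slop = period of the slowest sensor (33 ms for a 30 Hz camera)
        self.sync = message_filters.ApproximateTimeSynchronizer(
            [img, ft], queue_size=30, slop=0.033)
        self.sync.registerCallback(self.on_pair)

    def on_pair(self, img_msg, wrench_msg):
        # Called only when the two headers' stamps agree within `slop`.
        self.get_logger().info('synced image/wrench pair')

def main():
    rclpy.init()
    rclpy.spin(MultimodalSync())
```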
The key synchronization parameter is slop, the maximum allowed time difference between grouped messages. Set it to the period of your slowest sensor (33ms for a 30 Hz camera). For data collection, also record each sensor's raw timestamps so you can verify synchronization quality post hoc. A synchronization error greater than 10ms between vision and force data can degrade policy training: the policy learns incorrect correlations between what it sees and what it feels.
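The post-hoc verification step can be sketched as a nearest-timestamp comparison over the recorded raw stamps (function name and thresholds are illustrative):

```python
import numpy as np

def sync_error_stats(t_cam, t_ft):
    """For each camera timestamp, find the nearest F/T timestamp
    (t_ft must be sorted) and report the mean and max absolute offset.
    Run this on recorded raw stamps after a collection session to
    verify synchronization quality."""
    t_cam, t_ft = np.asarray(t_cam, float), np.asarray(t_ft, float)
    idx = np.clip(np.searchsorted(t_ft, t_cam), 1, len(t_ft) - 1)
    nearest = np.where(np.abs(t_ft[idx] - t_cam) < np.abs(t_ft[idx - 1] - t_cam),
                       t_ft[idx], t_ft[idx - 1])
    err = np.abs(nearest - t_cam)
    return err.mean(), err.max()

# 30 Hz camera vs. 1 kHz F/T stream with a constant 4 ms offset.
t_cam = np.arange(0, 1, 1 / 30) + 0.004
t_ft = np.arange(0, 1, 1 / 1000)
mean_err, max_err = sync_error_stats(t_cam, t_ft)
assert max_err < 0.010   # within the 10 ms budget discussed above
```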
Calibration Requirements
| Sensor | Calibration Type | Frequency | Time Required | Critical Error If Skipped |
|---|---|---|---|---|
| RGB camera (fixed) | Intrinsic + extrinsic (hand-eye) | Monthly or after any adjustment | 15-30 min | 5-20mm position error |
| Wrist camera | Eye-in-hand calibration | Monthly | 20 min | 2-10mm position error |
| Wrist F/T sensor | Bias removal (tare) + tool weight | Daily (tare) / monthly (full) | 2 min (tare) / 15 min (full) | 1-5N force bias error |
| Tactile array | Flat-surface zero + known-force check | Weekly | 10 min | 10-30% force reading error |
| IMU | None required (auto-calibrates) | N/A | 0 min | Minimal (bias drift is auto-corrected) |
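The daily F/T tare plus tool-weight step amounts to subtracting a measured static bias and the tool's gravity load rotated into the sensor frame. A minimal sketch (torque compensation from the tool's center of mass is omitted for brevity; masses and rotations are illustrative):

```python
import numpy as np

def compensate(wrench, bias, tool_mass_kg, r_world_to_sensor):
    """Subtract the tared bias and the tool's gravity load (rotated into
    the sensor frame) from a raw 6-axis wrench [fx, fy, fz, tx, ty, tz].
    Torque compensation for the tool's center of mass is omitted."""
    g_world = np.array([0.0, 0.0, -9.81 * tool_mass_kg])
    f_tool = r_world_to_sensor @ g_world
    out = np.asarray(wrench, dtype=float).copy()
    out[:3] -= bias[:3] + f_tool
    out[3:] -= bias[3:]
    return out

# Sensor frame aligned with world, 0.3 kg tool hanging in free space:
# the raw reading is pure bias plus tool weight, so compensation yields zero.
bias = np.array([0.2, -0.1, 0.5, 0.0, 0.0, 0.0])
raw_free = bias + np.array([0.0, 0.0, -9.81 * 0.3, 0.0, 0.0, 0.0])
comp = compensate(raw_free, bias, 0.3, np.eye(3))
assert np.allclose(comp, 0.0)
```

Skipping this step is exactly how the 1-5N bias error in the table above ends up baked into your demonstrations.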
Power and Weight Budget
Every sensor added to the robot arm increases the wrist payload and power draw. For arms with limited payload capacity (e.g., OpenArm 101 at 1kg payload, ViperX 300 at 750g), the sensor weight budget is tight:
| Component | Weight | Power | Notes |
|---|---|---|---|
| RealSense D405 (wrist) | 52g | 1.5W (USB) | Lightest depth camera for wrist |
| ATI Mini45 F/T | 92g | 2.5W | Plus 50g for cable/connector |
| GelSight Mini | 35g per finger | 1W (USB) | Adds width to finger — check gripper clearance |
| XELA uSkin pad | 15g per pad | 0.5W | Thinnest tactile option |
| Paxini tactile sensor | 20g per pad | 0.5W | Best resolution/weight ratio |
| IMU (BNO055 breakout) | 3g | 0.01W | Negligible weight and power |
| Contact microphone | 5g | 0.01W | Negligible weight and power |
A full multimodal wrist stack (D405 + ATI Mini45 + 2x GelSight Mini) weighs approximately 265g including F/T cabling and draws about 6W per the table above. This is within budget for a Franka Research 3 (3kg payload) but exceeds the payload of an OpenArm 101 (1kg) when combined with a gripper. Plan your sensor stack before selecting your arm, or choose sensors based on your arm's remaining payload after the gripper.
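The bookkeeping above is worth scripting so it runs whenever the stack changes. A trivial sketch using the table's weights, with an assumed ~700g gripper:

```python
def remaining_payload_g(arm_payload_g, gripper_g, sensor_weights_g):
    """Remaining wrist payload after gripper and sensors, in grams."""
    return arm_payload_g - gripper_g - sum(sensor_weights_g)

# D405, Mini45 + cable, 2x GelSight Mini (weights from the table above).
stack = [52, 92 + 50, 35, 35]

# Franka Research 3 (3 kg payload), assumed 700 g gripper: margin remains.
assert remaining_payload_g(3000, 700, stack) == 2036

# OpenArm 101 (1 kg payload): the same stack leaves almost nothing for a gripper.
assert remaining_payload_g(1000, 700, stack) < 100
```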
Practical Implementation Recommendation
- Starting point — vision only: Overhead camera + wrist camera. This configuration is sufficient for 70-80% of manipulation tasks and requires no force or tactile hardware.
- Contact-rich tasks: Add wrist F/T sensor. The 6-axis wrench provides contact detection and force regulation that dramatically improves insertion and assembly task performance.
- Dexterous manipulation: Add tactile sensing (GelSight or capacitive array). The high-resolution contact geometry enables in-hand re-grasping and slip detection.
- Full multimodal: Vision + F/T + tactile + IMU + audio. Only necessary for research on multimodal learning or for tasks where every available signal helps (e.g., autonomous surgical robotics, electronics assembly).
- Do not add modalities preemptively: Each additional modality adds data collection complexity (you need demonstrations with all sensors recording), annotation overhead, and training complexity. Add sensors when you hit a specific failure mode that they address.
Training Data Implications
Multimodal policies require demonstrations with all sensor modalities recorded simultaneously. This means your data collection setup must have all sensors calibrated and logging synchronously at the time of collection. Retrofitting force or tactile data to existing vision-only demonstrations is not possible — you need to re-collect.
One practical strategy: collect all demonstrations with the full sensor suite, then train ablated models (vision-only, vision+force, full multimodal) and evaluate each. This tells you the marginal value of each sensing modality for your specific task, which informs future data collection investments.
SVRC's data collection setup supports multi-sensor recording with synchronized timestamping across RGB, depth, wrist camera, wrist F/T, and tactile sensors. Our standard workstation configuration (OpenArm 101 or DK1 bimanual) records all modalities at 50 Hz with sub-10ms synchronization. Custom sensor integrations (Paxini tactile, GelSight, audio) available on request. Pilot data collection starts at $2,500; full campaigns at $8,000.