Diffusion Policy for Robot Learning: What It Is and How to Use It

Diffusion Policy, introduced by Chi et al. in 2023, brought the generative modeling revolution to robot control. By treating action generation as a denoising problem, it handles the multimodal, high-dimensional nature of manipulation behavior in ways that simpler behavioral cloning algorithms cannot. Here is what you need to know to apply it to your own robotics project.

What Is Diffusion Policy?

Diffusion Policy is a class of robot control policies based on denoising diffusion probabilistic models (DDPMs) — the same mathematical framework that underlies text-to-image models like Stable Diffusion. In the robot context, the "image" being generated is a sequence of robot actions (a trajectory). Starting from pure Gaussian noise in action space, the model iteratively denoises it conditioned on the current visual observation and robot state, producing a coherent, high-quality action sequence after 10–100 denoising steps.

The key insight is that diffusion models learn a full probability distribution over actions rather than predicting a single best action. For robotics, this is critical. Human demonstrations of the same task are naturally multimodal: a person might grasp a cup from the left side or the right side depending on subtle contextual cues. A model forced to collapse this distribution to a single prediction will either commit to one mode and fail whenever the other is appropriate, or average the modes and produce a bizarre in-between trajectory that always fails. Diffusion Policy avoids this by modeling the distribution explicitly and sampling from it at inference time.
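The mode-averaging failure is easy to see numerically. In this toy sketch (all numbers hypothetical), demonstrations approach a cup from the left (lateral offset -1) or the right (+1) with equal probability; the MSE-optimal single prediction is the mean, an offset of 0 that drives the gripper straight into the cup, while a sampling policy commits to one valid mode per rollout:

```python
import numpy as np

# Toy 1-D action: lateral approach offset. Demos are bimodal:
# half approach from the left (-1.0), half from the right (+1.0).
demos = np.array([-1.0] * 50 + [1.0] * 50)

# A regression policy trained with MSE converges to the mean action...
mse_optimal = demos.mean()          # 0.0 -- straight into the cup

# ...while a sampling policy (here: draw from the empirical
# distribution, as a stand-in for a diffusion model) commits
# to one valid mode per rollout.
rng = np.random.default_rng(0)
sampled = rng.choice(demos)         # -1.0 or +1.0, never 0.0
```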

Score Matching Theory: How Diffusion Models Learn Actions

The mathematical foundation of Diffusion Policy is score matching — learning the gradient of the log probability density of the data distribution. In the forward process, Gaussian noise is progressively added to action trajectories from the demonstration dataset over T timesteps, following a variance schedule. At step T, the noised trajectory is approximately standard Gaussian noise. The neural network is trained to reverse this process: given a noisy action trajectory and the current observation, predict the noise that was added (or equivalently, the "score" — the gradient of the log density).

The training objective is deceptively simple: a mean-squared-error loss between the predicted noise and the actual noise added during the forward process. Formally, for a noise prediction network epsilon_theta conditioned on observation o:

L = E[||epsilon - epsilon_theta(sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * epsilon, o, t)||^2]

where a_0 is the clean action trajectory, epsilon is sampled Gaussian noise, alpha_bar_t is the cumulative product of the per-step noise schedule parameters up to timestep t, and the expectation is over training examples, noise samples, and timesteps. This objective is equivalent to score matching on the noised distribution at each timestep — the network learns the direction "toward" clean data from any noised state.

What makes this work for robotics is the conditioning mechanism. The observation o — typically image features from a CNN or ViT encoder, plus proprioceptive state — biases the denoising toward action trajectories that are appropriate for the current scene. The same denoising network can produce radically different trajectories for different observations, because the observation features steer the denoising process through different regions of the learned action distribution.
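The forward-noising step and training target above can be sketched in a few lines of numpy. Shapes, the beta schedule, and the zero "prediction" are all illustrative stand-ins, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: a 16-step action horizon with 7-DoF actions (hypothetical).
T = 100                      # diffusion timesteps
horizon, action_dim = 16, 7

# Linear beta schedule; alpha_bar is the cumulative product, as in DDPM.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_noise(a0, t):
    """Noise a clean action trajectory a0 to diffusion step t."""
    eps = rng.standard_normal(a0.shape)
    a_t = np.sqrt(alpha_bar[t]) * a0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return a_t, eps

# Training target: the network should predict eps from (a_t, obs, t);
# the loss is mean squared error between its prediction and eps.
a0 = rng.standard_normal((horizon, action_dim))   # stand-in for a demo chunk
a_t, eps = forward_noise(a0, t=50)
pred = np.zeros_like(eps)                         # stand-in for epsilon_theta(...)
loss = np.mean((pred - eps) ** 2)
```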

U-Net vs Transformer Architectures

The original Diffusion Policy paper (Chi et al., 2023) evaluated two backbone architectures for the denoising network. Understanding the tradeoffs is essential for choosing the right variant for your project.

CNN U-Net Backbone (Diffusion Policy-C)

The U-Net backbone treats the action sequence as a 1D signal and applies a convolutional architecture with skip connections — the same structure used in image diffusion models, but operating on temporal rather than spatial dimensions. The observation features are injected via FiLM (Feature-wise Linear Modulation) conditioning at each U-Net block.

  • Parameters: ~25M for a standard configuration
  • Training time: 4-8 hours on RTX 3090 for 200 episodes
  • Inference (DDPM, 100 steps): ~900ms per action chunk
  • Inference (DDIM, 10 steps): ~15ms per action chunk
  • Strengths: Fast inference, lower GPU memory, well-suited to single-task policies
  • Weaknesses: Limited capacity for multi-task or language-conditioned settings
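The FiLM conditioning used by the U-Net backbone is a per-channel scale and shift predicted from the observation embedding. A minimal sketch, with layer sizes and the random projection weights as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def film(h, obs_feat, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: modulate features h with a per-channel scale (gamma)
    and shift (beta) computed from the observation embedding."""
    gamma = obs_feat @ W_gamma + b_gamma   # (channels,)
    beta = obs_feat @ W_beta + b_beta      # (channels,)
    # h: (channels, horizon) -- broadcast over the temporal axis.
    return gamma[:, None] * h + beta[:, None]

channels, horizon, obs_dim = 64, 16, 128
h = rng.standard_normal((channels, horizon))    # U-Net block activations
obs_feat = rng.standard_normal(obs_dim)         # vision + proprio embedding
W_gamma = rng.standard_normal((obs_dim, channels)) * 0.01
W_beta = rng.standard_normal((obs_dim, channels)) * 0.01
out = film(h, obs_feat, W_gamma, np.ones(channels), W_beta, np.zeros(channels))
```

Because gamma and beta depend only on the observation, the same convolutional weights can denoise toward very different trajectories in different scenes.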

Transformer Backbone (Diffusion Policy-T)

The Transformer backbone treats each action timestep as a token, adding observation tokens and diffusion timestep embeddings to the sequence. Self-attention across the full sequence enables richer temporal modeling of action dependencies.

  • Parameters: ~60-100M depending on configuration
  • Training time: 8-16 hours on RTX 3090 for 200 episodes
  • Inference (DDPM, 100 steps): ~2.5s per action chunk
  • Inference (DDIM, 10 steps): ~45ms per action chunk
  • Strengths: Higher capacity, better multi-task scaling, natural language conditioning via cross-attention
  • Weaknesses: Slower inference, higher GPU memory (needs 16GB+ VRAM), harder to tune

Practical recommendation: Use the U-Net backbone for single-task policies where inference speed matters (real-time control at 10Hz+). Use the Transformer backbone for multi-task policies, language-conditioned settings, or when you have 500+ demonstrations and the capacity to benefit from a larger model. The U-Net variant is the right starting point for 80% of projects.

Inference Time: DDPM vs DDIM vs Consistency Distillation

The inference-time cost of Diffusion Policy is its primary practical limitation. The denoising process requires multiple forward passes through the network, each producing a slightly less noisy action trajectory. The number of steps directly determines both inference latency and action quality.

| Scheduler | Steps | Latency (U-Net, RTX 3090) | Quality vs DDPM-100 | Notes |
| --- | --- | --- | --- | --- |
| DDPM | 100 | ~900ms | Baseline (100%) | Too slow for most real-time control |
| DDIM | 25 | ~40ms | 98-99% | Good default for deployment |
| DDIM | 10 | ~15ms | 95-98% | Recommended for 10Hz+ control |
| Consistency distillation | 1-3 | ~3-5ms | 90-95% | Best for high-frequency control; requires additional training |

The action chunking mechanism mitigates the latency problem: Diffusion Policy predicts a chunk of 16-32 future actions in a single denoising pass. The robot executes the first 8 actions from the chunk at the control frequency (e.g., 10Hz) while the next chunk is being computed. This overlapped execution means the effective control rate is limited by the chunk execution time, not the denoising time — as long as denoising completes before the current chunk runs out.
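The timing argument above can be sketched with toy numbers (chunk sizes, rates, and latency are illustrative, not prescriptive):

```python
# Receding-horizon chunk execution with toy numbers.
CHUNK_LEN = 16        # actions predicted per denoising pass
EXEC_LEN = 8          # actions executed before re-planning
CONTROL_HZ = 10.0     # robot control rate
DENOISE_S = 0.04      # assumed DDIM denoising latency in seconds

# Time budget: the next chunk must be ready before the current
# EXEC_LEN buffered actions run out.
exec_window_s = EXEC_LEN / CONTROL_HZ          # 0.8 s of buffered actions
realtime_ok = DENOISE_S < exec_window_s        # denoising fits in the window

def rollout(num_steps):
    """Count denoising passes needed to execute num_steps actions."""
    executed, passes = 0, 0
    while executed < num_steps:
        passes += 1                 # denoise one CHUNK_LEN-action chunk
        executed += EXEC_LEN        # execute its first EXEC_LEN actions
    return passes

# 100 control steps need ceil(100 / 8) = 13 denoising passes.
```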

Why Diffusion Policy Outperforms Standard Behavioral Cloning

Standard behavioral cloning (BC) trains a policy as a supervised regression problem: given observation, predict action. This works when the mapping from observations to actions is deterministic and unimodal. In practice, manipulation tasks rarely are. Even "simple" tasks like picking a block off a table involve multiple valid approach angles, grasp poses, and pre-grasp configurations. Naive BC produces policies that hesitate at decision points, make compromised motion choices, or fail outright when the test distribution differs slightly from training.

Diffusion Policy consistently outperforms BC baselines on benchmark manipulation suites. In the original paper, it outperformed prior state-of-the-art methods across 12 tasks drawn from 4 simulation benchmarks, including Robomimic, with particularly large margins on tasks with high action multimodality. On real-robot evaluations, Diffusion Policy demonstrated more robust recovery behavior — when the robot reached a slightly wrong intermediate state, the policy could recover because it was sampling from a broad distribution rather than following a deterministic path.

Benchmark Results: Robomimic and RoboSuite

| Task (Robomimic) | BC (MLP) | BC-RNN | ACT | Diffusion Policy |
| --- | --- | --- | --- | --- |
| Lift | 78% | 96% | 98% | 100% |
| Can | 54% | 82% | 90% | 96% |
| Square | 18% | 56% | 72% | 88% |
| Transport (bimanual, long-horizon) | 6% | 24% | 48% | 62% |

The margins are largest on multimodal tasks (Square, Transport) where multiple valid strategies exist. On unimodal tasks (Lift), the advantage is smaller because all baselines can find the single correct strategy.

When to Choose Diffusion Policy vs ACT

Compared to ACT (Action Chunking with Transformers), Diffusion Policy generally performs better on tasks with strong multimodality and worse on tasks with long-horizon temporal dependencies, where ACT's chunked prediction shines. Here is a decision framework:

| Choose Diffusion Policy When | Choose ACT When |
| --- | --- |
| Multiple valid grasp strategies exist for each scene | Task has a single dominant strategy |
| You have 300+ demonstrations and want to leverage data scale | You have 50-150 demonstrations and need fast iteration |
| Recovery from perturbations is important | Temporal consistency over long horizons matters more |
| A control rate of 10Hz is sufficient | You need 50Hz+ control frequency |
| Single-arm manipulation with variable approach | Bimanual coordination requiring tight temporal sync |

In practice, both algorithms are competitive enough that dataset quality and quantity matter more than the policy architecture choice. If you are unsure which to use, try ACT first for speed of iteration, then Diffusion Policy if you observe mode-averaging failures.

Data Requirements for Diffusion Policy

Diffusion Policy benefits from more data than ACT, primarily because the denoising network has more parameters and a richer modeling objective. A practical minimum is 100-200 demonstrations for a single task under controlled conditions. To achieve robust deployment performance — handling object position variation, lighting changes, and occasional sensor noise — budget 300-500 demonstrations per task. Unlike ACT, Diffusion Policy tends to continue improving with additional data up to quite large dataset sizes, making it the better choice if you plan to invest in a large-scale data collection effort.

Data diversity is as important as volume. Demonstrations should span the range of object positions, orientations, and scene configurations you expect in deployment. A tight cluster of demonstrations with objects always in exactly the same place will produce a policy that fails the moment an object is moved by a few centimeters. SVRC's managed data collection service follows structured variation protocols — systematically randomizing object positions, lighting conditions, and operator grip styles — to ensure datasets that produce generalizable policies.

The observation representation also matters significantly. Diffusion Policy with a ResNet image encoder trained end-to-end generally outperforms policies using frozen pre-trained encoders on narrow task distributions, but pre-trained encoders (R3M, MVP, DINO) produce better generalization when test conditions differ from training. For most practical projects, start with a pre-trained encoder to maximize the value of your dataset, and switch to end-to-end training only if you have 500+ demonstrations and a stable environment.

Training Setup and Compute Requirements

The reference implementation of Diffusion Policy (available from the real-stanford GitHub organization) supports both the U-Net backbone (faster inference, lower capacity) and the Transformer backbone (slower inference, higher capacity). For most single-task projects, the U-Net variant is the right starting point. Training on a single RTX 3090 or 4090 takes 4-12 hours for a 200-episode dataset, depending on observation resolution and action horizon length.

Key hyperparameters to set correctly: the action horizon (how many future steps to predict — typically 16-32 for tabletop tasks), the number of diffusion steps (100 for DDPM; 10-25 for DDIM with minimal quality loss), and the observation window (how many past frames to include — typically 2). Tune one at a time while holding the others fixed. The most impactful change for improving policy performance is usually increasing the dataset size, not tuning architecture hyperparameters.

Quick-Start Training Commands

# Clone the reference implementation
git clone https://github.com/real-stanford/diffusion_policy.git
cd diffusion_policy
pip install -e .

# Train the U-Net variant on your dataset (ZARR format)
python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn \
    task.dataset_path=/path/to/your/dataset.zarr \
    training.num_epochs=3000 \
    policy.noise_scheduler.num_train_timesteps=100 \
    policy.horizon=16 \
    policy.n_obs_steps=2

# Evaluate with DDIM inference
python eval.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn \
    policy.noise_scheduler._target_=diffusers.DDIMScheduler \
    policy.num_inference_steps=10

For inference on a real robot, DDPM at 100 steps is typically too slow for high-frequency control. Use the DDIM scheduler with 10-25 steps, which runs at ~20Hz on an RTX 3090 — adequate for 10Hz control with a buffer. Alternatively, consistency policy distillation can achieve 1-3 step inference with minimal performance degradation for simpler tasks.
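The DDIM speedup comes from visiting only a short subsequence of the training timesteps (here 10 of 100) and taking deterministic jumps between them. A minimal sketch with a stand-in noise-prediction function (a real policy would run the trained, observation-conditioned network instead):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(a_t, t):
    """Stand-in for the trained noise-prediction network."""
    return 0.1 * a_t

def ddim_sample(shape, num_steps=10):
    """Deterministic DDIM sampling (eta = 0) over a step subsequence."""
    ts = np.linspace(T - 1, 0, num_steps).round().astype(int)
    a = rng.standard_normal(shape)          # start from pure noise
    for i, t in enumerate(ts):
        eps = eps_model(a, t)
        # Predict the clean trajectory from the current noisy one.
        a0 = (a - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Jump directly to the next (earlier) timestep in the subsequence.
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < num_steps else 1.0
        a = np.sqrt(ab_prev) * a0 + np.sqrt(1 - ab_prev) * eps
    return a

actions = ddim_sample((16, 7))              # one 16-step, 7-DoF action chunk
```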

Using SVRC Data Services for Diffusion Policy

SVRC's data services pipeline produces datasets formatted for direct use with the Diffusion Policy reference implementation and the HuggingFace LeRobot framework. Episodes are stored as ZARR archives with synchronized image streams, proprioceptive state, and actions at 50Hz. Quality filtering removes episodes where the task was not completed successfully, the robot collided with the environment, or operator hesitation produced non-representative trajectories.

Our collection service uses the SVRC teleoperation platform with dual-arm capable leader-follower control, wrist-mounted and overhead cameras, and optional force-torque logging. For multi-task Diffusion Policy training — where a single policy learns multiple tasks conditioned on task ID or language — we can collect across task variants within the same campaign and deliver a unified dataset. Pilot programs start at $2,500 for 200 demonstrations; full campaigns for 500+ demonstrations start at $8,000.

Teams working with the OpenArm 101 ($4,500) or ALOHA hardware platforms get native hardware support; custom hardware integration is available on request. For teams that want to evaluate Diffusion Policy before investing in data collection, our public datasets include several hundred-episode manipulation datasets in ZARR format ready for training. Contact our team to discuss your data requirements and timeline.

Related Reading