Recent works have shown the promise of inference-time search over action samples for improving generative robot policies. In particular, optimizing cross-chunk coherence via bidirectional decoding has proven effective in boosting the consistency and reactivity of diffusion policies. However, this approach becomes computationally expensive as the diversity of sampled actions grows. In this paper, we introduce self-guided action diffusion, a more efficient variant of bidirectional decoding tailored to diffusion-based policies. The core of our method is to guide the proposal distribution at each diffusion step using the policy's prior decision. Experiments on simulation tasks show that the proposed self-guidance enables near-optimal performance at negligible inference cost. Notably, under a tight sampling budget, our method achieves up to 70% higher success rates than existing counterparts on challenging dynamic tasks.
Self-Guided Action Diffusion (Self-GAD) enhances diffusion-based robot policies by injecting a guidance signal during inference. At each denoising step, we apply a lightweight update that encourages actions to stay close to prior predictions while allowing adaptation.
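A plausible form of this update, written in our notation as a sketch rather than the exact formulation: at denoising step $k$, the intermediate action chunk $a^{k}$ takes a gradient step on a prior-consistency loss,

$$
a^{k} \;\leftarrow\; a^{k} - \eta \, \nabla_{a^{k}} \, \mathcal{L}\!\left(a^{k}, \hat{a}^{\mathrm{prior}}\right),
$$

where $\hat{a}^{\mathrm{prior}}$ is the action chunk predicted at the previous control step, shifted to align in absolute time, and $\eta$ is a guidance step size.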
The loss penalizes deviations from the prior trajectory with an exponentially decaying weight.
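As a sketch in the same notation (the horizon indexing and exact weighting are our assumptions), let $H$ be the action horizon and $s$ the number of actions executed since the prior chunk was predicted; an exponentially decayed loss then reads

$$
\mathcal{L}\!\left(a, \hat{a}^{\mathrm{prior}}\right) \;=\; \sum_{t=0}^{H-s-1} \beta^{\,t} \left\lVert a_{t} - \hat{a}^{\mathrm{prior}}_{t+s} \right\rVert_{2}^{2},
$$

where $\beta \in (0, 1)$ controls how quickly the prior's influence decays: early actions, where the previous plan is still most relevant, are constrained strongly, while later actions remain free to adapt. This $\beta$ is the prior-relevance weight revisited in the dynamic-environment experiments below.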
Further, we integrate Self-GAD into GR00T-N1, a robotic foundation model built on a flow-matching Diffusion Transformer. This plug-in method improves closed-loop performance, particularly under persistent noise and shifting targets that challenge standard open-loop diffusion policies.
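To make the mechanism concrete, the sketch below shows how such a guidance hook could sit inside a generic denoising loop. This is a minimal sketch under assumed interfaces: `policy.denoise_step`, the chunk layout, and the hyperparameters `eta` and `beta` are hypothetical placeholders, not the GR00T-N1 API.

```python
import torch

def self_guided_sample(policy, obs, prior_actions, shift,
                       num_steps=10, eta=0.1, beta=0.5):
    """Sample an action chunk, guiding each denoising step toward the
    previously predicted chunk (hypothetical interface, not GR00T-N1 API)."""
    a = torch.randn_like(prior_actions)        # start from pure noise
    overlap = prior_actions.shape[0] - shift   # timesteps shared with the prior
    # Exponentially decaying weights over the overlapping horizon.
    w = beta ** torch.arange(overlap, dtype=a.dtype)

    for k in range(num_steps):
        # One ordinary denoising / flow-integration step of the base policy.
        a = policy.denoise_step(a, obs, step=k)

        # Self-guidance: one gradient step on the prior-consistency loss.
        a = a.detach().requires_grad_(True)
        diff = a[:overlap] - prior_actions[shift:]
        loss = (w[:, None] * diff.pow(2)).sum()
        (grad,) = torch.autograd.grad(loss, a)
        a = (a - eta * grad).detach()
    return a
```

On the very first control step there is no prior chunk, so the guidance term would simply be skipped and the base sampler used unmodified.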
We evaluate Self-GAD against baseline diffusion policies under a single-sample closed-loop control strategy. Self-GAD outperforms Random sampling, achieving an average success rate 71.4% higher across all RoboMimic benchmarks.
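For clarity, the single-sample closed-loop protocol can be sketched as follows, reusing the hypothetical `self_guided_sample` above; `env` is assumed to follow a gym-style interface, and `exec_h` (actions executed per replanning step) is likewise our placeholder.

```python
def closed_loop_rollout(env, policy, exec_h=1, max_steps=400):
    """Single-sample closed-loop control: replan every `exec_h` steps,
    passing the previously predicted chunk as the guidance prior."""
    obs = env.reset()
    prior = None
    for _ in range(max_steps // exec_h):
        if prior is None:
            actions = policy.sample(obs)   # unguided first chunk (hypothetical API)
        else:
            actions = self_guided_sample(policy, obs, prior, shift=exec_h)
        for action in actions[:exec_h]:    # execute only a short prefix
            obs, reward, done, info = env.step(action)
            if done:
                return reward
        prior = actions                    # current chunk guides the next one
    return 0.0
```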
We integrate Self-GAD into GR00T-N1-2B, a large-scale robotic foundation model using flow matching and multimodal embeddings. We fine-tune GR00T-N1-2B on 100 demonstrations per task in single-action-horizon settings (PnP Counter to Cab, Turn Stove On, Turn Sink Faucet On, Turn Microwave Off, Coffee, and Transport). Integrated with the foundation model, Self-GAD boosts success rates in both RoboCasa and DexMG, by 28.4% and 12%, respectively.
Compared to coherence sampling, Self-GAD achieves high performance with fewer samples. On PushT, Self-GAD reaches near-optimal performance with a single sample, matching what coherence sampling attains with 16 samples.
We evaluate Self-GAD in noisy environments and across varied training datasets. The method consistently generalizes under distribution shift and environmental perturbations. On the RoboMimic Square task, guidance improves consistency in single-sample rollouts, with benefits increasing in high-variance settings as action diversity grows. In dynamic PushT environments, Self-GAD significantly boosts closed-loop performance, particularly under high variability.
Self-GAD reduces the performance gap between static and dynamic environments. Here, we confirm that a fine-tuned β weight governing prior relevance can be reused across dynamic settings.