Planned Diffusion

Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin
UCLA · MIT · Google
*Equal contribution
TL;DR: Planned diffusion speeds up LLM inference by first drafting a short autoregressive plan, then denoising the planned spans of text in parallel.
[Figure: Planned Diffusion overview]

Abstract

A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages: first, the model creates a short autoregressive outline that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed–quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency, delivering a 1.84x speedup over autoregressive generation with only a 6.8% drop in win rate. Our sensitivity analysis confirms that the model's internal planning is reliable and offers tunable control over the speed–quality trade-off.
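To make the two-stage procedure concrete, below is a minimal Python sketch of planned-diffusion inference. It is an illustration under assumptions, not the paper's implementation: the model interface (decode_autoregressive, denoise_step, detokenize, num_diffusion_steps), the <span len=N> plan-tag format, and the thread pool are all hypothetical.

      # Minimal sketch of planned diffusion, assuming a hypothetical `model`
      # with an autoregressive decoding mode (for the plan) and a masked
      # diffusion denoising mode (for the span bodies).
      from concurrent.futures import ThreadPoolExecutor
      import re

      def generate_plan(model, prompt: str) -> str:
          # Stage 1: autoregressively decode a short outline that splits the
          # answer into independent spans, e.g. "<span len=24> ... </span>".
          return model.decode_autoregressive(prompt, stop="</plan>")

      def parse_spans(plan: str) -> list[int]:
          # Extract the declared length of each span from the plan text.
          return [int(n) for n in re.findall(r"<span len=(\d+)>", plan)]

      def denoise_span(model, prompt: str, plan: str, index: int, length: int) -> str:
          # Stage 2 worker: start from `length` mask tokens and iteratively
          # unmask them, conditioned on the prompt and the shared plan.
          tokens = ["<mask>"] * length
          for _ in range(model.num_diffusion_steps):
              tokens = model.denoise_step(prompt, plan, index, tokens)
          return model.detokenize(tokens)

      def planned_diffusion_generate(model, prompt: str) -> str:
          plan = generate_plan(model, prompt)
          lengths = parse_spans(plan)
          # Spans are treated as independent given the plan, so they can be
          # denoised concurrently; this is the source of the speedup.
          with ThreadPoolExecutor() as pool:
              spans = list(pool.map(
                  lambda args: denoise_span(model, prompt, plan, *args),
                  enumerate(lengths),
              ))
          return "".join(spans)

In a real implementation the spans would more likely be packed into a single batch so that each diffusion step denoises every span in one forward pass; the thread pool above only mimics that concurrency.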

BibTeX

@misc{israel2025planneddiffusion,
      title={Planned Diffusion},
      author={Daniel Israel and Tian Jin and Ellie Cheng and Guy Van den Broeck and Aditya Grover and Suvinay Subramanian and Michael Carbin},
      year={2025},
      eprint={2510.18087},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.18087}
}