A new paradigm for large language modeling based on diffusion
What is now proved was once only imagined.
We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.
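In symbols, this core principle is the usual maximum-likelihood objective, which is equivalent to minimizing a KL divergence and says nothing about whether \(p_\theta\) is factorized autoregressively or parameterized by a diffusion process:

\[
\max_\theta \, \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log p_\theta(x) \big]
\;\Longleftrightarrow\;
\min_\theta \, \mathrm{KL}\big( p_{\text{data}}(x) \,\|\, p_\theta(x) \big).
\]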
We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.
LLaDA is a masked diffusion model [1, 2, 3] that follows standard pretraining and SFT pipelines while sampling via diffusion. During pretraining, it samples \( t \sim U[0,1] \) and masks each token independently with probability \( t \); during SFT, only response tokens may be masked. At inference, the model simulates the reverse diffusion process from fully masked (\(t = 1\)) to fully unmasked (\(t = 0\)), predicting all masked tokens simultaneously at each step and flexibly remasking a fraction of them.
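A minimal sketch of this pretraining objective and sampling loop, assuming a PyTorch model that maps a token sequence to per-token logits; the mask-token id, vocabulary size, sequence length, and the linear remasking schedule below are illustrative choices, not the paper's exact configuration:

```python
# Minimal sketch of LLaDA-style masked-diffusion pretraining and sampling.
# Assumptions (illustrative, not the released code): `model(x)` returns
# per-token logits of shape (batch, length, vocab); MASK_ID, VOCAB, and
# LENGTH are hypothetical values.
import torch
import torch.nn.functional as F

MASK_ID, VOCAB, LENGTH = 0, 1000, 32

def forward_mask(x0):
    """Forward process: mask each token independently with prob. t ~ U[0,1]."""
    t = torch.rand(x0.shape[0], 1)                          # one t per sequence
    masked = torch.rand(x0.shape, device=x0.device) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    return xt, masked, t

def pretrain_loss(model, x0):
    """Cross-entropy on masked positions only, reweighted by 1/t; this is
    the bound on the negative log-likelihood that masked diffusion optimizes."""
    xt, masked, t = forward_mask(x0)
    logits = model(xt)                                      # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    return ((ce * masked) / t).sum() / masked.sum().clamp(min=1)

@torch.no_grad()
def sample(model, prompt, steps=16):
    """Reverse process: start fully masked, predict every mask at once,
    then remask the lowest-confidence predictions and repeat until t = 0."""
    x = torch.full((1, LENGTH), MASK_ID, dtype=prompt.dtype)
    x[:, : prompt.shape[1]] = prompt                        # prompt is never masked
    resp_len = LENGTH - prompt.shape[1]
    for step in range(steps):
        was_masked = x == MASK_ID
        conf, pred = model(x).softmax(-1).max(-1)
        x = torch.where(was_masked, pred, x)                # fill every mask
        n_remask = int((1.0 - (step + 1) / steps) * resp_len)  # linear schedule
        if n_remask > 0:
            conf = conf.masked_fill(~was_masked, float("inf"))  # keep committed tokens
            idx = conf.topk(n_remask, largest=False).indices
            x.scatter_(1, idx, MASK_ID)
    return x

# Toy usage with an untrained network, just to show the shapes involved.
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB))
x0 = torch.randint(1, VOCAB, (4, LENGTH))                   # avoid MASK_ID in data
print(pretrain_loss(model, x0))
print(sample(model, x0[:1, :4]))
```

With a trained network, increasing `steps` trades compute for quality; the rule above, which remasks the lowest-confidence predictions, is one instance of the flexible remasking mentioned above.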
LLaDA demonstrates impressive scalability, with an overall trend highly competitive with that of an autoregressive baseline trained on the same data.
A text generation method different from the traditional left-to-right approach.
Prompt: "Explain what artificial intelligence is."
@article{nie2025large,
title={Large language diffusion models},
author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
journal={arXiv preprint arXiv:2502.09992},
year={2025}
}
[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981-17993, 2021.
[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
[3] Nie, S., Zhu, F., Du, C., et al. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.