We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.
We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.
LLaDA is a masked diffusion model [1, 2] that follows the standard pretraining and SFT pipeline while sampling via diffusion. During pretraining, each token is masked independently with probability \( t \sim U[0, 1] \); during SFT, only response tokens may be masked. The model simulates a diffusion process from fully masked (\( t = 1 \)) to fully unmasked (\( t = 0 \)), predicting all masked tokens simultaneously at each step and applying flexible remasking strategies. A minimal training and sampling sketch is given below.
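The sketch below illustrates this recipe: a forward process that masks each token with probability \( t \sim U[0, 1] \) and a cross-entropy loss over masked positions weighted by \( 1/t \), plus a reverse (sampling) loop that starts fully masked and progressively unmasks by confidence. It is not the authors' implementation; `model`, `MASK_ID`, `gen_len`, `num_steps`, and the low-confidence remasking schedule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # hypothetical id of the [MASK] token


def pretraining_loss(model, x0):
    """Masked-diffusion loss sketch: mask each token independently with
    probability t ~ U[0, 1], predict all masked tokens at once, and weight
    the cross-entropy at masked positions by 1/t."""
    b, L = x0.shape
    t = torch.rand(b, device=x0.device)                          # one ratio per sequence
    is_masked = torch.rand(b, L, device=x0.device) < t[:, None]  # forward (noising) process
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)                                           # (b, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, L)
    ce = (ce * is_masked) / t[:, None]                           # only masked positions contribute
    return ce.sum() / (b * L)


@torch.no_grad()
def sample(model, prompt, gen_len=128, num_steps=128):
    """Reverse-process sketch: start from a fully masked response (t = 1),
    predict all masks at every step, then remask the lowest-confidence
    predictions until t reaches 0 (an assumed low-confidence remasking rule)."""
    pad = torch.full((prompt.shape[0], gen_len), MASK_ID, device=prompt.device, dtype=prompt.dtype)
    x = torch.cat([prompt, pad], dim=1)
    for step in range(num_steps):
        masked = x == MASK_ID
        probs = model(x).softmax(-1)
        conf, pred = probs.max(-1)
        x = torch.where(masked, pred, x)                         # unmask everything at once
        n_remask = int(gen_len * (1 - (step + 1) / num_steps))   # how many to remask this step
        if n_remask > 0:
            conf = conf.masked_fill(~masked, float("inf"))       # never remask already-known tokens
            remask_idx = conf.topk(n_remask, largest=False).indices
            x.scatter_(1, remask_idx, MASK_ID)
    return x
```

For SFT, the same loss applies with the masking restricted to response tokens, leaving the prompt untouched.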
LLaDA demonstrates impressive scalability, with an overall trend that is highly competitive with that of an autoregressive baseline trained on the same data.
Demo prompt: Explain what artificial intelligence is.
@misc{nie2025largelanguagediffusionmodels,
      title={Large Language Diffusion Models},
      author={Shen Nie and Fengqi Zhu and Zebin You and Xiaolu Zhang and Jingyang Ou and Jun Hu and Jun Zhou and Yankai Lin and Ji-Rong Wen and Chongxuan Li},
      year={2025},
      eprint={2502.09992},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09992},
}
[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.