We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.
We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.
LLaDA is a masked diffusion model [1, 2, 3] that follows the standard pretraining and SFT pipeline while sampling via diffusion. During pretraining, each token is masked independently with probability \( t \sim U[0,1] \); during SFT, only response tokens may be masked. At inference, the model simulates a reverse diffusion process from fully masked (\(t = 1\)) to fully unmasked (\(t = 0\)), predicting all masked tokens simultaneously at each step and flexibly remasking a subset of them, as sketched below.
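To make the paradigm concrete, the following is a minimal PyTorch sketch of the masked-diffusion training objective and a low-confidence remasking sampler, under stated assumptions: it is not the official implementation, and `model`, `mask_id`, and all hyperparameters are placeholders (the model is assumed to be a bidirectional Transformer that maps token ids to per-position vocabulary logits).

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """One pretraining step of the masked-diffusion objective (a sketch).

    x0: LongTensor of clean token ids, shape (batch, seq_len).
    mask_id: id of the special [MASK] token.
    """
    b, l = x0.shape
    # Sample a masking ratio t ~ U[0, 1] per sequence (clamped to avoid 1/t blow-up).
    t = torch.rand(b, device=x0.device).clamp(min=1e-3)               # (b,)
    # Mask each token independently with probability t.
    is_masked = torch.rand(b, l, device=x0.device) < t[:, None]       # (b, l)
    xt = torch.where(is_masked, mask_id, x0)

    logits = model(xt)                                                 # (b, l, vocab)
    # Cross-entropy on masked positions only, reweighted by 1/t.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none") # (b, l)
    return (ce * is_masked / t[:, None]).sum() / (b * l)

@torch.no_grad()
def sample(model, prompt, gen_len, mask_id, steps=64):
    """Reverse-diffusion sampling (a sketch): start fully masked, predict every
    masked token at each step, then remask the least-confident predictions so
    the number of masked tokens shrinks linearly to zero."""
    b = prompt.shape[0]
    masked_tail = torch.full((b, gen_len), mask_id,
                             dtype=prompt.dtype, device=prompt.device)
    x = torch.cat([prompt, masked_tail], dim=1)
    for step in range(steps):
        is_masked = x == mask_id
        probs = model(x).softmax(-1)
        conf, pred = probs.max(-1)
        # Fill every masked position with the current prediction.
        x = torch.where(is_masked, pred, x)
        # Linearly decreasing number of tokens kept masked for the next step.
        n_keep_masked = int(gen_len * (1 - (step + 1) / steps))
        if n_keep_masked > 0:
            conf = conf.masked_fill(~is_masked, float("inf"))
            remask_idx = conf.topk(n_keep_masked, dim=-1, largest=False).indices
            x.scatter_(1, remask_idx, mask_id)
    return x
```

The low-confidence remasking rule above is one possible strategy; the framework also admits random remasking or other schedules for choosing which predictions to keep at each step.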
LLaDA demonstrates impressive scalability, with an overall scaling trend that is highly competitive with that of an autoregressive baseline trained on the same data.
Demo prompt: Explain what artificial intelligence is.
@article{nie2025large,
title={Large language diffusion models},
author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
journal={arXiv preprint arXiv:2502.09992},
year={2025}
}
[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
[3] Nie, S., Zhu, F., Du, C., et al. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.