Large Language Diffusion Models

A new paradigm for large language modeling based on diffusion

¹ Renmin University of China  ² Ant Group
[Figures: LLaDA vs LLaMA; LLaDA vs LLaMA Chat]
TL;DR: We introduce LLaDA, a diffusion model with an unprecedented 8B scale, trained entirely from scratch, rivaling LLaMA3 8B in performance.

What is now proved was once only imagined.

— William Blake

Motivation

We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.
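The principle referred to here can be written out explicitly. The following is a standard identity rather than a result specific to this work; the symbols \(p_{\text{data}}\) (the true language distribution) and \(p_\theta\) (the model distribution) are notation assumed here for illustration. Maximizing the expected log-likelihood is equivalent to minimizing the KL divergence from the data distribution to the model:

\[
\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right]
\;\Longleftrightarrow\;
\min_\theta \; \mathrm{KL}\left(p_{\text{data}} \,\|\, p_\theta\right)
\]

Nothing in this objective fixes a left-to-right factorization of \(p_\theta\); autoregression is one way to parameterize the model, not a requirement of the objective.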

We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.

Method

LLaDA is a masked diffusion model [1, 2, 3] that follows the standard pretraining and SFT pipeline but samples via diffusion. During pretraining, every token is eligible for masking, and each is masked independently with probability \( t \sim U[0,1] \); in SFT, only response tokens may be masked, so the prompt stays visible. Sampling simulates the reverse diffusion process from full masking (\(t = 1\)) to no masking (\(t = 0\)): at each step the model predicts all masked tokens simultaneously, and a remasking strategy, which can be chosen flexibly, decides which predictions to mask again.
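The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MASK` symbol, the `forward_mask` helper, and the `maskable` flag list are all hypothetical names introduced here, and tokens are plain strings rather than vocabulary ids.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def forward_mask(tokens, t, rng, maskable=None):
    """Mask each eligible token independently with probability t.

    maskable: optional list of booleans; False marks positions (for example,
    prompt tokens during SFT) that must never be masked. None means every
    position is eligible, as in pretraining.
    """
    if maskable is None:
        maskable = [True] * len(tokens)
    noisy = []
    for tok, ok in zip(tokens, maskable):
        if ok and rng.random() < t:
            noisy.append(MASK)
        else:
            noisy.append(tok)
    return noisy


rng = random.Random(0)

# Pretraining: draw t ~ U[0, 1], then mask any position with probability t.
sentence = "diffusion models can also model language".split()
t = rng.random()
pretrain_input = forward_mask(sentence, t, rng)

# SFT: the prompt stays visible; only response tokens are eligible.
prompt = ["What", "is", "2+2", "?"]
response = ["It", "is", "4", "."]
maskable = [False] * len(prompt) + [True] * len(response)
sft_input = forward_mask(prompt + response, t=1.0, rng=rng, maskable=maskable)

print(pretrain_input)
print(sft_input)
```

With `t=1.0` every eligible position is masked, so the SFT example shows the state from which sampling starts: an intact prompt followed by a fully masked response.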

Scalability

LLaDA scales well: its overall trend is highly competitive with that of the autoregressive baseline trained on the same data.

[Scaling plots: MMLU, ARC-C, CMMLU, PIQA, GSM8K, HumanEval]

Sampling Process

A text generation method that differs from the traditional left-to-right approach.

Prompt: "Explain what artificial intelligence is."

Diffusion Sampling
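To make the contrast with left-to-right decoding concrete, here is a toy sketch of a reverse-diffusion sampling loop. Everything in it is an assumption made for illustration: `toy_predict` is a stand-in that returns placeholder tokens with random confidences instead of a trained mask predictor, the linear schedule is one simple choice, and keeping the most confident predictions while remasking the rest is just one possible remasking rule.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def toy_predict(seq, rng):
    """Stand-in for the mask predictor: returns a (token, confidence) guess
    for every masked position. A real model would condition on `seq`."""
    return {i: (f"tok{i}", rng.random()) for i, s in enumerate(seq) if s == MASK}


def diffusion_sample(prompt, gen_len, steps, predict, rng):
    # Start at t = 1: the prompt is visible, the response is fully masked.
    seq = list(prompt) + [MASK] * gen_len
    for step in range(steps):
        guesses = predict(seq, rng)  # predict all masked positions at once
        if not guesses:
            break
        # Fraction of the response that stays masked after this step
        # (linear schedule from 1 down to 0).
        remaining = 1.0 - (step + 1) / steps
        n_keep_masked = int(gen_len * remaining)
        # Rank predictions by confidence; commit the most confident ones and
        # leave (that is, remask) the least confident ones.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        n_unmask = max(len(ranked) - n_keep_masked, 0)
        for i in ranked[:n_unmask]:
            seq[i] = guesses[i][0]
    return seq


prompt = ["Explain", "AI", ":"]
out = diffusion_sample(prompt, gen_len=8, steps=4,
                       predict=toy_predict, rng=random.Random(0))
print(out)
```

Unlike autoregressive decoding, positions are filled in an order set by confidence rather than by index, and the number of model calls is governed by `steps` rather than by the response length.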

Case Studies

Math Problem
User
Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?
LLaDA-Instruct
Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500.
Daily Life
User
Please recommend me three famous movies
LLaDA-Instruct
Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim:

1. The Shawshank Redemption (1994) - Inspired by a novella and directed by Frank Darabont, this film tells the story of Andy Dufresne, a man convicted of murdering his wife, dealing with his harsh life in Shawshank Prison and forming an unlikely friendship with a fellow inmate.

2. The Godfather (1972) - Directed by Francis Ford Coppola, this film is based on Mario Puzo's novel of the same name. It follows the story of the Corleone family, particularly Michael Corleone, who rises to power and takes over the family's criminal empire.

3. The Empire Strikes Back (1980) - Directed by George Lucas, this film continues the story of a young Skywalker in the Star Wars saga. It follows Luke Skywalker as he helps the Rebel Alliance fight against the Galactic Empire and destroy the powerful weapon, the Death Star.

All of these films are celebrated for their cinematic achievements, powerful storytelling, and memorable characters.
Translation (English → Chinese)
User
Please help me translate into Chinese: 'What is now proved was once only imagined'
LLaDA-Instruct
现在被证明的曾经只是想象
Multi-turn Dialogue
User
Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken'?
LLaDA-Instruct
Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.
User
Please help me translate into Chinese.
LLaDA-Instruct
Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into Chinese as: "两条路分岔在黄色的树林中,遗憾我不能同时走
User
Please also write a poem about life choices. I require 5 sentences, and the first word of each sentence starts with C as the first letter.
LLaDA-Instruct
Certainly! Here's a poem about life choices, with each sentence starting with the letter "C":

Choices cascade through life's journey,
Creating connections with each decision,
Careful consideration leads to better paths,
Courageous choices shape our destiny,
Choosing wisely guides us through the years.

Citation

@article{nie2025large,
  title={Large language diffusion models},
  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

References

[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981-17993, 2021.

[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.

[3] Nie, S., Zhu, F., Du, C., et al. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.