We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.
We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.
LLaDA is a masked diffusion model [1, 2, 3] that follows the standard pretraining and SFT pipeline while sampling via diffusion. During pretraining, each token is masked independently with probability \( t \sim U[0,1] \); during SFT, only response tokens may be masked. At inference, the model simulates a reverse diffusion process from fully masked (\(t = 1\)) to fully unmasked (\(t = 0\)), predicting all masked tokens simultaneously at each step and flexibly remasking a subset of them, as sketched below.
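To make the paradigm concrete, the following is a minimal PyTorch sketch of the masked-diffusion training objective and a low-confidence remasking sampler, under stated assumptions: it is not the official implementation, and `model`, `mask_id`, and all hyperparameters are placeholders (the model is assumed to be a bidirectional Transformer that maps token ids to per-position vocabulary logits).

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0, mask_id):
    """One pretraining step of the masked-diffusion objective (a sketch).

    x0: LongTensor of clean token ids, shape (batch, seq_len).
    mask_id: id of the special [MASK] token.
    """
    b, l = x0.shape
    # Sample a masking ratio t ~ U[0, 1] per sequence (clamped to avoid 1/t blow-up).
    t = torch.rand(b, device=x0.device).clamp(min=1e-3)               # (b,)
    # Mask each token independently with probability t.
    is_masked = torch.rand(b, l, device=x0.device) < t[:, None]       # (b, l)
    xt = torch.where(is_masked, mask_id, x0)

    logits = model(xt)                                                 # (b, l, vocab)
    # Cross-entropy on masked positions only, reweighted by 1/t.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none") # (b, l)
    return (ce * is_masked / t[:, None]).sum() / (b * l)

@torch.no_grad()
def sample(model, prompt, gen_len, mask_id, steps=64):
    """Reverse-diffusion sampling (a sketch): start fully masked, predict every
    masked token at each step, then remask the least-confident predictions so
    the number of masked tokens shrinks linearly to zero."""
    b = prompt.shape[0]
    masked_tail = torch.full((b, gen_len), mask_id,
                             dtype=prompt.dtype, device=prompt.device)
    x = torch.cat([prompt, masked_tail], dim=1)
    for step in range(steps):
        is_masked = x == mask_id
        probs = model(x).softmax(-1)
        conf, pred = probs.max(-1)
        # Fill every masked position with the current prediction.
        x = torch.where(is_masked, pred, x)
        # Linearly decreasing number of tokens kept masked for the next step.
        n_keep_masked = int(gen_len * (1 - (step + 1) / steps))
        if n_keep_masked > 0:
            conf = conf.masked_fill(~is_masked, float("inf"))
            remask_idx = conf.topk(n_keep_masked, dim=-1, largest=False).indices
            x.scatter_(1, remask_idx, mask_id)
    return x
```

The low-confidence remasking rule above is one possible strategy; the framework also admits random remasking or other schedules for choosing which predictions to keep at each step.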
LLaDA demonstrates impressive scalability, with an overall scaling trend that is highly competitive with that of an autoregressive baseline trained on the same data.
Demo prompt: Explain what artificial intelligence is.
@article{nie2025large,
title={Large language diffusion models},
author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
journal={arXiv preprint arXiv:2502.09992},
year={2025}
}
[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.
[3] Nie, S., Zhu, F., Du, C., et al. Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514, 2024.