Large Language Diffusion Models

Renmin University of China · Ant Group

TL;DR: We introduce LLaDA, a diffusion model with an unprecedented 8B scale, trained entirely from scratch, rivaling LLaMA3 8B in performance.

[Figure: Benchmark comparison of LLaDA vs. LLaMA3 (base models)]
[Figure: Benchmark comparison of LLaDA vs. LLaMA3 (chat models)]
What is now proved was once only imagined. ——— William Blake

Motivation

We contend that the intelligence of LLMs—manifested in scalability, instruction-following, in-context learning, conversational ability, and compression—stems not from the autoregressive mechanism per se, but rather from the core principle of generative modeling: approximating the true language distribution through maximum likelihood estimation.

We introduce LLaDA (Large Language Diffusion with mAsking), a simple yet principled generative paradigm for large language models that demonstrates the aforementioned remarkable capabilities.

Method

LLaDA is a masked diffusion model [1, 2] that follows the standard pretraining and SFT pipeline while sampling via diffusion. During pretraining, tokens are masked independently at a random ratio \( t \sim U[0,1] \); during SFT, only response tokens may be masked. At inference, the model simulates a diffusion process from fully masked (\(t = 1\)) to fully unmasked (\(t = 0\)), predicting all masked tokens simultaneously at each step with flexible remasking.
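As a rough illustration (not the authors' implementation), the forward masking process and the iterative predict-then-remask reverse process can be sketched as follows; `MASK_ID` and the toy `predictor` are placeholders for the real mask token and the mask-predictor network:

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def forward_mask(tokens, t, rng):
    """Forward process: independently replace each token with [MASK] at ratio t."""
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < t
    return np.where(masked, MASK_ID, tokens), masked


def reverse_sample(length, predictor, steps=4, rng=None):
    """Reverse process: start fully masked (t = 1), predict every masked
    position at once each step, then remask a shrinking fraction until t = 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    seq = np.full(length, MASK_ID)
    for step in range(steps):
        pred = predictor(seq)                       # fill in all masked slots
        seq = np.where(seq == MASK_ID, pred, seq)
        t_next = 1.0 - (step + 1) / steps           # target mask ratio after this step
        remask = rng.random(length) < t_next        # flexible remasking
        seq = np.where(remask, MASK_ID, seq)
    return seq


# Toy predictor that already "knows" the target sequence, for illustration only.
target = np.arange(10)
out = reverse_sample(10, lambda s: target)
assert np.array_equal(out, target)  # t = 0 at the last step, so nothing stays masked
```

The `1/steps` schedule here is only one choice; the remasking strategy is a free design knob in this family of models.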

[Figure: Overview of the LLaDA method]

Scalability

LLaDA demonstrates impressive scalability, with an overall scaling trend highly competitive with that of autoregressive baselines trained on the same data.

[Figure: Scaling curves on MMLU, ARC-C, CMMLU, PIQA, GSM8K, and HumanEval]

A text generation method different from the traditional left-to-right approach

Prompt: Explain what artificial intelligence is.

[Animation: diffusion-based generation for the prompt above]

Case Studies

Math problem
User
Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?
LLaDA-Instruct
Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500.
Daily life
User
Please recommend me three famous movies
LLaDA-Instruct
Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim:
1. The Shawshank Redemption (1994) - Inspired by a novella and directed by Frank Darabont, this film tells the story of Andy Dufresne, a man convicted of murdering his wife, dealing with his harsh life in Shawshank Prison and forming an unlikely friendship with a fellow inmate.
2. The Godfather (1972) - Directed by Francis Ford Coppola, this film is based on Mario Puzo's novel of the same name. It follows the story of the Corleone family, particularly Michael Corleone, who rises to power and takes over the family's criminal empire.
3. The Empire Strikes Back (1980) - Directed by George Lucas, this film continues the story of a young Skywalker in the Star Wars saga. It follows Luke Skywalker as he helps the Rebel Alliance fight against the Galactic Empire and destroy the powerful weapon, the Death Star.
All of these films are celebrated for their cinematic achievements, powerful storytelling, and memorable characters.
Code generation
User
I am writing python code, I have an array, please help me sort it from largest to smallest
LLaDA-Instruct
Certainly! You can sort an array from largest to smallest in Python using the `sort()` method or the `sorted()` function. Here's an example:
### Using the `sort()` method:
my_array = [3, 1, 4, 1, 5, 2]
my_array.sort(reverse=True)
print(my_array)
Translation (English → Chinese)
User
Please help me translate into Chinese: ‘What is now proved was once only imagined’
LLaDA-Instruct
现在被证明的曾经只是想象
Translation (English → German)
User
Please help me translate into German: ‘What is now proved was once only imagined’
LLaDA-Instruct
Wer heute bestätigt ist, war einst nur immaginiert.
Multi-turn dialogue case 1
User
Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken' are?
LLaDA-Instruct
Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.
User
Please help me translate into Chinese.
LLaDA-Instruct
Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into Chinese as: "两条路分岔在黄色的树林中,遗憾我不能同时走"
User
Please translate into German.
LLaDA-Instruct
Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into German as: "Zwei Wege trennten sich im gelben Wald, und leider konnte ich nicht beide tun."
User
Please also write a poem about life choices. I require 5 sentences, and the first word of each sentence starts with C as the first letter.
LLaDA-Instruct
Certainly! Here's a poem about life choices, with each sentence starting with the letter "C": Choices cascade through life's journey, Creating connections with each decision, Careful consideration leads to better paths, Courageous choices shape our destiny, Choosing wisely guides us through the years.
Multi-turn dialogue case 2
User
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
LLaDA-Instruct
Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Altogether, Natalia sold 48 + 24 = 72 clips in April and May. The final result is 72.
User
Natalia sold three times as many items in June as she did in May. How many clips did Natalia sell altogether in these three months?
LLaDA-Instruct
Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Natalia sold three times as many items in June as she did in May, so she sold 24 * 3 = 72 clips in June. Altogether, Natalia sold 48 + 24 + 72 = 144 clips in these three months. The final result is 144.

BibTeX

@misc{nie2025largelanguagediffusionmodels,
      title={Large Language Diffusion Models}, 
      author={Shen Nie and Fengqi Zhu and Zebin You and Xiaolu Zhang and Jingyang Ou and Jun Hu and Jun Zhou and Yankai Lin and Ji-Rong Wen and Chongxuan Li},
      year={2025},
      eprint={2502.09992},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09992}, 
}

References

[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

[2] Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.