LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

1Renmin University of China, 2Tsinghua University, 3Ant Group
* Equal contribution, § Work done during an internship at Ant Group, Project leader, Corresponding author

TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.

[Figure: llada_dpo]

Motivation: The Problem with RL-Based Alignment in Diffusion Language Models

Masked Diffusion Models (MDMs) cannot compute exact log-likelihoods directly. Taking DPO as an example, we must approximate the log-likelihoods with Evidence Lower Bounds (ELBOs):

$$\mathcal{L}_{\mathrm{DPO}}^{\mathrm{E}}(\theta) = -\mathbb{E}_{(y_w, y_l)}\Big[\log \sigma\Big(\beta\big(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\big) - \beta\big(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\big)\Big)\Big]$$

Key Challenge: Each ELBO must itself be estimated by Monte Carlo sampling, and this sampling noise propagates through the nonlinear log-sigmoid, introducing both bias and additional variance into the preference loss.
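To make the setup concrete, here is a minimal PyTorch sketch (not the official implementation) of a Monte Carlo ELBO estimate for an MDM plugged into the DPO loss. The model interface `model(noisy_tokens) -> per-token logits`, the `mask_id` token, and the $1/t$ reweighting of the masked reconstruction term are assumptions for illustration.

```python
# Sketch only: Monte Carlo ELBO estimate for a masked diffusion model,
# substituted into the DPO preference loss in place of exact log-likelihoods.
import torch
import torch.nn.functional as F

def elbo_estimate(model, tokens, mask_id, n_samples=8):
    """Estimate B_pi(y) by averaging masked reconstruction terms over
    random timesteps t ~ U(0, 1] and random masks."""
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(1, device=tokens.device).clamp(min=1e-3)      # timestep
        mask = torch.rand_like(tokens, dtype=torch.float) < t         # mask each token w.p. t
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logits = model(noisy)                                          # [L, V] per-token logits (assumed interface)
        logp = F.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        estimates.append((logp * mask).sum() / t)                      # 1/t-weighted masked term (assumed weighting)
    return torch.stack(estimates).mean()

def dpo_preference_loss(model, ref_model, y_w, y_l, mask_id, beta=0.1, n_samples=8):
    """DPO loss with Monte Carlo ELBOs standing in for exact log-likelihoods."""
    b_w = elbo_estimate(model, y_w, mask_id, n_samples)
    b_l = elbo_estimate(model, y_l, mask_id, n_samples)
    with torch.no_grad():                                              # reference model is frozen
        b_w_ref = elbo_estimate(ref_model, y_w, mask_id, n_samples)
        b_l_ref = elbo_estimate(ref_model, y_l, mask_id, n_samples)
    return -F.logsigmoid(beta * ((b_w - b_w_ref) - (b_l - b_l_ref)))
```

Because the ELBO estimates are noisy and enter the loss through the nonlinear log-sigmoid, the resulting loss estimate is both biased and high-variance; this is exactly what VRPO targets.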

VRPO: Three Simple Techniques for Variance Reduction

Core Insight: We prove that both bias and variance can be bounded by the variance of the preference score estimator. Therefore, reducing this variance improves overall optimization.

1️⃣ Increased Budget

Use more Monte Carlo samples $n = n_{\mathrm{time}} \times n_{\mathrm{mask}}$ to estimate each ELBO

2️⃣ Optimal Allocation

Set $n_{\mathrm{time}} = n$ and $n_{\mathrm{mask}} = 1$ (one mask per timestep)

3️⃣ Antithetic Sampling

Share the sampled timesteps and masks between $\pi_\theta$ and $\pi_{\mathrm{ref}}$ (see the sketch below)
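The sketch below (same assumed model interface and weighting as above, still not the official code) shows how the three choices fit together when estimating one preference score $\mathcal{B}_{\pi_\theta}(y) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y)$: the whole budget $n$ goes to timesteps, one mask is drawn per timestep, and the same $(t, \text{mask})$ draws are reused for both models.

```python
# Sketch of the three VRPO variance-reduction choices:
# (1) spend the full budget n on timesteps, (2) one mask per timestep,
# (3) reuse each (t, mask) draw for both pi_theta and pi_ref (antithetic sampling).
import torch
import torch.nn.functional as F

def vrpo_preference_score(model, ref_model, tokens, mask_id, n=8):
    """Estimate B_pi_theta(y) - B_pi_ref(y) with shared randomness."""
    score = 0.0
    for _ in range(n):                                   # n_time = n, n_mask = 1
        t = torch.rand(1, device=tokens.device).clamp(min=1e-3)
        mask = torch.rand_like(tokens, dtype=torch.float) < t
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

        logp = F.log_softmax(model(noisy), dim=-1)
        term = (logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1) * mask).sum() / t
        with torch.no_grad():                            # same (t, mask) reused for the reference model
            ref_logp = F.log_softmax(ref_model(noisy), dim=-1)
            ref_term = (ref_logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1) * mask).sum() / t
        score = score + (term - ref_term)
    return score / n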

[Figure: method]

Impact: VRPO improves LLaDA's performance across a broad set of benchmarks. Techniques 2 and 3 improve results without any additional sampling cost.

Bibtex

Please consider citing:

@article{zhu2025llada,
    title={LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models},
    author={Zhu, Fengqi and Wang, Rongzhen and Nie, Shen and Zhang, Xiaolu and Wu, Chunwei and Hu, Jun and Zhou, Jun and Chen, Jianfei and Lin, Yankai and Wen, Ji-Rong and others},
    journal={arXiv preprint arXiv:2505.19223},
    year={2025}
}