LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

1Renmin University of China, 2Tsinghua University, 3Ant Group
* Equal contribution, § Work done during an internship at Ant Group, Project leader, Corresponding author

TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.

[Figure: llada_dpo]

Motivation: The Problem with RL-Based Alignment in Diffusion Language Models

Masked Diffusion Models (MDMs) cannot compute exact log-likelihoods directly. Taking DPO as an example, we must approximate the log-likelihoods with Evidence Lower Bounds (ELBOs):

$$\mathcal{L}_{\mathrm{DPO}}^{\mathrm{E}}(\theta) = -\mathbb{E}_{(y_w, y_l)}\Big[\log \sigma\Big(\beta\big(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\big) - \beta\big(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\big)\Big)\Big]$$

Key Challenge: Each ELBO must itself be estimated by Monte Carlo sampling, and this sampling noise propagates through the nonlinear log-sigmoid, introducing both bias and additional variance into the preference loss.
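To make the setup concrete, here is a minimal PyTorch sketch (not the official implementation) of a Monte Carlo ELBO estimate for an MDM plugged into the DPO loss. The model interface `model(noisy_tokens) -> per-token logits`, the `mask_id` token, and the $1/t$ reweighting of the masked reconstruction term are assumptions for illustration.

```python
# Sketch only: Monte Carlo ELBO estimate for a masked diffusion model,
# substituted into the DPO preference loss in place of exact log-likelihoods.
import torch
import torch.nn.functional as F

def elbo_estimate(model, tokens, mask_id, n_samples=8):
    """Estimate B_pi(y) by averaging masked reconstruction terms over
    random timesteps t ~ U(0, 1] and random masks."""
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(1, device=tokens.device).clamp(min=1e-3)      # timestep
        mask = torch.rand_like(tokens, dtype=torch.float) < t         # mask each token w.p. t
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logits = model(noisy)                                          # [L, V] per-token logits (assumed interface)
        logp = F.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        estimates.append((logp * mask).sum() / t)                      # 1/t-weighted masked term (assumed weighting)
    return torch.stack(estimates).mean()

def dpo_preference_loss(model, ref_model, y_w, y_l, mask_id, beta=0.1, n_samples=8):
    """DPO loss with Monte Carlo ELBOs standing in for exact log-likelihoods."""
    b_w = elbo_estimate(model, y_w, mask_id, n_samples)
    b_l = elbo_estimate(model, y_l, mask_id, n_samples)
    with torch.no_grad():                                              # reference model is frozen
        b_w_ref = elbo_estimate(ref_model, y_w, mask_id, n_samples)
        b_l_ref = elbo_estimate(ref_model, y_l, mask_id, n_samples)
    return -F.logsigmoid(beta * ((b_w - b_w_ref) - (b_l - b_l_ref)))
```

Because the ELBO estimates are noisy and enter the loss through the nonlinear log-sigmoid, the resulting loss estimate is both biased and high-variance; this is exactly what VRPO targets.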

VRPO: Three Simple Techniques for Variance Reduction

Core Insight: We prove that both bias and variance can be bounded by the variance of the preference score estimator. Therefore, reducing this variance improves overall optimization.

1️⃣ Increased Budget

Use more Monte Carlo samples $n = n_{\mathrm{time}} \times n_{\mathrm{mask}}$ to estimate each ELBO

2️⃣ Optimal Allocation

Set $n_{\mathrm{time}} = n$ and $n_{\mathrm{mask}} = 1$ (one mask per timestep)

3️⃣ Antithetic Sampling

Share the sampled timesteps and masks between $\pi_\theta$ and $\pi_{\mathrm{ref}}$ (see the sketch below)
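The sketch below (same assumed model interface and weighting as above, still not the official code) shows how the three choices fit together when estimating one preference score $\mathcal{B}_{\pi_\theta}(y) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y)$: the whole budget $n$ goes to timesteps, one mask is drawn per timestep, and the same $(t, \text{mask})$ draws are reused for both models.

```python
# Sketch of the three VRPO variance-reduction choices:
# (1) spend the full budget n on timesteps, (2) one mask per timestep,
# (3) reuse each (t, mask) draw for both pi_theta and pi_ref (antithetic sampling).
import torch
import torch.nn.functional as F

def vrpo_preference_score(model, ref_model, tokens, mask_id, n=8):
    """Estimate B_pi_theta(y) - B_pi_ref(y) with shared randomness."""
    score = 0.0
    for _ in range(n):                                   # n_time = n, n_mask = 1
        t = torch.rand(1, device=tokens.device).clamp(min=1e-3)
        mask = torch.rand_like(tokens, dtype=torch.float) < t
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

        logp = F.log_softmax(model(noisy), dim=-1)
        term = (logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1) * mask).sum() / t
        with torch.no_grad():                            # same (t, mask) reused for the reference model
            ref_logp = F.log_softmax(ref_model(noisy), dim=-1)
            ref_term = (ref_logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1) * mask).sum() / t
        score = score + (term - ref_term)
    return score / n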

[Figure: method]

Impact: VRPO improves LLaDA's performance across a broad set of benchmarks. Techniques 2 and 3 improve results without any additional sampling cost.

Bibtex

Please consider citing:

@article{zhu2025llada,
    title={LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models},
    author={Zhu, Fengqi and Wang, Rongzhen and Nie, Shen and Zhang, Xiaolu and Wu, Chunwei and Hu, Jun and Zhou, Jun and Chen, Jianfei and Lin, Yankai and Wen, Ji-Rong and others},
    journal={arXiv preprint arXiv:2505.19223},
    year={2025}
}