LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

1Renmin University of China, 2Tsinghua University, 3Ant Group
* Equal contribution, § Work done during an internship at Ant Group, Project leader, Corresponding author

TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.


Motivation: The Problem with RL-based alignment in Diffusion Language Models

Masked Diffusion Models (MDMs) do not admit exact log-likelihood computation. Taking DPO as an example, the log-likelihoods must therefore be approximated with Evidence Lower Bounds (ELBOs):

\[\mathcal{L}_{\mathrm{DPO-E}}(\theta) = -\mathbb{E}_{(y_w, y_l)} \left[\log \sigma\left(\beta \left(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\right) - \beta\left(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\right)\right)\right]\]

Key Challenge: Estimating each ELBO by Monte Carlo sampling introduces additional variance, and this noise propagates through the nonlinear log-sigmoid, introducing both bias and variance into the loss.
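To see why the noise matters, consider a minimal numerical sketch (all names here are illustrative, not from the paper's code): the DPO loss is \(-\log\sigma\) of the preference score, and since \(-\log\sigma\) is convex, by Jensen's inequality a noisy score estimate inflates the expected loss above its exact value.

```python
import numpy as np

def logsigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def dpo_loss_from_elbos(elbo_theta_w, elbo_ref_w, elbo_theta_l, elbo_ref_l, beta=1.0):
    """DPO loss computed from (Monte Carlo) ELBO estimates of the four terms."""
    score = beta * (elbo_theta_w - elbo_ref_w) - beta * (elbo_theta_l - elbo_ref_l)
    return -logsigmoid(score)

# Simulate a noisy preference-score estimator around a fixed true score:
rng = np.random.default_rng(0)
true_score = 1.0
noisy_scores = true_score + rng.normal(0.0, 2.0, size=100_000)  # high-variance estimate

exact_loss = -logsigmoid(true_score)          # ≈ 0.313
mean_noisy_loss = np.mean(-logsigmoid(noisy_scores))
# Convexity of -log(sigmoid) implies mean_noisy_loss > exact_loss (Jensen gap),
# i.e. estimator variance turns into bias in the loss.
```

This Jensen gap is exactly the bias that VRPO's variance reduction shrinks.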

VRPO: Three Simple Techniques for Variance Reduction

Core Insight: We prove that both bias and variance can be bounded by the variance of the preference score estimator. Therefore, reducing this variance improves overall optimization.

1️⃣ Increased Budget

Use more samples \(n = n_{\mathrm{time}} \times n_{\mathrm{mask}}\) to estimate each ELBO

2️⃣ Optimal Allocation

Set \(n_{\mathrm{time}} = n\) and \(n_{\mathrm{mask}} = 1\) (one mask per timestep)

3️⃣ Antithetic Sampling

Share timesteps and masks between \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\)
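The three techniques combine into a single estimator. Below is a minimal sketch (the function names, the `model(y, t, mask)` interface, and the masking rule are illustrative assumptions, not the released implementation): draw \(n\) timesteps with one mask each, and reuse the same (timestep, mask) pairs for both \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\).

```python
import numpy as np

def vrpo_preference_score(policy, ref, y_w, y_l, beta=1.0, n=8, rng=None):
    """Sketch of a VRPO-style preference-score estimator.

    - Increased budget: n Monte Carlo samples per ELBO (technique 1).
    - Optimal allocation: n_time = n timesteps, n_mask = 1 mask each (technique 2).
    - Antithetic sampling: the SAME (t, mask) pairs are shared between
      the policy and the reference model (technique 3).
    `policy` / `ref` are callables returning a per-sample log-likelihood term.
    """
    rng = rng or np.random.default_rng()

    def draw_samples(y):
        samples = []
        for _ in range(n):                      # n_time = n
            t = rng.uniform(0.0, 1.0)           # one timestep per sample
            mask = rng.random(len(y)) < t       # n_mask = 1: one mask per timestep
            samples.append((t, mask))
        return samples

    def elbo(model, y, samples):
        # average the per-sample masked log-likelihood terms
        return np.mean([model(y, t, mask) for t, mask in samples])

    s_w, s_l = draw_samples(y_w), draw_samples(y_l)
    # Antithetic: evaluate policy and reference on the identical samples,
    # so sample-level noise cancels in the difference.
    score_w = elbo(policy, y_w, s_w) - elbo(ref, y_w, s_w)
    score_l = elbo(policy, y_l, s_l) - elbo(ref, y_l, s_l)
    return beta * (score_w - score_l)
```

Sharing samples makes the noise common to both ELBO terms: if `policy` and `ref` were the same model, the estimated score would be exactly zero regardless of which timesteps and masks were drawn.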


Impact: VRPO improves LLaDA's performance across a wide range of benchmarks. Techniques 2 and 3 improve results at no additional compute cost.

Bibtex

Please consider citing:

@article{zhu2025llada,
    title={LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models},
    author={Zhu, Fengqi and Wang, Rongzhen and Nie, Shen and Zhang, Xiaolu and Wu, Chunwei and Hu, Jun and Zhou, Jun and Chen, Jianfei and Lin, Yankai and Wen, Ji-Rong and others},
    journal={arXiv preprint arXiv:2505.19223},
    year={2025}
}