The Crucial Role of Samplers in Online Direct Preference Optimization

Ruizhe Shi*, Runlong Zhou*, Simon S. Du

We prove that online DPO with a mixture of samplers achieves quadratic convergence with exact gradients and linear convergence with estimated gradients.
