The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi*, Runlong Zhou*, Simon S. Du
We prove that online DPO with a mixture of samplers achieves quadratic convergence with exact gradients and linear convergence with estimated gradients.