Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback
Runlong Zhou, Maryam Fazel, Simon S. Du
We propose the EGPO algorithm for Nash learning from human feedback, which achieves last-iterate linear convergence and admits a simple online IPO implementation.