Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi*, Minhak Song*, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du
We theoretically study the performance gap between RLHF and DPO in two regimes: when optimization is exact but the policy model and the reward model are mis-specified in different ways, and when only finitely many preference samples are available.
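For context, the two pipelines being compared are typically formulated as follows; the notation here follows the standard RLHF and DPO presentations rather than the paper's own, with $\pi_{\mathrm{ref}}$ a reference policy, $r_\phi$ a learned reward model, $\beta$ the KL coefficient, and $(x, y_w, y_l)$ a prompt with preferred and dispreferred responses.

RLHF (KL-regularized reward maximization against a learned reward model):
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

DPO (direct optimization of the preference likelihood, bypassing an explicit reward model):
$$\min_{\pi_\theta}\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

The dichotomy studied in the paper concerns how these two routes separate when the policy class and the reward class are mis-specified differently, or when the expectations above are replaced by finite-sample estimates.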
Access the abstract here