In this post we will take a look at the preference optimisation methods used in LLM post-training pipelines. We will try to understand the differences between their eerily similar objectives and build some mathematical intuition behind them. Let's go!

RLHF, with all its intricacies of training a reward model and then using reinforcement learning to train a policy, felt a bit like building a Rube Goldberg machine to teach a model what we like. It worked, but it was notoriously complex and often unstable. DPO was a game-changer. It showed us that we could get the same results—or better—without the RL headache. But the story didn’t end there. DPO was just the beginning of a Cambrian explosion in preference optimisation research.


Direct Preference Optimisation (DPO)

The genius of DPO was in its reframing of the problem. Instead of the multi-stage RLHF process, DPO showed that we could derive the optimal policy directly from the preference data with a simple classification loss.

https://x.com/fchollet/status/1630241783111364608

Think of it this way: RLHF is like training a dog (the policy) by first hiring a critic (the reward model) to learn what tricks are “good” and then having the critic give the dog treats. DPO realised you could just show the dog pairs of “good” vs. “bad” tricks and teach it directly, cutting out the critic entirely.


The Math That Made It Possible

It all starts with the standard RLHF objective, which aims to maximize a reward function r(x, y) while not straying too far from a reference policy πref (usually the SFT model):

$$ \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)} [r_\phi(x,y)] - \beta\, \mathbb{D}_{KL}[\pi_\theta(y|x) \,\|\, \pi_{ref}(y|x)] $$
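
To make this concrete, here is a minimal PyTorch sketch of a Monte-Carlo estimate of that objective for a batch of sampled responses. The function name, tensor shapes, and random inputs are illustrative assumptions, not part of any real training library:

```python
import torch

def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Batch estimate of the KL-regularised RLHF objective.

    reward:      (batch,) r_phi(x, y) for each sampled response y ~ pi_theta
    logp_policy: (batch,) log pi_theta(y|x), summed over response tokens
    logp_ref:    (batch,) log pi_ref(y|x), summed over response tokens
    """
    # Sample-based estimate of KL[pi_theta || pi_ref], valid because y ~ pi_theta
    kl_estimate = logp_policy - logp_ref
    # Quantity to *maximise*: expected reward minus the beta-weighted KL penalty
    return (reward - beta * kl_estimate).mean()

# Toy usage with random numbers (shapes only; not a real training loop)
reward = torch.randn(4)
logp_policy = torch.randn(4)
logp_ref = torch.randn(4)
print(rlhf_objective(reward, logp_policy, logp_ref))
```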

The key insight from the DPO paper is that this objective has a closed-form solution for the optimal policy, πr, given any reward function r:

$$ \pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{ref}(y|x) \exp\!\left(\frac{1}{\beta} r(x,y)\right) $$

Where Z(x) is a pesky partition function that makes this hard to use directly. But here’s the magic trick: you can rearrange this equation to express the reward function in terms of the policy:

$$ r(x,y) = \beta \log\frac{\pi_r(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) $$
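
A quick sanity check makes this rearrangement tangible. The toy sketch below (made-up numbers, a four-response "vocabulary" for a single prompt) builds π_r from π_ref and r, then recovers r exactly from the log-ratio plus the β log Z(x) correction:

```python
import torch

# Toy setup: 4 possible responses y for a fixed prompt x
pi_ref = torch.tensor([0.4, 0.3, 0.2, 0.1])   # reference policy pi_ref(y|x)
r = torch.tensor([1.0, 2.0, 0.5, 3.0])        # some reward r(x, y)
beta = 0.5

# Closed-form optimal policy: pi_r(y|x) = pi_ref(y|x) * exp(r/beta) / Z(x)
unnormalised = pi_ref * torch.exp(r / beta)
Z = unnormalised.sum()                         # partition function Z(x)
pi_r = unnormalised / Z

# Invert the relationship: beta * log(pi_r / pi_ref) + beta * log Z(x) == r
recovered_r = beta * torch.log(pi_r / pi_ref) + beta * torch.log(Z)
print(torch.allclose(recovered_r, r))          # True
```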

Now, we plug this back into the Bradley-Terry model for preferences, p*(yw ≻ yl | x) = σ(r(x, yw) − r(x, yl)), which only cares about the difference in rewards between a winning (yw) and losing (yl) response. When we take that difference, the β log Z(x) term cancels out perfectly! This leaves us with an expression for the preference probability that depends only on the policies:

$$ p^*(y_w \succ y_l \mid x) = \sigma\left( \beta \log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} \right) $$
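
You can verify the cancellation numerically with the same toy example, assuming hypothetical indices for the winning and losing responses: the Bradley-Terry probability computed from the raw rewards matches the one computed purely from policy log-ratios, because β log Z(x) appears in both terms and drops out of the difference.

```python
import torch

# Same toy example: two candidate responses, y_w (index 3) and y_l (index 2)
pi_ref = torch.tensor([0.4, 0.3, 0.2, 0.1])
r = torch.tensor([1.0, 2.0, 0.5, 3.0])
beta = 0.5

unnormalised = pi_ref * torch.exp(r / beta)
pi_star = unnormalised / unnormalised.sum()    # optimal policy pi*(y|x)

w, l = 3, 2  # y_w preferred over y_l

# Bradley-Terry preference probability from the rewards directly
p_from_rewards = torch.sigmoid(r[w] - r[l])

# Same probability from the policies only -- Z(x) has cancelled out
margin = beta * (torch.log(pi_star[w] / pi_ref[w]) - torch.log(pi_star[l] / pi_ref[l]))
p_from_policies = torch.sigmoid(margin)

print(torch.allclose(p_from_rewards, p_from_policies))  # True
```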

From here, it’s a short hop to the final DPO loss function, which is just the negative log-likelihood of this preference model: