2024 (and 2025) was the year of RL algorithms: a wave of new methods was applied to Reinforcement Learning for Large Language Models. For anyone working with these methods, or with LLMs in general, it is crucial to understand the internals of these algorithms, how they relate to each other, and precisely why each small tweak is made.
This post is a walkthrough of the key research that has shaped the use of RL in LLMs. We’ll start with a simple, foundational understanding and then journey through the many papers, breaking down their core ideas and mathematical intuitions.
At its heart, Reinforcement Learning from Human Feedback (RLHF) is like training a very, very smart pet.
The goal is to teach the LLM to generate responses that maximize its total reward. But where do these rewards come from? That’s where the “Human Feedback” part comes in. We don’t have a pre-defined reward function for what makes a “good” essay or a “helpful” answer. So, we build one from human preferences.
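To make this concrete, here is a minimal sketch of how a reward model can be trained from pairwise human preferences, in the Bradley-Terry style used throughout the RLHF literature. The `reward_model` interface and argument names below are illustrative assumptions, not something defined in this post:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style preference loss.

    Pushes the scalar reward of the human-preferred (chosen) response above
    the reward of the rejected response to the same prompt.

    reward_model: assumed to map a batch of token ids -> one scalar per sequence.
    chosen_ids / rejected_ids: token ids of the preferred and rejected responses.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
    # response consistently outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Training on many such comparisons yields a scalar reward function that stands in for human judgment during the RL step.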
The foundational approach that brought RL to the forefront of LLM training was detailed in the 2017 paper "Deep Reinforcement Learning from Human Preferences" by researchers from OpenAI and DeepMind, and solidified in later work such as InstructGPT (2022). The go-to algorithm for the "RL" part of this process is Proximal Policy Optimization (PPO).

Summary: The classic RLHF pipeline is a three-step process:
1. Supervised Fine-Tuning (SFT): fine-tune a pretrained model on high-quality demonstrations; the resulting SFT model serves as the reference policy ($\pi_{ref}$).
2. Reward Model Training: collect human preference comparisons over model responses and train a reward model to score them.
3. RL Fine-Tuning: optimize the policy against the reward model with PPO, while penalizing divergence from $\pi_{ref}$ (a sketch of the objective follows this list).
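Putting the pieces together, the RL step is commonly written as maximizing the learned reward while staying close to the reference policy. Notation here follows the standard RLHF formulation rather than any one paper:

$$
\max_{\pi_\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x) \,\big]
$$

where $r_\phi$ is the learned reward model and $\beta$ controls how far the policy $\pi_\theta$ is allowed to drift from $\pi_{ref}$.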