2024 (and 2025) was the year of RL algorithms: a wave of new methods was applied to Reinforcement Learning of Large Language Models. For anyone working with these methods, or with LLMs in general, it is crucial to understand the internals of these algorithms, how they relate to each other, and precisely why we make the small tweaks that we do.


This post is a walkthrough of the key research that has shaped the use of RL in LLMs. We’ll start with a simple, foundational understanding and then journey through the many papers, breaking down their core ideas and mathematical intuitions.

The Core Idea: Teaching an LLM with Treats

At its heart, Reinforcement Learning from Human Feedback (RLHF) is like training a very, very smart pet.

The goal is to teach the LLM to generate responses that maximize its total reward. But where do these rewards come from? That’s where the “Human Feedback” part comes in. We don’t have a pre-defined reward function for what makes a “good” essay or a “helpful” answer. So, we build one from human preferences.
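Written as an objective (this is a generic schematic of the setup, not a formula lifted from any one paper), the policy $\pi_\theta$ is trained to maximize the expected reward over prompts $x$ and sampled responses $y$:

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]$$

Everything below is about where $R(x, y)$ comes from and how to optimize this quantity without the model drifting off the rails.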


1. The Blueprint: RLHF and Proximal Policy Optimization (PPO)

The foundational approach that brought RL to the forefront of LLM training was detailed in a 2017 paper by researchers from OpenAI and DeepMind, and solidified in later works like InstructGPT in 2022. The go-to algorithm for the “RL” part of this process is Proximal Policy Optimization (PPO).

[Figure: methods diagram]

Christiano et al. (2017) & Ouyang et al. (2022)

Summary: The classic RLHF pipeline is a three-step process:

  1. Supervised Fine-Tuning (SFT): First, a pre-trained LLM is fine-tuned on a small, high-quality dataset of prompt-response pairs created by human labelers. This teaches the model the general style and format of desired responses. This model is often called the SFT model or reference policy ($\pi_{ref}$).
  2. Reward Modeling (RM): The SFT model is used to generate several responses to a set of prompts. Human labelers then rank these responses from best to worst. This preference data is used to train a separate model—the Reward Model—whose job is to predict the numerical score a human would give to any given response.
  3. RL Fine-Tuning with PPO: The SFT model is further fine-tuned using the Reward Model as a guide. The PPO algorithm updates the LLM’s policy so that it generates responses that score highly under the Reward Model, without deviating too drastically from the SFT model it started from (see the sketch after this list).
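To make steps 2 and 3 concrete, here is a minimal PyTorch-style sketch of the two losses involved. This is not the InstructGPT implementation: the tensor shapes, the hyperparameter values, and the choice to add the KL penalty directly to the loss (many implementations instead fold it into the per-token reward) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise preference (Bradley-Terry style) loss for the Reward Model.

    Pushes the scalar score of the human-preferred response above the score
    of the rejected response. Both inputs have shape (batch,).
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


def ppo_policy_loss(
    logprobs: torch.Tensor,      # log pi_theta(token) under the current policy, (batch, seq)
    old_logprobs: torch.Tensor,  # log-probs from the rollout (old) policy, (batch, seq)
    ref_logprobs: torch.Tensor,  # log-probs from the frozen SFT reference model, (batch, seq)
    advantages: torch.Tensor,    # per-token advantage estimates, (batch, seq)
    clip_eps: float = 0.2,       # illustrative clipping range
    kl_coef: float = 0.1,        # illustrative KL penalty weight
) -> torch.Tensor:
    """Clipped PPO surrogate plus a KL penalty toward the SFT reference policy."""
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Simple per-token estimate of KL(pi_theta || pi_ref): penalizes drifting
    # too far from the SFT model the fine-tuning started from.
    kl_penalty = (logprobs - ref_logprobs).mean()
    return policy_loss + kl_coef * kl_penalty
```

The `clamp` on the probability ratio is what makes PPO “proximal”: once an update would move the policy too far from the rollout policy for a given token, the clipped term stops contributing extra gradient, keeping each policy update small and stable.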