ash80/RLHF_in_notebooks: RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks

This repository provides a reference implementation of the Reinforcement Learning from Human Feedback (RLHF) [Paper] framework, as presented in the "RLHF from scratch, step-by-step, in code" YouTube video.

RLHF is a method for aligning large language models (LLMs), such as GPT-3 or GPT-2, with users' intents. It is essentially a reinforcement learning approach in which, rather than obtaining the reward directly from an environment or a human, a reward model is trained to mimic that feedback. The trained reward model is then used to score the generations from the LLM during the reinforcement learning step (a minimal sketch of such a reward model follows the list below). The RLHF process consists of three steps:

  1. Supervised Fine-Tuning (SFT)
  2. Reward Model Training
  3. Reinforcement Learning via Proximal Policy Optimisation (PPO).
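
To make the reward-model idea concrete, below is a minimal PyTorch sketch of a GPT-2 backbone with a scalar reward head that scores a whole sequence. It is illustrative only and not the repository's code; the class name, pooling choice (last token's hidden state), and usage are assumptions.

import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class GPT2RewardModel(nn.Module):
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(model_name)
        # Reward head: maps a hidden state to a single scalar score
        self.reward_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Summarise the sequence with the final token's hidden state (an illustrative choice)
        return self.reward_head(hidden[:, -1, :]).squeeze(-1)  # shape: (batch,)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reward_model = GPT2RewardModel()
inputs = tokenizer("The movie was a delight from start to finish.", return_tensors="pt")
score = reward_model(**inputs)  # meaningless until the reward model is trained (step 2)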

To build a chatbot from a pretrained LLM, we might:

  • Collect a dataset of question-answer pairs (either human-written or generated by the pretrained model).
  • Have human annotators rank these answers by quality.
  • Follow the three RLHF steps mentioned above:
    1. SFT: Fine-tune the LLM to predict the next tokens given question-answer pairs.
    2. Reward Model: Train another instance of the LLM with an added reward head to mimic human rankings (see the ranking-loss sketch after this list).
    3. PPO: Further optimize the fine-tuned model using PPO to produce answers that the reward model evaluates positively.
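
In the chatbot setting above, the reward model is usually trained with a pairwise ranking loss, so that an answer the annotators preferred receives a higher score than the one they rejected. The sketch below is a generic illustration of that loss, not code from the notebooks (which, for the sentiment task, train a classifier instead); all names are assumptions.

import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch:
    # the loss shrinks as the chosen answer's score exceeds the rejected one's
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy scores for a batch of three ranked answer pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, 1.1])
loss = reward_ranking_loss(chosen, rejected)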

Implementation in this Repository

Instead of building a chatbot, which would require a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie-review sentences labeled as expressing positive or negative sentiment. Our goal is to use RLHF to optimise the pretrained GPT-2 so that it generates only sentences that are likely to express positive sentiment.
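
For a quick look at the data, the dataset can be loaded with the Hugging Face datasets library. This snippet is only an orientation aid (assuming the datasets package is installed); the notebooks handle loading themselves.

from datasets import load_dataset

sst2 = load_dataset("stanfordnlp/sst2")
print(sst2)  # DatasetDict with train / validation / test splits
example = sst2["train"][0]
print(example["sentence"], example["label"])  # label: 1 = positive, 0 = negative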

We achieve this goal by implementing the following three notebooks, each corresponding to one step of the RLHF process:

  1. 1-SFT.ipynb: Fine-tunes GPT-2 via supervised learning on the stanfordnlp/sst2 dataset, training it to generate sentences resembling those in the dataset. After fine-tuning, the model is saved as the SFT model.
  2. 2-RM Training.ipynb: Creates a Reward Model by attaching a reward head to the pretrained GPT-2. This model is trained to predict sentiment labels (positive/negative) of sentences in the stanfordnlp/sst2 dataset. After training, the reward model (GPT-2 + Reward Head) is saved.
  3. 3-RLHF.ipynb: Implements the final reinforcement learning step using PPO (a simplified form of the clipped PPO objective is sketched after this list):
    • Sampling stage: Generates sentences from the policy model (initialized from the SFT model) based on the initial few tokens and scores these sentences using the trained reward model.
    • Optimization stage: Optimizes the policy model parameters using PPO to produce sentences that are more likely to receive higher rewards (positive sentiment scores).
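
At the heart of the optimization stage is PPO's clipped surrogate objective, which keeps the updated policy close to the policy that generated the samples. The sketch below is a simplified, self-contained version; the variable names and clipping value are illustrative, and the notebook's implementation may differ (for example by adding a KL penalty against the SFT model).

import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the sampling policy
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximises the minimum of the two terms; negate it to get a loss
    return -torch.min(unclipped, clipped).mean()

# Dummy per-token log-probabilities and advantages for one small batch
new_lp = torch.tensor([-1.0, -0.8, -1.2])
old_lp = torch.tensor([-1.1, -0.9, -1.0])
adv = torch.tensor([0.5, -0.2, 0.3])
loss = ppo_clip_loss(new_lp, old_lp, adv)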

After completing these steps, GPT-2 will generate sentences aligned specifically to convey positive sentiments.
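
As a hypothetical sanity check once all three notebooks have run, you could load the saved policy model and sample a few continuations to see whether they read as positive. The save path below is an assumption for illustration, not the notebooks' actual output location.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("rlhf_gpt2_positive")  # hypothetical save path

inputs = tokenizer("the film was", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))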

Hugging Face Access Token: You will need an access token from Hugging Face to download the pretrained GPT-2 model. Obtain a token by following the instructions in the Hugging Face Quickstart Guide.

Local Setup:

Set your Hugging Face token as an environment variable named HF_TOKEN:

export HF_TOKEN='your_huggingface_token_here'

Google Colab:

Set your Hugging Face token in Colab Secrets, or set it as an environment variable by running the following code in a notebook cell:

import os
# Make the token available to the Hugging Face libraries for authenticated downloads
os.environ['HF_TOKEN'] = 'your_huggingface_token_here'

Open and run notebooks sequentially (1-SFT.ipynb, 2-RM Training.ipynb, then 3-RLHF.ipynb), following the step-by-step instructions provided within each notebook or in the YouTube video.
