
InstructGPT: Training Language Models to Follow Instructions (RLHF)

Ouyang et al. (OpenAI) · 2022

L4.2 · Foundation Model Tech Stack · NeurIPS 2022 · #alignment #rlhf

CORE IDEA

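Three-stage alignment: supervised fine-tuning (SFT), then a reward model trained on human preference comparisons, then PPO against that reward model. This is the methodological foundation of ChatGPT.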
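As a rough illustration of the second and third stages, the sketch below computes a pairwise reward-model loss and a KL-penalized reward of the kind optimized with PPO. It is a minimal scalar sketch, not the paper's implementation: the function names `pairwise_rm_loss` and `kl_shaped_reward`, the scalar inputs, and the `beta` value in the usage lines are all illustrative assumptions.

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected)).

    r_chosen / r_rejected are the scalar reward-model scores for the response
    the labeler preferred and the one they did not (hypothetical inputs).
    """
    diff = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_sft: float, beta: float) -> float:
    """Reward used in the RL stage: reward-model score minus a KL-style penalty.

    logp_policy and logp_sft are log-probabilities of the same response under
    the policy being trained and under the frozen SFT model. beta is an
    assumed penalty coefficient.
    """
    return rm_score - beta * (logp_policy - logp_sft)


# Usage with made-up numbers.
loss_tied = pairwise_rm_loss(0.0, 0.0)    # equal scores give log(2)
loss_good = pairwise_rm_loss(2.0, -1.0)   # correct ranking gives a small loss
shaped = kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-3.0, beta=0.1)
```

The penalty term discourages the policy from drifting far from the SFT model while it chases reward-model score, which is why the shaped reward falls when the policy assigns the response a much higher probability than the SFT model does.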

L-ANCHOR · Why it matters at this layer

The starting point for RLHF.

arXiv:2203.02155 ↗

Related papers

Direct Preference Optimization (DPO)

Constitutional AI: Harmlessness from AI Feedback

DeepSeek-R1: Incentivizing Reasoning in LLMs via RL