Direct Preference Optimization (DPO)
Rafailov, et al. (Stanford) · 2023
L4.2 · Foundation Model Tech Stack
NeurIPS 2023
#alignment
CORE IDEA
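DPO uses the closed-form solution of the KL-regularized RLHF objective to train an LLM directly on preference data, bypassing the separate reward model and PPO.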
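As a rough illustration of the idea above, the per-pair DPO loss can be sketched in a few lines of plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this sketch, not values taken from this card. Each argument is the summed log-probability of a full response under either the policy being trained or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    All names and the default beta are illustrative assumptions.
    """
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the preferred and the rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy shifts probability toward the preferred response: the loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

No reward model is fit and no policy rollouts are sampled here: the only inputs are log-probabilities from two forward passes per response, which is why the training loop looks like ordinary supervised fine-tuning.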
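L-ANCHOR · Why it matters at this layer
Simpler alignment: a single supervised-style training stage on preference pairs replaces the reward-model-plus-PPO pipeline.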
arXiv:2305.18290
Related papers
InstructGPT: Training Language Models to Follow Instructions (RLHF)
L4.2
2022
Constitutional AI: Harmlessness from AI Feedback
L4.2
2022
DeepSeek-R1: Incentivizing Reasoning in LLMs via RL
L4.2
2025