Proximal Policy Optimization (PPO)
Schulman et al. (OpenAI) · 2017
L5.1 · Algorithmic Foundations
arXiv:1707.06347
#rl
CORE IDEA
A simplified variant of TRPO: it clips the size of each policy update, and it is the algorithm used in RLHF/InstructGPT.
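To make the clipping idea concrete, here is a minimal sketch of the clipped surrogate objective in plain Python. The function name `clipped_surrogate`, its list-based inputs, and the default `epsilon=0.2` are illustrative assumptions, not code from the paper or from any library; a real implementation would work on batched tensors and add value-function and entropy terms.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    ratios[i]     : pi_new(a_i | s_i) / pi_old(a_i | s_i), the probability ratio
    advantages[i] : advantage estimate for the same sample
    epsilon       : clip range (0.2 is an assumed, commonly used default)
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("need at least one sample")

    total = 0.0
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the smaller of the unclipped and clipped terms, so the
        # objective is a pessimistic (lower) bound on the unclipped one.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

With a positive advantage, a ratio above `1 + epsilon` earns no extra objective, which removes the incentive to push the policy further in that direction. With a negative advantage and a ratio above `1 + epsilon`, the unclipped term is the smaller one, so the penalty is kept in full.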
L-ANCHOR · Why it matters at this layer
The industry standard for policy optimization.
Related papers
QuantFactor REINFORCE · L0.3 · 2024
DeepSeek-R1: Incentivizing Reasoning in LLMs via RL · L4.2 · 2025
Q-Learning · L5.1 · 1989
Playing Atari with Deep Reinforcement Learning (DQN) · L5.1 · 2013