DeepSeek-R1: Incentivizing Reasoning in LLMs via RL

DeepSeek · 2025

L4.2 · Foundation Model Tech Stack · arXiv:2501.12948 · #reasoning-model #rl

CORE IDEA

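GRPO runs RL on reasoning tasks without a learned critic (value model), and R1-Zero scores outputs with rule-based rewards instead of a neural reward model. R1-Zero is trained with pure RL, with no SFT stage, and reasoning behavior still emerges.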
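To make the idea concrete, here is a minimal sketch of the two pieces that replace a learned reward model and critic: a rule-based reward and a group-relative advantage. It is an illustration under stated assumptions, not the paper's implementation. The function names, the simplified `<think>...</think>` template, the 0.5/1.0 reward weights, and the use of population standard deviation are all hypothetical choices made for this example. The clipped policy-gradient objective and KL penalty that consume these advantages are omitted.

```python
import re
import statistics


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: a format check plus exact-match accuracy.

    Assumption: the completion should look like '<think>...</think> ANSWER'.
    The 0.5 / 1.0 weights are illustrative only.
    """
    match = re.fullmatch(
        r"\s*<think>.*?</think>\s*(.*?)\s*", completion, flags=re.DOTALL
    )
    if match is None:
        return 0.0  # wrong format: no reward at all
    reward = 0.5  # format reward
    if match.group(1) == reference_answer:
        reward += 1.0  # accuracy reward
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    The group mean acts as the baseline, so no value network is needed.
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, a group of four sampled completions (made-up data).
    group = [
        "<think>2 + 2 = 4</think> 4",
        "<think>maybe 5?</think> 5",
        "4",  # correct answer but missing the reasoning format
        "<think>two plus two is four</think> 4",
    ]
    rewards = [rule_based_reward(c, "4") for c in group]
    for completion, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{adv:+.3f}  {completion!r}")
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so the only training signal needed is a comparison among samples for the same prompt.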

CONCRETE EXAMPLE

DeepSeek-R1 (671B parameters) comes close to OpenAI o1 on the AIME and MATH benchmarks.

L-ANCHOR · Why it matters at this layer

The starting point for open-source reasoning models.

Related papers