
Q-Learning

Watkins · 1989

L5.1 · Algorithmic Foundations · PhD thesis, Cambridge · #rl

CORE IDEA

Off-policy temporal-difference control; the ancestor of value-based RL.
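To make "off-policy" concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not code from the thesis: the five-state chain environment, the function names (q_learning_update, train_chain), and the hyperparameters are all hypothetical. The behaviour policy picks actions uniformly at random, while the update bootstraps from the greedy (max) action in the next state, so the learned values describe the greedy policy rather than the random one that generated the data.

```python
import random


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place; return the TD error."""
    # Off-policy target: bootstrap from the best action in next_state,
    # regardless of which action the behaviour policy takes there.
    best_next = 0.0 if done else max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


def train_chain(n_states=5, episodes=500, max_steps=1000, seed=0):
    """Learn Q-values on a hypothetical chain: states 0..n_states-1,
    actions 0 = left, 1 = right, reward 1.0 on reaching the last state."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.randrange(2)  # behaviour policy: uniformly random
            next_state = max(state - 1, 0) if action == 0 else state + 1
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            q_learning_update(q, state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    q_table = train_chain()
    # Greedy policy read off the learned table for the non-terminal states.
    policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
    print(policy)
```

Although the data comes from a purely random walk, the greedy policy extracted from the table should point toward the rewarding end of the chain, which is the sense in which the method is off-policy.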

L-ANCHOR · Why it matters at this layer

An RL classic.

Related papers

QuantFactor REINFORCE

DeepSeek-R1: Incentivizing Reasoning in LLMs via RL

Playing Atari with Deep Reinforcement Learning (DQN)

Mastering the Game of Go with Deep Neural Networks and Tree Search (AlphaGo)