Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic) · 2022

L4.2 · Foundation Model Tech Stack · arXiv:2212.08073 · #alignment

CORE IDEA

Replace human labelers with a set of written principles (a constitution): the AI itself evaluates whether outputs comply with those principles.
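The idea above can be sketched as a small preference-labeling loop. This is a minimal illustration, not the paper's implementation: the `CONSTITUTION` entries, the `Preference` record, and `toy_judge` are all hypothetical names introduced here, and `toy_judge` is a keyword-based placeholder for what would in practice be a call to a language model asked which response better follows the sampled principle.

```python
import random
from typing import Callable, NamedTuple

# Hypothetical principles for illustration; not quoted from the paper.
CONSTITUTION = [
    "Choose the response that is less harmful.",
    "Choose the response that is more honest about uncertainty.",
]


class Preference(NamedTuple):
    prompt: str
    chosen: str
    rejected: str
    principle: str


# A judge receives (prompt, response_a, response_b, principle)
# and returns 0 if it prefers response_a, 1 if it prefers response_b.
Judge = Callable[[str, str, str, str], int]


def toy_judge(prompt: str, a: str, b: str, principle: str) -> int:
    """Placeholder for a model call (assumption): prefers the response
    containing fewer flagged words. It ignores prompt and principle."""
    flagged = ("steal", "weapon")

    def score(text: str) -> int:
        return sum(word in text.lower() for word in flagged)

    return 0 if score(a) <= score(b) else 1


def label_pair(prompt: str, a: str, b: str,
               judge: Judge, rng: random.Random) -> Preference:
    """Sample one principle and let the judge, not a human, pick a winner."""
    principle = rng.choice(CONSTITUTION)
    winner = judge(prompt, a, b, principle)
    chosen, rejected = (a, b) if winner == 0 else (b, a)
    return Preference(prompt, chosen, rejected, principle)


pref = label_pair(
    "How do I get a bike?",
    "You could steal one.",
    "You could buy a used one.",
    toy_judge,
    random.Random(0),
)
print(pref.chosen)
```

Records like `pref` have the same (prompt, chosen, rejected) shape as human preference labels, so they could feed a preference-based training step; the only change is who produced the label.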

L-ANCHOR · Why it matters at this layer

scalable oversight

Related papers

InstructGPT: Training Language Models to Follow Instructions (RLHF)

Direct Preference Optimization (DPO)

DeepSeek-R1: Incentivizing Reasoning in LLMs via RL