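Complete References

Textbooks and General Resources

Resource	Author	Year	Notes
Qwen3 Technical Report	Qwen Team	2025	Full technical report for the Qwen3 series; used throughout the course
Reinforcement Learning from Human Feedback	Nathan Lambert	2025	The first comprehensive RLHF textbook; freely available
Hugging Face smol-course	Hugging Face	2024	Open-source hands-on mini-course covering SFT, DPO, and VLMs
Hugging Face Alignment Handbook	Hugging Face	2024	Production-grade recipes for the SFT → DPO/ORPO pipeline
Stanford CS336: Language Modeling from Scratch	Stanford	2025	Assignment 5 covers alignment and reasoning RL
Intro to Post-Training	DeepLearning.AI	2025	Five-module video course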

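Lesson 1: Post-Training Overview and Supervised Fine-Tuning Basics

Core Papers

1. Tülu 3: Pushing Frontiers in Open Language Model Post-Training

Lambert et al. (2024.11) — The most complete open-source post-training recipe; a reference pipeline for SFT → DPO → RLVR

2. LoRA: Low-Rank Adaptation of Large Language Models

Hu et al. (2021) — Foundational paper on parameter-efficient fine-tuning; trains low-rank decomposition matrices while the pretrained weights stay frozen

3. QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers et al. (2023) — NF4 quantization plus LoRA; brings fine-tuning of large models to consumer-grade GPUs

4. LIMA: Less Is More for Alignment

Zhou et al. (2023) — Shows that fine-tuning on 1,000 carefully curated examples can outperform fine-tuning on about 50,000 noisier ones

5. MAGPIE: Alignment Data Synthesis from Scratch

Xu et al. (ICLR 2025) — Synthesizes instruction data by letting an aligned model auto-complete the user turn from a bare chat template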

Further Reading

Scaling Data-Constrained Language Models — Muennighoff et al. (NeurIPS 2023), a systematic study of scaling when training data is limited
DoRA: Weight-Decomposed Low-Rank Adaptation — Liu et al. (2024), an improved variant of LoRA
Spectrum: Targeted Training on Signal to Noise Ratio — Hartford et al. (2024), layer selection based on signal-to-noise ratio

Lesson 2: Advanced SFT and Data Engineering

Core Papers

1. Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

Pareja et al. (2024.12) — Comprehensive hyperparameter guide for SFT of 3B-7B models

2. Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang et al. (2023) — Pioneering work on synthesizing instruction data

3. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng et al. (NeurIPS 2023) — The LLM-as-a-Judge evaluation framework

4. UltraChat: A Large-scale Auto-generated Multi-turn Instruction Dataset

Ding et al. (2023) — Method for building a high-quality multi-turn dialogue dataset

5. Deita: Data-Efficient Instruction Tuning for Alignment

Liu et al. (2024) — Strategy for data-efficient example selection

Further Reading

Alpaca: A Strong, Replicable Instruction-Following Model — Stanford (2023), an early milestone in instruction tuning
GRAPE: The Best Instruction-Tuning Data are Those That Fit — Zhang et al. (2025), selects training data that fits the base model's distribution
WizardLM: Empowering Large Language Models to Follow Complex Instructions — Xu et al. (2023)

Lesson 3: Preference Alignment with DPO and Its Variants

Core Papers

1. Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov et al. (NeurIPS 2023) — The foundational DPO paper; required reading

2. SimPO: Simple Preference Optimization with a Reference-Free Reward

Meng et al. (NeurIPS 2024) — Alignment without a reference model; reported the top AlpacaEval 2 ranking among models under 10B at the time of publication

3. KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh et al. (2024) — Connects behavioral economics (prospect theory) to LLM alignment

4. Unpacking DPO and PPO: Disentangling Best Practices

Ivison et al. (2024.06) — Systematic controlled experiments comparing DPO and PPO

5. A Comprehensive Survey of Direct Preference Optimization

(2024.10, updated through 2025) — Survey covering 20+ DPO variants

Further Reading

ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. (2024), merges SFT and alignment into a single loss
IPO: A General Theoretical Paradigm to Understand Learning from Human Preferences — Azar et al. (Google DeepMind, 2023)
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic, 2022)

Lesson 4: RLHF Principles and Reinforcement Learning for Reasoning (GRPO)

Core Papers

1. Training Language Models to Follow Instructions with Human Feedback

Ouyang et al. (OpenAI, 2022) — InstructGPT; established the three-stage RLHF pipeline

2. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL

DeepSeek-AI (2025.01) — Milestone paper on reasoning ability emerging from RL

3. DeepSeekMath: Pushing the Limits of Mathematical Reasoning

Shao et al. (2024.02) — The paper that introduced the GRPO algorithm

4. Scaling LLM Test-Time Compute Optimally

Snell et al. (2024.08) — Foundational analysis of scaling test-time compute

5. DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu et al. (ByteDance, 2025.03) — Practical improvements to GRPO; fully open-source

Further Reading

Safe RLHF: Safe Reinforcement Learning from Human Feedback — Peking University alignment team (ICLR 2024), decouples helpfulness from harmlessness
Qwen3 Technical Report — Qwen Team (2025), Section 4.2 details the four-stage post-training
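Dr. GRPO (from "Understanding R1-Zero-Like Training: A Critical Perspective") — Liu et al. (Sea AI Lab, 2025), a GRPO modification that removes length bias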
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models — Hu (2025)
Proximal Policy Optimization Algorithms — Schulman et al. (OpenAI, 2017), the original PPO paper

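Lesson 5: Model Compression, Deployment Optimization, and Capability Extension

Core Papers

1. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar et al. (2023) — Quantization method for large models based on approximate Hessian (second-order) information

2. AWQ: Activation-aware Weight Quantization

Lin et al. (MLSys 2024) — Activation-aware quantization that protects salient weight channels

3. Visual Instruction Tuning (LLaVA)

Liu et al. (NeurIPS 2023) — The two-stage VLM training paradigm

4. ToolACE: Winning the Points of LLM Function Calling

Liu et al. (ICLR 2025) — Reports an 8B model surpassing GPT-4 on function calling

5. DeepSeek-R1 (distillation section)

DeepSeek-AI (2025.01) — Distills reasoning ability into smaller models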

Further Reading

Quantization:

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. (2022)
SqueezeLLM: Dense-and-Sparse Quantization — Kim et al. (2023)

Multimodal:

LLaVA-OneVision: Easy Visual Task Transfer — Li et al. (2024)
InternVL: Scaling Up Vision Foundation Models — Chen et al. (2023)
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment — Yu et al. (2023)

Tool use:

Gorilla: Large Language Model Connected with Massive APIs — Patil et al. (2023)
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. (2023)
ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. (2022)

Knowledge distillation:

Distilling the Knowledge in a Neural Network — Hinton et al. (2015), the classic distillation paper

Open-Source Frameworks and Tools

Framework	Purpose	GitHub	Used in course
Hugging Face Transformers	Model loading and inference	github.com/huggingface/transformers	All lessons
TRL	SFT/DPO/GRPO training	github.com/huggingface/trl	Lessons 1-4
PEFT	LoRA/QLoRA adapters	github.com/huggingface/peft	Lessons 1-4
bitsandbytes	Quantization tools	github.com/bitsandbytes-foundation/bitsandbytes	Lessons 1 and 5
LLaMA-Factory	All-in-one fine-tuning framework	github.com/hiyouga/LLaMA-Factory	Lesson 5 (optional)
vLLM	High-performance inference engine	github.com/vllm-project/vllm	Lessons 4-5
OpenRLHF	Distributed RLHF framework	github.com/OpenRLHF/OpenRLHF	Reference
veRL	Large-scale GRPO training	github.com/volcengine/verl	Reference
Outlines	Constrained decoding	github.com/outlines-dev/outlines	Lesson 5

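Dataset Index

Dataset	Type	Size	Lesson	Link
UltraChat-200K	Multi-turn dialogue	200K	Lesson 1	HuggingFace
UltraFeedback	Preference data	64K	Lesson 3	HuggingFace
GSM8K	Math reasoning	8.8K	Lesson 4	HuggingFace
COIG-CQIA	Chinese instructions	Varies	Lesson 2	HuggingFace
PKU-SafeRLHF	Safety preferences	361K	Final project	HuggingFace
LLaVA-Instruct-150K	Visual instructions	150K	Final project	HuggingFace
Glaive FC v2	Function calling	113K	Lesson 5 / final project	HuggingFace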

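Stage	Key papers	Reading priority
Getting started (Lessons 1-2)	LoRA, QLoRA, LIMA	Required
Alignment (Lesson 3)	DPO, SimPO	Required
Reasoning (Lesson 4)	DeepSeek-R1, GRPO	Required
Deployment (Lesson 5)	GPTQ or AWQ (pick one), LLaVA	Recommended
Synthesis (Lesson 6)	Tülu 3, Qwen3 Technical Report	Required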
