1. LLM Post-Training Techniques
1.1. Learning Resources
- huggingface-llm-course HuggingFace's LLM course; mainly went through Chapter 11 (alignment) and Chapter 12 (reasoning models).
- huggingface-smol-course HuggingFace's SMOL course; learn alignment techniques with small models.
- A summary of mainstream industrial post-training techniques for large language models.
- Training a large model from scratch.
1.2. Open-Source Tools
- TRL HuggingFace's RLHF library
- DeepSpeed-Chat Microsoft's RLHF implementation
- verl Volcano Engine's RLHF implementation
- open-r1
- OpenRLHF
1.3. Research Labs
- Sea AI Lab Sea's AI research lab, based in Singapore
1.4. Core Modules
1.4.1. Alignment Algorithms
1.4.1.1. SFT
- [2023.08] Aligning Language Models with Offline Learning from Human Feedback Conditional SFT: different samples get different weights.
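A minimal sketch of the per-sample weighting idea, assuming each response already carries a scalar weight derived from feedback; the function name and tensor shapes below are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, sample_weights, ignore_index=-100):
    """Per-sample weighted SFT loss (sketch): average the token-level NLL per
    sequence, then scale each sequence by its feedback-derived weight.
    logits: (batch, seq_len, vocab); labels: (batch, seq_len); sample_weights: (batch,)."""
    vocab = logits.size(-1)
    token_nll = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=ignore_index, reduction="none",
    ).view(labels.shape)
    mask = (labels != ignore_index).float()
    seq_nll = (token_nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (sample_weights * seq_nll).mean()
```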
1.4.1.2. PEFT
Related materials:
Algorithms:
- LoRA (a minimal sketch follows this list)
- QLoRA
- Adapter
- Prefix Tuning
- Prompt Tuning
- BitFit
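As referenced above, a minimal LoRA sketch assuming a frozen `nn.Linear` base layer; `LoRALinear`, `r`, and `alpha` are illustrative names, not tied to any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W x + (alpha / r) * B A x, with the pretrained
    weight W frozen and only the low-rank factors A and B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and bias)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Wrapping, e.g., the attention projection layers this way leaves only a small fraction of parameters trainable; QLoRA keeps the same adapter structure while storing the frozen base weights in 4-bit precision.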
1.4.1.3. DPO
- [2024.05] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF XPO adds an exploration-encouraging regularization term to the Online DPO loss.
- [2024.04] Binary Classifier Optimization for Large Language Model Alignment BCO uses a binary cross-entropy loss, with reward shifting and underlying-distribution matching.
- [2024.03] ORPO: Monolithic Preference Optimization without Reference Model Builds on DPO but removes the reference model.
- [2024.02] Direct Language Model Alignment from Online AI Feedback Online DPO: updates with online data and dynamically adjusts the preference dataset to mitigate distribution shift; an LLM plus a prompt labels samples on the fly to build preference pairs.
- [2024.02] KTO: Model Alignment as Prospect Theoretic Optimization Based on prospect theory; directly optimizes human-perceived utility instead of the conventional preference log-likelihood.
- [2024.01] Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation CPO combines a negative log-likelihood loss with a preference loss.
- [2023.12] Nash Learning from Human Feedback Computes logits from the actor and reference models separately, then takes a weighted sum to form an additional (mixture) policy.
- [2023.10] A General Theoretical Paradigm to Understand Learning from Human Preferences IPO is roughly DPO with a regularization term added to the loss.
- [2023.05] Direct Preference Optimization: Your Language Model is Secretly a Reward Model / [notes] Directly optimizes the policy from preference data, bypassing explicit reward modeling (see the sketch after this list).
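A minimal sketch of the DPO objective from the last entry above, assuming the per-sequence log-probabilities of the chosen and rejected responses have already been computed under the policy and the frozen reference model; the function name and `beta` default are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss on summed sequence log-probs:
    -log sigmoid(beta * [(log pi - log ref)_chosen - (log pi - log ref)_rejected])."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

For comparison, IPO (listed above) keeps the same log-ratio difference but replaces the log-sigmoid with a squared loss toward a fixed target, which plays the role of the regularizer mentioned in that note.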
1.4.1.4. RL
- [2025.04] Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
- [2025.04] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce RAFT++.
- [2025.04] VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks VAPO, ByteDance Seed; adds a value function back.
- [2025.03] DAPO: An Open-Source LLM Reinforcement Learning System at Scale DAPO, ByteDance Seed: raises the clip upper bound, uses dynamic sampling to drop prompts whose sampled responses all receive reward 1 (no learning signal), applies a soft over-length penalty, and removes the KL term.
- [2025.03] What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret VC-PPO, ByteDance Seed. The problem in long CoT is inaccurate value estimates: tokens near the end get large V and small A. Pretrain the value model with GAE lambda = 1; during policy learning compute A with lambda = 0.95 to reduce variance, since A is then not biased by the introduced value estimate.
- [2025.03] Understanding R1-Zero-Like Training: A Critical Perspective Dr. GRPO
- [2025.02] Process Reinforcement through Implicit Rewards PRIME
- [2025.01] REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models Normalizes advantages within the batch and uses KL divergence as a regularization term.
- [2024.07] ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
- [2024.04] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- [2024.02] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs RLOO: avoids the value model and GAE to reduce memory usage, and normalizes with a leave-one-out baseline (see the advantage sketch after this list).
- [2024.01] ReFT: Reasoning with Reinforced Fine-Tuning ReFT: uses the same data as SFT, but samples more CoT rollouts and optimizes with PPO.
- [2023.10] ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models ReMax: uses a greedy-decoding rollout as the baseline.
- [2023.07] Secrets of RLHF in Large Language Models Part I: PPO PPO-max
- [2023.05] Let's Verify Step by Step
- [2023.03] Self-Refine: Iterative Refinement with Self-Feedback Uses few-shot prompting to obtain feedback, then refines the answer.
- [2022.12] Constitutional AI: Harmlessness from AI Feedback
- [2022.11] Solving math word problems with process- and outcome-based feedback
- [2022.04] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- [2018.06] Self-Imitation Learning
- [2017.07] Proximal Policy Optimization Algorithms
1.4.1.5. Reasoning and Tools
- [2022.11] PAL: Program-aided Language Models PAL
- [2022.10] ReAct: Synergizing Reasoning and Acting in Language Models ReAct, Google: interleaves query, thought, action, and observed result (a minimal loop sketch follows).
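A minimal ReAct-style loop sketch, assuming a hypothetical `llm` callable that emits one step at a time in a "Thought: ... / Action: tool[input]" or "Final Answer: ..." format, and a `tools` dict mapping tool names to functions; none of these names come from the paper.

```python
def react_loop(llm, tools: dict, question: str, max_steps: int = 5):
    """Alternate model reasoning and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # model emits Thought/Action or Final Answer
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # assumed action format: "Action: tool_name[tool_input]"
            action_line = step.split("Action:", 1)[1].strip().splitlines()[0]
            name, _, arg = action_line.partition("[")
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None
```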
1.4.1.6. Distillation
- [2023.06] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes GKD addresses the train/inference mismatch (a loss sketch follows).
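A minimal sketch of the on-policy distillation step, under the assumption that the training sequences are sampled from the student itself and the divergence is token-level reverse KL (one member of GKD's generalized divergence family); names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, labels, ignore_index=-100):
    """Token-level reverse KL(student || teacher) on sequences generated by the
    student, so the training distribution matches what the student sees at
    inference time. Logits: (batch, seq_len, vocab); labels mark padding."""
    mask = (labels != ignore_index).float()
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)   # per-token reverse KL
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```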
1.4.2. Reward Model
- [2025.04] Inference-Time Scaling for Generalist Reward Modeling / [paper notes] SPCT: the model generates its own principles, then generates critiques/scores.
- [2024.06] HelpSteer2: Open-source dataset for training top-performing reward models
- [2024.03] RewardBench: Evaluating Reward Models for Language Modeling
- [2024.01] Secrets of RLHF in Large Language Models Part II: Reward Modeling (the standard pairwise loss these reward models build on is sketched after this list)
- [2023.06] PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
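For reference, a minimal sketch of the Bradley-Terry pairwise loss that scalar reward models are typically trained with, assuming the scalar reward head has already been evaluated on the chosen and rejected responses; this is generic background, not the method of any single paper above.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor):
    """Bradley-Terry style reward-model loss: -log sigmoid(r_chosen - r_rejected),
    averaged over preference pairs in the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```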
1.5. Specialized Directions
1.5.1. Formal Theorem Proving
- [2025.04] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning MiniF2F-test 80+%
- [2025.04] Leanabell-Prover: Posttraining Scaling in Formal Reasoning MiniF2F-test 59.8%
- [2024.08] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search Reaches 63.5% on miniF2F-test.
1.5.2. Role-Playing
- [2023.03] Rewarding Chatbots for Real-World Engagement with Millions of Users Paper from Chai; uses RLHF to optimize a chatbot for engagement.