Reinforcement Learning
Reinforcement learning (RL) is central to how large language models are trained, particularly for reasoning and alignment. From RLHF to Chain-of-Thought reasoning, RL supplies core techniques for improving LLM capabilities.
Recommended Learning Resources
Prof. Shiyu Zhao's RL Course (Westlake University)
Highlights: The mathematical foundations of reinforcement learning, from scratch to deep understanding
Resources:
- Book & Slides: GitHub repository
- Video lectures: Full course on Bilibili
- Why this course: Rigorous math derivations with a solid theoretical foundation
RethinkFun RL Series
Recommended creator: @RethinkFun
Core videos:
- RL fundamentals explained — plain-language explanations with illustrated principles and formula derivations
- PPO and GRPO algorithm walkthroughs — illustrated algorithm internals
Beginner Tutorials
- Minimal Introduction to Reinforcement Learning — an intuitive walkthrough of MDP, DP/MC/TD, Q-learning, policy gradients, and PPO
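To make the Q-learning item above concrete, here is a minimal sketch of tabular Q-learning. The toy chain environment, the function name `q_learning_chain`, and all hyperparameter values are assumptions chosen for illustration; they are not taken from the tutorial.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    Actions: 0 = move left, 1 = move right. Reaching the rightmost state
    yields reward 1 and ends the episode; every other step yields 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # TD target: the terminal state contributes no future value.
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q

if __name__ == "__main__":
    for s, (left, right) in enumerate(q_learning_chain()):
        print(f"state {s}: Q(left)={left:.3f}  Q(right)={right:.3f}")
```

After training, the "right" action should carry the higher value in every non-terminal state, which is the greedy policy one would expect for this chain.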
RL Papers and Projects (Pending Review)
Paper shortlist: the full list of RL papers pending review
Status categories: Not Started, Evaluating, Completed, Not Recommended, Not Open-Sourced
GRPO Reproduction References
TRL Framework
- Project: TRL (Transformer Reinforcement Learning)
- Highlights: HuggingFace's official RL framework
- Supported algorithms: GRPO, PPO, DPO, and more
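As a starting point for a GRPO reproduction, the sketch below shows only the group-relative advantage step: several completions are sampled for one prompt, and each reward is normalized against the group's mean and standard deviation. This is plain Python, not TRL's API; the function name and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population standard deviation is an assumption here;
    implementations may differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Four completions for one prompt, scored 1 (correct) or 0 (incorrect).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline comes from the group itself, this step needs no separate value network, which is the main structural difference from PPO.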
Chain-of-Thought (CoT)
Core Concept
Chain-of-Thought (CoT) reasoning has the model write out its intermediate reasoning steps before giving a final answer. Exposing the reasoning process in this way improves both interpretability and reasoning capability.
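A small illustration of the idea: the sketch below builds a prompt that asks for step-by-step reasoning and then pulls the final answer out of a completion. The prompt wording, the `Answer:` line convention, and both function names are assumptions for this example, not a standard format.

```python
import re

def build_cot_prompt(question):
    """Wrap a question in an instruction that asks for visible reasoning."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, then give the result on a "
        "final line formatted as 'Answer: <value>'.\n"
    )

def extract_final_answer(completion):
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None

if __name__ == "__main__":
    print(build_cot_prompt("What is (3 + 4) * 2?"))
    # A hand-written completion standing in for real model output.
    sample = "Step 1: 3 + 4 = 7\nStep 2: 7 * 2 = 14\nAnswer: 14"
    print(extract_final_answer(sample))
```

Separating the reasoning from a fixed-format final line makes the answer easy to check automatically, which is useful when a reward is computed from correctness.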
Notable Papers and Projects
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
- Highlights: A technique for tuning Chain-of-Thought reasoning with compressible length
- Source: HuggingFace Daily Papers
MCoT (Multimodal Chain-of-Thought)
- Project: Awesome-MCoT
- Highlights: A curated collection of papers and resources on multimodal Chain-of-Thought reasoning for complex reasoning tasks
Latent CoT
- Project: Awesome-Latent-CoT
- Core idea: Move reasoning from linguistic symbols into the latent space to capture richer and more complex thought processes
Multimodal CoT
Multimodal CoT combines visual and textual information in the reasoning chain and shows strong capability on multimodal tasks.
Survey on Latent-Space Reasoning
- Key survey: The first survey on latent-space reasoning, from HIT
- Core argument: Reshapes the boundaries of LLM reasoning by exploring reasoning mechanisms in the latent space
DeepSeek-R1 Deep Dive
DeepSeek-R1 is a model with standout reasoning capabilities, and its technical details are worth studying in depth:
- Reasoning mechanism design
- Training strategy analysis
- Performance evaluation methodology
Suggested Learning Path
Foundations
- Math prerequisites: probability, dynamic programming, optimization theory
- Core concepts: MDP, value functions, policies, returns
- Classical algorithms: Q-learning, policy gradients, Actor-Critic
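Of the core concepts listed above, the return is the easiest to pin down in code. The sketch below computes the discounted return G_t = r_t + gamma * G_{t+1} for every step of a finished episode; the function name and the default discount factor are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each step of one episode.

    Iterating backwards lets each return reuse the one after it.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

A value function is the expectation of this quantity under a policy, so Monte Carlo methods estimate values by averaging exactly these per-step returns.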
Intermediate
- Modern algorithms: PPO, TRPO, SAC, TD3
- LLM applications: RLHF, Constitutional AI
- Chain-of-Thought techniques: CoT, MCoT, Latent CoT
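For the PPO entry above, the piece most worth internalizing is the clipped surrogate objective. The sketch below evaluates it for a single action; the function name and the default clip range of 0.2 are assumptions for illustration, and a real implementation would average this over a batch and add value and entropy terms.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Taking the minimum of the clipped and unclipped terms removes the
    incentive to move the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

if __name__ == "__main__":
    # The new policy makes a good action 1.5x more likely: the gain is capped.
    print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
    # The new policy makes a bad action 1.5x more likely: the penalty is not capped.
    print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0))
```

The asymmetry in the two printed cases is the point of the design: improvements are capped, while harmful changes are still penalized in full.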
Practice
- Frameworks: Gymnasium (the maintained successor to OpenAI Gym), Stable Baselines3, TRL
- Hands-on projects: game AI, dialogue system optimization
- Paper reproduction: reproducing and improving upon key algorithms
Application Areas
LLM Alignment
- RLHF: Learning from human feedback to improve output quality
- Constitutional AI: Principle-based approach to AI alignment
- DPO: Direct Preference Optimization — a simplified alternative to RLHF
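To show why DPO counts as a simplification, the sketch below computes its loss for one preference pair directly from sequence log-probabilities, with no reward model and no sampling loop. The function and argument names and the default beta are assumptions for illustration; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more
    the policy prefers the chosen response than the reference does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```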
Reasoning Capability
- Chain-of-Thought reasoning: Improves performance on complex reasoning tasks
- Tool use: Training models to invoke external tools
- Code generation: Improves programming capability
Multi-Agent Systems
- Cooperative learning: Multiple agents solve problems together
- Competitive learning: Individual capabilities improve through competition
- Social learning: Agents learn from the behavior of other agents
Frontier Trends
- Offline RL: Learning policies from static datasets
- Meta-learning: Algorithms that adapt quickly to new tasks
- Safe RL: Ensuring the safety of both the learning process and the learned policy
- Explainable RL: Improving the interpretability of decision-making
Additional Notes
- Chain-of-Thought / MCoT / Latent CoT
- GRPO learning resources:
  - Bilibili playlist: https://space.bilibili.com/18235884/search?keyword=GRPO
  - PPO/GRPO algorithm explainer: https://www.bilibili.com/video/BV15cZYYvEhz/
  - Paper and resource collection: https://github.com/yaotingwangofficial/Awesome-MCoT
- RL math foundations textbook (GitHub): https://github.com/MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning
Contributors
- github-actions[bot]: 1 contribution, most recent 2026/05/11
- longsizhuo: 1 contribution, most recent 2026/05/06
- Mira190: 1 contribution, most recent 2025/09/17
Involution Hell © 2026 by Community under CC BY-NC-SA 4.0