Reinforcement Learning


Reinforcement learning (RL) is central to how large language models are aligned and trained to reason. From RLHF to RL-trained Chain-of-Thought reasoning, it supplies core techniques for improving LLM capabilities.

Recommended Learning Resources

Prof. Shiyu Zhao's RL Course (Westlake University)

Highlights: Builds the mathematical foundations of reinforcement learning from the ground up

Resources:

Book & Slides: GitHub repository
Video lectures: Full course on Bilibili
Why this course: Rigorous math derivations with a solid theoretical foundation

RethinkFun RL Series

Recommended creator: @RethinkFun

Core videos:

RL fundamentals explained — plain-language explanations with illustrated principles and formula derivations
PPO and GRPO algorithm walkthroughs — illustrated algorithm internals

Beginner Tutorials

Minimal Introduction to Reinforcement Learning — an intuitive walkthrough of MDP, DP/MC/TD, Q-learning, policy gradients, and PPO
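
As a small illustration of the MC and TD ideas listed above, the sketch below computes Monte Carlo discounted returns for one episode and applies a single TD(0) value update. It is a minimal sketch and is not taken from the tutorial itself. The function names `discounted_returns` and `td0_update` are hypothetical, and the convention that `rewards[t]` is the reward received after step `t` is an assumption.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo returns: G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td0_update(values, state, reward, next_state, alpha, gamma):
    """One TD(0) step: move values[state] toward reward + gamma * values[next_state].

    Updates `values` in place and returns the TD error.
    """
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error
```

Monte Carlo waits for the full episode before computing a return, whereas TD(0) updates after every step by bootstrapping from the current estimate of the next state.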

RL Papers and Projects (Pending Review)

📊 Paper shortlist: the full list of RL papers pending review

Status categories: Not Started, Evaluating, Completed, Not Recommended, Not Open-Sourced

GRPO Reproduction References

TRL Framework

Project: TRL (Transformer Reinforcement Learning)
Highlights: HuggingFace's official RL framework
Supported algorithms: GRPO, PPO, DPO, and more
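
For readers attempting a GRPO reproduction, the step that gives the algorithm its name is the group-relative advantage: several completions are sampled for the same prompt, and each reward is normalized against the others in its group. The sketch below shows only that step in plain Python. It does not use or mirror TRL's API. The function name is hypothetical, and the use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled completions within the group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Assumption: population standard deviation; eps avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline comes from the group itself, this step needs no separate learned value network. A group in which every completion gets the same reward yields all-zero advantages and so contributes no gradient signal.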

Chain-of-Thought (CoT)

Core Concept

Chain-of-Thought (CoT) reasoning has the model write out intermediate reasoning steps before its final answer, which makes its reasoning easier to inspect and tends to improve accuracy on multi-step problems.

Notable Papers and Projects

CoT-Valve: Length-Compressible Chain-of-Thought Tuning

Highlights: A tuning method that lets a model produce reasoning chains of adjustable length, so chains of thought can be compressed when a shorter one suffices
Source: HuggingFace Daily Papers

MCoT (Multimodal Chain-of-Thought)

Project: Awesome-MCoT
Highlights: A curated collection of papers and resources on Chain-of-Thought reasoning across modalities such as images and text

Latent CoT

Project: Awesome-Latent-CoT
Core idea: Move reasoning from linguistic symbols into the latent space to capture richer and more complex thought processes

Multimodal CoT

Chain-of-Thought reasoning that combines visual and textual information, which helps on tasks that require reasoning over both modalities.

Survey on Latent-Space Reasoning

Key survey: HIT's first survey on latent-space reasoning
Core argument: Reshapes the boundaries of LLM reasoning by exploring reasoning mechanisms in the latent space

DeepSeek-R1 Deep Dive

DeepSeek-R1 is a model known for its strong reasoning capabilities. Aspects worth studying in depth:

Reasoning mechanism design
Training strategy analysis
Performance evaluation methodology

Suggested Learning Path

Foundations

Math prerequisites: probability, dynamic programming, optimization theory
Core concepts: MDP, value functions, policies, returns
Classical algorithms: Q-learning, policy gradients, Actor-Critic
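
To make the classical algorithms above concrete, here is a minimal sketch of tabular Q-learning on a toy environment. Everything about the environment is an assumption made for illustration: a five-state line where the agent starts at state 0 and receives a reward of 1.0 on reaching state 4. The names `step`, `train_q_learning`, and `greedy_policy` are hypothetical helpers, not part of any library.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the terminal goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy dynamics: reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train_q_learning(episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning; returns the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so every episode terminates
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state])                  # exploit, random tie-break
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the greedy value of the next state
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Greedy action for each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Calling `greedy_policy(train_q_learning())` should select "move right" in every non-terminal state, since that is the only way to reach the reward in this toy setup.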

Intermediate

Modern algorithms: PPO, TRPO, SAC, TD3
LLM applications: RLHF, Constitutional AI
Chain-of-Thought techniques: CoT, MCoT, Latent CoT
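
Of the modern algorithms listed above, PPO is the one most often met in LLM training, and its central idea is the clipped surrogate objective. The sketch below evaluates that objective for a single action in plain Python. It covers only the clipping term and leaves out the value loss, entropy bonus, and batching. The function name and the default `clip_eps=0.2` are assumptions made for illustration.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    The objective is min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Taking the minimum removes any incentive to push the probability ratio outside the clip range in the direction the advantage favors, which keeps each policy update close to the policy that collected the data.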

Practice

Frameworks: Gymnasium (the maintained successor to OpenAI Gym), Stable Baselines3, TRL
Hands-on projects: game AI, dialogue system optimization
Paper reproduction: reproducing and improving upon key algorithms

Application Areas

LLM Alignment

RLHF: Learning from human feedback to improve output quality
Constitutional AI: Principle-based approach to AI alignment
DPO: Direct Preference Optimization, a simpler alternative to PPO-based RLHF that trains directly on preference pairs without a separate reward model
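
To show how DPO differs from an RL training loop, the sketch below computes the DPO loss for one preference pair from sequence log-probabilities under the policy and under a frozen reference model. It is a minimal plain-Python sketch, not a library implementation. The function and argument names are hypothetical, and `beta=0.1` is an assumed default.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)  # identical to -log(sigmoid(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.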

Reasoning Capability

Chain-of-Thought reasoning: Improves performance on complex reasoning tasks
Tool use: Training models to invoke external tools
Code generation: Improves programming capability

Multi-Agent Systems

Cooperative learning: Multiple agents solve problems together
Competitive learning: Individual capabilities improve through competition
Social learning: Agents learn from the behavior of other agents

Frontier Trends

Offline RL: Learning policies from static datasets
Meta-learning: Algorithms that adapt quickly to new tasks
Safe RL: Ensuring the safety of both the learning process and the learned policy
Explainable RL: Improving the interpretability of decision-making

Additional Notes

Chain-of-Thought topics: CoT, Multimodal CoT (MCoT), Latent CoT
- MCoT paper and resource collection: https://github.com/yaotingwangofficial/Awesome-MCoT
GRPO learning resources:
- Bilibili playlist: https://space.bilibili.com/18235884/search?keyword=GRPO
- PPO/GRPO algorithm explainer: https://www.bilibili.com/video/BV15cZYYvEhz/
RL math foundations textbook:
- GitHub: https://github.com/MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning
