The outstanding achievements of OpenAI’s o1 series and DeepSeek-R1 have clearly shown the power of large-scale reinforcement learning (RL) in enabling advanced reasoning abilities and greatly improving the performance of large language models (LLMs).
Despite these successes, the core training methods behind these advanced reasoning models are often not fully disclosed in their technical reports. Most recent community efforts have concentrated on mathematical reasoning, leaving cross-domain generalization largely unaddressed. Additionally, standard Group Relative Policy Optimization (GRPO) training faces several common problems: performance bottlenecks, inefficient use of training samples, and difficulty developing specialized reasoning skills on datasets that mix multiple domains. These issues make it hard to scale RL methods for LLMs effectively.
To overcome these challenges, researchers from the Kwaipilot team at Kuaishou have developed a new reinforcement learning framework called Two-Staged history-Resampling Policy Optimization (SRPO), designed to systematically address the training difficulties described above. The team has released a detailed technical report explaining their training method and has also open-sourced the SRPO-Qwen-32B model.
Importantly, this work represents the first time that DeepSeek-R1-Zero-level performance has been achieved simultaneously in both mathematical and coding domains. Using the same base model as DeepSeek (Qwen2.5-32B) and relying solely on reinforcement learning for training, SRPO has delivered impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks. These results surpass those of DeepSeek-R1-Zero-32B.
Even more striking is that SRPO reached this level of performance using only one-tenth of the training steps required by R1-Zero.
When the Kwaipilot team initially experimented with the standard GRPO algorithm, they quickly encountered several bottlenecks that prevented the model from achieving the desired R1-Zero performance. One major issue was cross-domain optimization conflicts between math and code data. Mathematical problems tend to require longer, more detailed reasoning processes, often called Long Chain-of-Thought (Long CoT), while code data generally involves shorter, more direct responses. Mixing these two types of data directly caused conflicts that led to subpar performance in both domains.
Another problem was reduced training efficiency caused by similar group rewards. GRPO computes each rollout's advantage by normalizing its reward against the mean and spread of rewards within its sampled group. When many rollouts in a group have nearly identical reward values, the advantage approaches zero. If this happens frequently in a training batch, the effective gradient contributions become very small, drastically lowering training efficiency.
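The effect is easy to see in a minimal sketch of the group-relative advantage, here assuming the common mean/standard-deviation normalization with a small epsilon for numerical stability:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage in the GRPO style: each rollout's
    reward is normalized by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with diverse outcomes yields informative advantages.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
# Identical rewards collapse every advantage to zero, so the
# whole group contributes essentially no gradient signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```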
The team also observed premature performance saturation. GRPO training often hit early plateaus and reward saturation on benchmark tests. This was partly due to insufficient data quality. When the training data lacked complexity or diversity, especially with many simple problems, the model tended to maintain conservative performance on easier tasks. This limited its ability to develop the complex, in-depth reasoning needed for harder problems.
To address the conflicting response length requirements between math and code domains, the Kwaipilot team introduced a two-stage training approach. The first stage focuses solely on challenging mathematical data. The goal here is to fully encourage the model’s test-time scaling abilities, such as reflective pausing, backtracking, and step-by-step problem decomposition.
In the second stage, code data is introduced. Building on the reasoning foundation from stage one, this phase aims to improve coding skills while progressively enhancing procedural thinking, recursion, and tool-calling abilities.
The team analyzed how different training data strategies affected response length and performance. Models trained on mixed math and code data showed limited growth in response length and poor benchmark results. While math problems triggered some reasoning patterns, code problems often led to short, direct answers focused on immediate code output with little prior analysis or planning.
Training only on math data resulted in steady increases in response length and excellent math benchmark performance. This approach fostered strong, generalizable reasoning skills. When faced with programming tasks, the model attempted detailed, step-by-step reasoning, including careful checking and revisiting of steps.
Code-only training improved code benchmark performance but showed minimal development of explicit reasoning behavior. Response lengths were shorter compared to math-only training, and code solutions were often generated directly without much stepwise reasoning or initial analysis.
The staged training method proposed by Kwaipilot produced the best results across both math and programming domains. The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning for programming tasks. Notably, complex behaviors emerged, such as the model spontaneously using code to assist in mathematical reasoning.
During the mid-to-late stages of training, the team noticed that nearly half of the sampled groups in a batch produced identical rewards. This often happened when the model consistently solved easier problems, resulting in minimal reward variance and ineffective gradient updates.
To improve training efficiency and the quality of gradient signals, they introduced History Resampling. During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they reconstructed the dataset for the next epoch based on specific criteria.
They filtered out overly simple samples where all rollouts were correct, since these provide no useful signal for policy improvement. Samples with mixed outcomes (some rollouts correct, some incorrect) were retained because they produce positive reward variance, and hence non-zero advantages and effective gradient signals. Difficult samples where every rollout was incorrect in the current epoch were also kept: such problems might become solvable for the updated policy and produce effective gradients in future training. This approach aligns with curriculum learning principles, gradually exposing the model to increasingly challenging samples to enhance training efficiency.
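The resampling rule amounts to a simple filter over the rollout rewards recorded for each sample during the epoch. The record format and binary 1.0/0.0 rewards below are illustrative assumptions, not the team's actual data schema:

```python
def resample_history(epoch_records):
    """History Resampling sketch: rebuild the next epoch's dataset
    from per-sample rollout rewards recorded during this epoch.

    epoch_records: dict mapping sample_id -> list of rollout rewards
    (1.0 = correct, 0.0 = incorrect).  Hypothetical format.
    """
    kept = []
    for sample_id, rewards in epoch_records.items():
        if all(r == 1.0 for r in rewards):
            # Every rollout correct: the sample is too easy and
            # yields zero reward variance, so drop it.
            continue
        # Mixed outcomes give non-zero advantages now; all-incorrect
        # samples are kept because they may become solvable later.
        kept.append(sample_id)
    return kept

records = {
    "easy":  [1.0, 1.0, 1.0, 1.0],   # filtered out
    "mixed": [1.0, 0.0, 1.0, 0.0],   # kept
    "hard":  [0.0, 0.0, 0.0, 0.0],   # kept for future epochs
}
print(resample_history(records))  # ['mixed', 'hard']
```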
Compared to the Dynamic Sampling method used in DAPO, History Resampling significantly improved computational efficiency and led to more stable growth in response length.
The Kwaipilot team carefully cleaned and filtered publicly available Code & Math datasets. They applied heuristic rules to remove irrelevant URLs and formatting noise, ensuring the completeness of core fields such as questions and answer ground truths. Following the PRIME data cleaning approach for math data, they excluded multi-part questions, pure proof-based problems, and those requiring image or table understanding. For code data, they removed problems dependent on specific environments, file I/O, or network interactions, focusing instead on algorithmic logic.
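A minimal sketch of this kind of heuristic cleaning, assuming a simple question/answer record schema (the field names are illustrative, not the team's actual format):

```python
import re

# Strip URLs, one common source of formatting noise in scraped data.
URL_RE = re.compile(r"https?://\S+")

def clean_sample(sample):
    """Heuristic cleaning sketch: remove URL noise from the question
    and drop samples missing a question or ground-truth answer.
    The 'question'/'answer' field names are assumptions."""
    question = URL_RE.sub("", sample.get("question", "")).strip()
    answer = sample.get("answer")
    # Keep only samples whose core fields survive cleaning.
    if not question or not answer:
        return None
    return {"question": question, "answer": answer}

raw = {"question": "See https://example.com for context. What is 2+3?",
       "answer": "5"}
print(clean_sample(raw))
```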
Before feeding data into training, they verified the correctness of both math and code problems to ensure answer accuracy and solvability, discarding any with incorrect or ambiguous solutions. They then assessed problem difficulty, categorizing each as easy, medium, or hard based on pass rates (Pass@k).
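Difficulty bucketing by pass rate can be sketched as follows; the report does not give exact cut-offs, so the thresholds below are assumptions for illustration:

```python
def classify_difficulty(num_passed, k, easy_threshold=0.8, hard_threshold=0.2):
    """Bucket a problem as easy/medium/hard from its empirical
    pass rate over k rollouts (a Pass@k-style estimate).
    Threshold values are illustrative assumptions."""
    pass_rate = num_passed / k
    if pass_rate >= easy_threshold:
        return "easy"
    if pass_rate <= hard_threshold:
        return "hard"
    return "medium"

print(classify_difficulty(15, 16))  # 'easy'
print(classify_difficulty(6, 16))   # 'medium'
print(classify_difficulty(1, 16))   # 'hard'
```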
The experimental results using SRPO showed clear improvements. During training, the reward curve initially rose before plateauing, prompting the transition to the second stage. At the start of stage two, overall reward dropped due to the model’s lack of prior code training but then steadily increased as training continued. Introducing code data did not significantly increase response length, which matched expectations. Benchmark results showed continuous, stable improvements in both math and coding abilities, demonstrating the effectiveness of the new method.
History Resampling ensured that gradient updates remained effective at every training step by increasing the proportion of informative gradients. This improved sampling efficiency led to stable reward growth, clearly highlighting the enhanced training efficiency achieved through resampling.
The Kwaipilot team identified three key reflective reasoning patterns in the model’s behavior: rechecking, hesitation, and exploration. They statistically analyzed responses containing these patterns and recorded average response lengths. Over the course of RL training, the frequency of the model’s self-reflection, correction, and backtracking gradually increased. This indicated the emergence of a “self-verification” ability. The team suggested that this reflective behavior, similar to human cognitive processes, arises naturally from the policy optimization process.
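One way such an analysis might be done is to count responses containing marker phrases for each pattern and track average response length. The marker lists below are illustrative assumptions, not the team's actual criteria:

```python
# Illustrative marker phrases for each reflective pattern; the
# report's actual pattern definitions are not specified here.
PATTERNS = {
    "rechecking":  ["let me check", "verify", "double-check"],
    "hesitation":  ["wait", "hmm", "actually"],
    "exploration": ["alternatively", "another approach"],
}

def count_reflective_patterns(responses):
    """Count how many responses contain each reflective pattern,
    and record the average response length in characters."""
    counts = {name: 0 for name in PATTERNS}
    for text in responses:
        lower = text.lower()
        for name, markers in PATTERNS.items():
            if any(m in lower for m in markers):
                counts[name] += 1
    avg_len = sum(len(r) for r in responses) / max(len(responses), 1)
    return counts, avg_len

responses = [
    "Wait, let me check that step again.",
    "The answer is 4.",
]
print(count_reflective_patterns(responses))
```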
Early in training, the model showed almost no proactive checking or reflection on previous reasoning steps. However, as training progressed, it exhibited significant reflective and backtracking behaviors. These formed response patterns such as step-by-step reasoning, numerical substitution, stepwise verification, and self-optimization.
Interestingly, the model also learned to spontaneously use program code to verify mathematical problem solutions. It would first provide a solution through mathematical reasoning and then proactively write code to check the correctness of that solution. These examples demonstrated the model’s ability to apply procedural thinking for self-correction and multiple attempts. This further indicated that in later training stages, the model had mastered broad thinking and the integrated use of various code-based reasoning methods for problem-solving.
The paper titled “SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM” is available on arXiv. The SRPO-Qwen-32B model can be tried on HuggingFace.
