Reinforcement Learning (RL) has demonstrated its potential to elicit the reasoning ability of pre-trained Large Language Models (LLMs) via thinking and reflection during post-training. Despite the strength of RL-empowered self-improvement, it comes at a significant cost in compute and time. One prominent limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not well utilized. In this paper, we launch the renaissance of off-policy RL and propose Reincarnated Mix-policy Proximal Policy Gradient (ReMix), a general approach that enables on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio, which leverages data generated by both the current policy and past policies for efficient training; (2) a KL-Convex policy constraint, which combines the KL constraints on the base model and the precedent model to balance the trade-off between stability and flexibility during training; (3) Policy Reincarnation, which replaces the base model with the mix-policy RFT model and restarts on-policy training to achieve a seamless transition from efficient early-stage learning to stable asymptotic improvement. In our experiments, we train a series of ReMix models built upon PPO and GRPO with 1.5B and 7B base models. ReMix achieves an average Pass@1 accuracy of 52.00% (for the 1.5B model) within 500 training steps and 63.27% (for the 7B model) within 50 training steps on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Moreover, ReMix reaches SOTA-level performance with an over 20x to 450x reduction in training cost in terms of rollout data volume, demonstrating superior training efficiency. In addition, we reveal insightful findings via multifaceted analysis, including the relationship between off-policy learning and the training dynamics of reasoning behaviors, performance under response length constraints, the impact of prompt format, etc.
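To make components (1) and (2) concrete, below is a minimal PyTorch sketch of a mix-policy clipped surrogate loss with a KL-convex penalty. It is an illustration under our own assumptions, not the authors' implementation: the function and argument names, the per-token KL approximation, and the default values of clip_eps, kl_coef, and lam are all placeholders.

```python
# Minimal sketch (illustrative, not the paper's code) of a mix-policy clipped
# surrogate loss with a KL-convex penalty over per-token log-probabilities.
import torch

def mix_policy_loss(
    logp_new,        # log-probs under the current policy pi_theta, shape [T]
    logp_behavior,   # log-probs under the policy that generated the rollout
                     # (the current policy for fresh data, a past policy for
                     # replayed off-policy data), shape [T]
    advantages,      # advantage estimates (e.g., GAE or group-relative), shape [T]
    logp_base,       # log-probs under the frozen base/reference model, shape [T]
    logp_prev,       # log-probs under the precedent (previous) policy, shape [T]
    clip_eps=0.2,    # PPO-style clipping range (assumed value)
    kl_coef=0.01,    # overall KL penalty weight (assumed value)
    lam=0.5,         # convex mixing weight between the two KL terms (assumed)
):
    # Importance ratio w.r.t. whichever policy generated the data; for replayed
    # off-policy samples this ratio can drift far from 1, which the clipping
    # below guards against.
    ratio = torch.exp(logp_new - logp_behavior)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL-convex penalty: a convex combination of the KL to the base model
    # (stability) and the KL to the precedent policy (flexibility), estimated
    # here with a crude per-token approximation.
    kl_base = (logp_new - logp_base).mean()
    kl_prev = (logp_new - logp_prev).mean()
    kl_penalty = lam * kl_base + (1.0 - lam) * kl_prev

    return surrogate + kl_coef * kl_penalty
```

The single lam knob captures the stability/flexibility trade-off described above: lam close to 1 anchors the policy to the base model, while lam close to 0 only constrains the step away from the previous policy.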
Starting from a base model: (1) the prevalent on-policy PPG methods (e.g., PPO, GRPO) yield a stable and effective training process, yet exhibit inefficient data utilization (i.e., the wavy orange curve). (2) Off-policy PPG offers appealing potential in data efficiency; however, naively adopting off-policy PPG leads to a training collapse (i.e., the less wavy green curve). (3) To strike a balance, we introduce Mix-PPG, which manages to boost early-stage performance but still suffers from slow asymptotic improvement (denoted by the cyan curve) and even collapses when adopting a high UTD ratio (i.e., the straight olive-green curve). (4) To this end, we propose policy reincarnation and introduce ReMix. ReMix seamlessly takes advantage of both the efficient early-stage training of Mix-PPG and the stable asymptotic improvement of on-policy PPG (i.e., the fusion of the cyan and red curves), thereby achieving significantly better efficiency with almost no compromise in final performance. A schematic of this two-stage schedule is sketched below.
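The following toy schedule contrasts the mix-policy stage with the post-reincarnation on-policy stage. It is a structural illustration only: collect_rollouts, update, the buffer size, and the step counts are stand-ins we invented for readability, not the authors' API or hyperparameters.

```python
# Toy two-stage schedule (illustrative only) for Mix-PPG followed by policy
# reincarnation and plain on-policy training.
from collections import deque

def collect_rollouts(step):
    return {"step": step}                      # placeholder rollout batch

def update(policy_state, batch, ref_state):
    policy_state["updates"] += 1               # placeholder gradient step

def train_remix(total_steps=10, reincarnation_step=6, utd_ratio=4):
    policy = {"updates": 0}
    base = dict(policy)                        # frozen base/reference model
    replay = deque(maxlen=8)                   # buffer of past rollouts
    prev = dict(policy)                        # precedent policy snapshot
    for step in range(total_steps):
        batch = collect_rollouts(step)
        if step < reincarnation_step:
            # Stage 1 (Mix-PPG): reuse past rollouts with a high UTD ratio,
            # constrained toward both the base model and the previous policy.
            replay.append(batch)
            for _ in range(utd_ratio):
                mixed = list(replay) + [batch]
                update(policy, mixed, base)
        else:
            if step == reincarnation_step:
                # Policy reincarnation: the mix-policy model becomes the new
                # base, the buffer is dropped, and training restarts on-policy.
                base = dict(policy)
                replay.clear()
            update(policy, batch, base)
        prev = dict(policy)
    return policy

print(train_remix())
```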
@article{liang2025squeezesoakedspongeefficient,
title={Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model},
author={Jing Liang and Hongyao Tang and Yi Ma and Jinyi Liu and Yan Zheng and Shuyue Hu and Lei Bai and Jianye Hao},
journal={arXiv preprint arXiv:2507.06892},
url={https://arxiv.org/abs/2507.06892},
year={2025}
}