The Mirage of Optimizing Training Policies:
Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

1Tianjin University    2Alibaba
*Equal contribution    Corresponding authors
MIPU overview

MIPU resolves the objective misalignment of LLM RL by optimizing the inference policy instead of the training policy.

Abstract

Reinforcement learning (RL) has gained growing attention in large language model (LLM) post-training, yet RL training remains fragile and can suffer from instability or collapse. One vital cause is training-inference mismatch: LLM RL adopts separate inference and training engines for generation efficiency and training precision, which in practice exhibit inconsistent probabilities for the same trajectories on training and inference sides, even with synchronized model parameters. This naturally induces a special type of off-policyness that persistently exists and poisons the training process.

Prior works have made various efforts to address this off-policyness and stabilize the training policies under the mismatch. In this paper, we point out an overlooked issue of objective misalignment: an effective update to the policy in the training engine does not necessarily ensure the improvement of the inference policy, i.e., the policy actually used during deployment.

To this end, we propose a new policy optimization objective for LLM RL, named Monotonic Inference Policy Improvement (MIPI). Following this principle, we introduce Monotonic Inference Policy Update (MIPU), a two-step LLM RL framework that constructs sampler-referenced candidate updates and selectively accepts synchronized candidates using an inference-side gap proxy. Experiments conducted on two model scales under high mismatch demonstrate that MIPU improves both average reasoning performance and training stability.

What is Training-Inference Mismatch?

Modern LLM RL separates rollout generation from gradient computation. Responses are sampled by an inference engine (e.g., vLLM, SGLang) for efficiency, while log-probabilities and updates are computed by a training engine (e.g., FSDP, Megatron) for precision.

Even when model parameters are synchronized, implementation differences in precision, decoding, or serving backends make the training policy $\textcolor{blue}{\pi}$ and inference policy $\textcolor{red}{\mu}$ assign different probabilities to the same trajectories:

$$\textcolor{blue}{\pi_k}(y \mid x) \neq \textcolor{red}{\mu_k}(y \mid x) \quad \text{even with the same parameters.}$$

This mismatch creates a special, ever-present form of off-policyness that poisons the training process.

Objective Misalignment

Prior works reduce mismatch by correcting sampler-side ratios, filtering unstable samples, decaying learning rates, or narrowing system-level discrepancies. They aim to stabilize the training policy.

But because the mismatch always exists, improving the training policy does not guarantee improving the inference policy:

$$J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{blue}{\pi_k}) \geq 0 \quad \nRightarrow \quad J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}) \geq 0.$$

The right question: Whether synchronizing this update actually produces a better inference policy?

MIPI: Monotonic Inference Policy Improvement

We propose to take the inference-policy improvement as the direct optimization objective:

$$J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}).$$

We decompose this objective into three interpretable terms:

$$J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}) = \underbrace{J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{blue}{\pi_{k+1}})}_{\text{① post-update inference gap}} + \underbrace{J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{blue}{\pi_k})}_{\text{② training-side update}} + \underbrace{J(\textcolor{blue}{\pi_k}) - J(\textcolor{red}{\mu_k})}_{\text{③ pre-update inference gap}}.$$

Terms ② and ③ form a sampler-referenced policy update (Step 1), while Term ① is used for inference-gap-aware acceptance (Step 2).

MIPU: Two-Step Realization

1Step 1: Sampler-Referenced Policy Update

Step 1 targets Terms \textcircled{2}+\textcircled{3}, which telescope to $J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{red}{\mu_k})$. The key ratio is factorized as:

$$\rho_i(\theta) = \underbrace{\frac{\textcolor{blue}{\pi_k}(y_i \mid x)}{\textcolor{red}{\mu_k}(y_i \mid x)}}_{w_i^k:\ \text{pre-update mismatch}} \cdot \underbrace{\frac{\textcolor{blue}{\pi_\theta}(y_i \mid x)}{\textcolor{blue}{\pi_k}(y_i \mid x)}}_{r_i(\theta):\ \text{current update}}.$$

We adopt a Truncated Importance Sampling (TIS) update that clips only the current-update ratio $r_i(\theta)$, while truncating the pre-update mismatch weight $w_i^k$ to control variance:

$$\bar{w}_i^k = \min\left(w_i^k, w_{\max}\right), \quad w_{\max}=2.$$

This produces a candidate training policy $\textcolor{blue}{\pi_{k+1}}$ that is referenced to the sampler $\textcolor{red}{\mu_k}$ rather than to the old training policy.

Step 1 analysis

Step 1 analysis: TIS provides the best stability-performance trade-off among sampler-referenced variants.

2Step 2: Inference-Gap-Aware Update Acceptance

After synchronizing $\textcolor{blue}{\pi_{k+1}} \to \textcolor{red}{\mu_{k+1}}$, Step 2 estimates the post-update inference gap:

$$\widehat{T}_{\mathrm{post}} = -\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{val}},\, y_i \sim \textcolor{red}{\mu_{k+1}}}\left[\rho_i \, \hat{A}^{\textcolor{red}{\mu_{k+1}}}_i\right], \quad \rho_i = \exp\left(\frac{1}{T_i}\sum_{t=1}^{T_i} \log \frac{\textcolor{blue}{\pi_{k+1}}(y_{i,t}\mid x,y_{i,<t})}{\textcolor{red}{\mu_{k+1}}(y_{i,t}\mid x,y_{i,<t})}\right).$$

The candidate is accepted if $\widehat{T}_{\mathrm{post}} \geq -c$, and rolled back otherwise. This ensures that training-side gains are actually realized by the inference engine.

Inference gap signal

The post-update gap proxy carries a meaningful inference-side signal across model scales.

Results

We evaluate MIPU under FP8-quantized rollout, a high-mismatch setting where quantized inference amplifies the gap between $\pi$ and $\mu$. Experiments are conducted on Qwen3-1.7B and Qwen3-4B across five mathematical reasoning benchmarks.

Main Results

Model Method MATH AIME Olympiad Minerva AMC23 Avg. Stable
Qwen3-4B Baseline 89.34 42.00 64.89 43.39 82.50 64.42
MIS 90.95 38.44 62.50 44.12 81.09 63.42
LR-decay 90.34 44.00 67.26 43.75 82.97 65.66
Ours 91.15 43.56 67.86 45.96 85.00 66.71
Qwen3-1.7B Baseline 83.10 25.33 56.55 31.68 57.66 50.86
MIS 81.29 24.67 58.33 34.19 60.16 51.73
LR-decay 82.09 26.00 58.93 28.68 65.47 52.23
Ours 86.52 24.67 59.52 33.82 65.31 53.97

Pass@1 accuracy (%) under FP8-quantized rollout. MIPU achieves the best average performance and remains stable.

Training score curve

Compared methods reach reasonable intermediate performance but later degrade; MIPU maintains a stable trajectory.

Ablation Study

Method MATH 500 AIME 24 Olympiad Minerva AMC23 Avg.
Baseline 89.34 42.00 64.89 43.39 82.50 64.42
+ Step 1 90.34 41.11 68.45 44.85 82.03 65.36
+ Step 2 90.34 40.44 64.88 43.38 75.00 62.81
Ours 91.15 43.56 67.86 45.96 85.00 66.71

Ablation on Qwen3-4B FP8-quantized rollout. Step 1 and Step 2 are complementary.

Ablation training curves

Step 1 improves the candidate update direction, while Step 2 filters unreliable synchronized candidates.

Step 2 is Not Random Rollback

We compare Step 2 against a random rollback baseline with the same rejection rate. Random rollback rejects more updates but still collapses, showing that Step 2 benefits from conditioning acceptance on the inference-gap signal rather than merely sparsifying updates.

Step 2 vs random rollback

Random rollback is more conservative but still collapses; Step 2 uses the inference-gap signal to filter harmful candidates.

Takeaway

Training-inference mismatch should not be treated only as a low-level system discrepancy. It changes the very objective of policy improvement: the policy we care about is the one used for deployment, not the one inside the training engine. MIPU turns this insight into a practical two-step algorithm and demonstrates improved reasoning performance and training stability under high mismatch.

BibTeX

@article{liang2026mipu,
  title={The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning},
  author={Liang, Jing and Tang, Hongyao and Ma, Yi and He, Yancheng and Wang, Weixun and Li, Xiaoyang and Huang, Ju and Su, Wenbo and Liu, Jinyi and Zheng, Yan and Hao, Jianye and Zheng, Bo},
  journal={arXiv preprint arXiv:2606.29526},
  year={2026},
  url={https://anitaleungxx.github.io/MIPU/}
}