The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Liang, Jing; Tang, Hongyao; Ma, Yi; He, Yancheng; Wang, Weixun; Li, Xiaoyang; Huang, Ju; Su, Wenbo; Liu, Jinyi; Zheng, Yan; Hao, Jianye; Zheng, Bo

The Mirage of Optimizing Training Policies:
Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Jing Liang^*1,2, Hongyao Tang^*1, Yi Ma^*1, Yancheng He^†2, Weixun Wang^†2, Xiaoyang Li²,
Ju Huang², Wenbo Su², Jinyi Liu¹, Yan Zheng^†1, Jianye Hao^†1, Bo Zheng²

¹Tianjin University ²Alibaba
^*Equal contribution ^†Corresponding authors

Paper arXiv

MIPU resolves the objective misalignment of LLM RL by optimizing the inference policy instead of the training policy.

Abstract

Reinforcement learning (RL) has gained growing attention in large language model (LLM) post-training, yet RL training remains fragile and can suffer from instability or collapse. One vital cause is training-inference mismatch: LLM RL adopts separate inference and training engines for generation efficiency and training precision, which in practice exhibit inconsistent probabilities for the same trajectories on training and inference sides, even with synchronized model parameters. This naturally induces a special type of off-policyness that persistently exists and poisons the training process.

Prior works have made various efforts to address this off-policyness and stabilize the training policies under the mismatch. In this paper, we point out an overlooked issue of objective misalignment: an effective update to the policy in the training engine does not necessarily ensure the improvement of the inference policy, i.e., the policy actually used during deployment.

To this end, we propose a new policy optimization objective for LLM RL, named Monotonic Inference Policy Improvement (MIPI). Following this principle, we introduce Monotonic Inference Policy Update (MIPU), a two-step LLM RL framework that constructs sampler-referenced candidate updates and selectively accepts synchronized candidates using an inference-side gap proxy. Experiments conducted on two model scales under high mismatch demonstrate that MIPU improves both average reasoning performance and training stability.

What is Training-Inference Mismatch?

Modern LLM RL separates rollout generation from gradient computation. Responses are sampled by an inference engine (e.g., vLLM, SGLang) for efficiency, while log-probabilities and updates are computed by a training engine (e.g., FSDP, Megatron) for precision.

Even when model parameters are synchronized, implementation differences in precision, decoding, or serving backends make the training policy $\textcolor{blue}{\pi}$ and inference policy $\textcolor{red}{\mu}$ assign different probabilities to the same trajectories:

$$\textcolor{blue}{\pi_k}(y \mid x) \neq \textcolor{red}{\mu_k}(y \mid x) \quad \text{even with the same parameters.}$$

This mismatch creates a special, ever-present form of off-policyness that poisons the training process.

Objective Misalignment

Prior works reduce mismatch by correcting sampler-side ratios, filtering unstable samples, decaying learning rates, or narrowing system-level discrepancies. They aim to stabilize the training policy.

But because the mismatch always exists, improving the training policy does not guarantee improving the inference policy:

$$J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{blue}{\pi_k}) \geq 0 \quad \nRightarrow \quad J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}) \geq 0.$$

The right question: Whether synchronizing this update actually produces a better inference policy?

MIPI: Monotonic Inference Policy Improvement

We propose to take the inference-policy improvement as the direct optimization objective:

$$J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}).$$

We decompose this objective into three interpretable terms:

$$J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{red}{\mu_k}) = \underbrace{J(\textcolor{red}{\mu_{k+1}}) - J(\textcolor{blue}{\pi_{k+1}})}_{\text{① post-update inference gap}} + \underbrace{J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{blue}{\pi_k})}_{\text{② training-side update}} + \underbrace{J(\textcolor{blue}{\pi_k}) - J(\textcolor{red}{\mu_k})}_{\text{③ pre-update inference gap}}.$$

Terms ② and ③ form a sampler-referenced policy update (Step 1), while Term ① is used for inference-gap-aware acceptance (Step 2).

MIPU: Two-Step Realization

1Step 1: Sampler-Referenced Policy Update

Step 1 targets Terms \textcircled{2}+\textcircled{3}, which telescope to $J(\textcolor{blue}{\pi_{k+1}}) - J(\textcolor{red}{\mu_k})$. The key ratio is factorized as:

$$\rho_i(\theta) = \underbrace{\frac{\textcolor{blue}{\pi_k}(y_i \mid x)}{\textcolor{red}{\mu_k}(y_i \mid x)}}_{w_i^k:\ \text{pre-update mismatch}} \cdot \underbrace{\frac{\textcolor{blue}{\pi_\theta}(y_i \mid x)}{\textcolor{blue}{\pi_k}(y_i \mid x)}}_{r_i(\theta):\ \text{current update}}.$$

We adopt a Truncated Importance Sampling (TIS) update that clips only the current-update ratio $r_i(\theta)$, while truncating the pre-update mismatch weight $w_i^k$ to control variance:

$$\bar{w}_i^k = \min\left(w_i^k, w_{\max}\right), \quad w_{\max}=2.$$

This produces a candidate training policy $\textcolor{blue}{\pi_{k+1}}$ that is referenced to the sampler $\textcolor{red}{\mu_k}$ rather than to the old training policy.

Step 1 analysis: TIS provides the best stability-performance trade-off among sampler-referenced variants.

2Step 2: Inference-Gap-Aware Update Acceptance

After synchronizing $\textcolor{blue}{\pi_{k+1}} \to \textcolor{red}{\mu_{k+1}}$, Step 2 estimates the post-update inference gap:

$$\widehat{T}_{\mathrm{post}} = -\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{val}},\, y_i \sim \textcolor{red}{\mu_{k+1}}}\left[\rho_i \, \hat{A}^{\textcolor{red}{\mu_{k+1}}}_i\right], \quad \rho_i = \exp\left(\frac{1}{T_i}\sum_{t=1}^{T_i} \log \frac{\textcolor{blue}{\pi_{k+1}}(y_{i,t}\mid x,y_{i,<t})}{\textcolor{red}{\mu_{k+1}}(y_{i,t}\mid x,y_{i,<t})}\right).$$

The candidate is accepted if $\widehat{T}_{\mathrm{post}} \geq -c$, and rolled back otherwise. This ensures that training-side gains are actually realized by the inference engine.

The post-update gap proxy carries a meaningful inference-side signal across model scales.

Results

We evaluate MIPU under FP8-quantized rollout, a high-mismatch setting where quantized inference amplifies the gap between $\pi$ and $\mu$. Experiments are conducted on Qwen3-1.7B and Qwen3-4B across five mathematical reasoning benchmarks.

Main Results

Model	Method	MATH	AIME	Olympiad	Minerva	AMC23	Avg.	Stable
Qwen3-4B	Baseline	89.34	42.00	64.89	43.39	82.50	64.42	✘
	MIS	90.95	38.44	62.50	44.12	81.09	63.42	✘
	LR-decay	90.34	44.00	67.26	43.75	82.97	65.66	✘
	Ours	91.15	43.56	67.86	45.96	85.00	66.71	✔
Qwen3-1.7B	Baseline	83.10	25.33	56.55	31.68	57.66	50.86	✘
	MIS	81.29	24.67	58.33	34.19	60.16	51.73	✘
	LR-decay	82.09	26.00	58.93	28.68	65.47	52.23	✘
	Ours	86.52	24.67	59.52	33.82	65.31	53.97	✔

Pass@1 accuracy (%) under FP8-quantized rollout. MIPU achieves the best average performance and remains stable.

Compared methods reach reasonable intermediate performance but later degrade; MIPU maintains a stable trajectory.

Ablation Study

Method	MATH 500	AIME 24	Olympiad	Minerva	AMC23	Avg.
Baseline	89.34	42.00	64.89	43.39	82.50	64.42
+ Step 1	90.34	41.11	68.45	44.85	82.03	65.36
+ Step 2	90.34	40.44	64.88	43.38	75.00	62.81
Ours	91.15	43.56	67.86	45.96	85.00	66.71

Ablation on Qwen3-4B FP8-quantized rollout. Step 1 and Step 2 are complementary.

Step 1 improves the candidate update direction, while Step 2 filters unreliable synchronized candidates.

Step 2 is Not Random Rollback

We compare Step 2 against a random rollback baseline with the same rejection rate. Random rollback rejects more updates but still collapses, showing that Step 2 benefits from conditioning acceptance on the inference-gap signal rather than merely sparsifying updates.

Random rollback is more conservative but still collapses; Step 2 uses the inference-gap signal to filter harmful candidates.

Takeaway

Training-inference mismatch should not be treated only as a low-level system discrepancy. It changes the very objective of policy improvement: the policy we care about is the one used for deployment, not the one inside the training engine. MIPU turns this insight into a practical two-step algorithm and demonstrates improved reasoning performance and training stability under high mismatch.

BibTeX

@article{liang2026mipu,
  title={The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning},
  author={Liang, Jing and Tang, Hongyao and Ma, Yi and He, Yancheng and Wang, Weixun and Li, Xiaoyang and Huang, Ju and Su, Wenbo and Liu, Jinyi and Zheng, Yan and Hao, Jianye and Zheng, Bo},
  journal={arXiv preprint arXiv:2606.29526},
  year={2026},
  url={https://anitaleungxx.github.io/MIPU/}
}

The Mirage of Optimizing Training Policies:Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning