Figure 1: The annealing schedule with different decay rates $d$. A larger $d$ slows the cooling, front-loading exploration over more tokens. We set $c=10, \tau_{\mathrm{max}}=1.2, \tau_{\mathrm{min}}=0.1$ for illustration.

Exploratory Annealed Decoding (EAD) for Reinforcement Learning with Verifiable Rewards

Reinforcement learning with verifiable rewards (RLVR) is a powerful approach to enhance the capabilities of Large Language Models (LLMs) in domains such as mathematical reasoning and code generation. However, achieving effective exploration while preserving sample quality and ensuring training stability remains a fundamental challenge.

We propose Exploratory Annealed Decoding (EAD), a simple yet effective strategy that addresses this challenge by leveraging a key insight into sequential generation: exploration is not equally valuable at every step. The initial tokens shape a sequence's semantic direction and structure, making early exploration crucial for discovering diverse valid solutions. Later tokens, however, fill in details within the established context, where excessive exploration can harm coherence.

Our core strategy is: explore at the beginning, exploit at the end. EAD implements an intuitive temperature annealing schedule that:

  • Starts with high temperature ($\tau > 1$) to encourage diverse exploration of solution paths
  • Gradually cools to lower temperatures to ensure coherent, high-quality completions
  • Maintains proximity to the target policy for stable off-policy learning

EAD is a plug-and-play enhancement that improves sample efficiency over fixed-temperature sampling, delivering robust gains across various RLVR algorithms, including GRPO, DAPO, and EntropyMech, on both small and large models.

Our contributions include: (1) proposing EAD as a simple and effective exploration strategy for RLVR, (2) demonstrating its broad applicability across different RL algorithms and model sizes, and (3) showing that EAD can also enhance inference-time generation quality.

Method: Dynamic Temperature Annealing

To put the principle of "explore early, exploit late" into practice, we introduce Exploratory Annealed Decoding (EAD), which uses an annealed temperature schedule starting from a higher-than-standard initial temperature (i.e., $\tau>1$).

Exploratory Annealed Decoding

Instead of a fixed temperature, our method dynamically adjusts the temperature $\tau_t$ for each token $t$ in a rollout. The schedule starts at a high temperature $\tau_\mathrm{max} > 1$ and decreases progressively throughout the generation process.

$$\tau_t = \max\{1 + \tau_\mathrm{max} - e^{t/d}, \tau_\mathrm{min}\}$$

where $t$ is the token index and $d$ is the decay rate controlling the annealing speed. The decay rate determines how long the policy remains in a high-exploration state: a larger $d$ front-loads exploration across more initial tokens, while a smaller $d$ transitions to exploitation more quickly.
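
For concreteness, the sketch below implements this per-token schedule inside a sampling loop. It is a minimal sketch, not the exact training-time implementation: `logits_fn` is a stand-in for a model forward pass, and the default hyperparameters simply mirror the illustrative settings above.

```python
import math
import torch

def ead_temperature(t: int, tau_max: float = 1.2, tau_min: float = 0.1,
                    d: float = 25.0) -> float:
    """Annealed temperature tau_t = max(1 + tau_max - e^{t/d}, tau_min).

    At t = 0 this equals tau_max; it then decays toward tau_min as t grows.
    """
    return max(1.0 + tau_max - math.exp(t / d), tau_min)

def sample_with_ead(logits_fn, max_new_tokens: int,
                    tau_max: float = 1.2, tau_min: float = 0.1, d: float = 25.0):
    """Sampling loop where each decoding step uses the annealed temperature.

    `logits_fn(prefix)` is a stand-in for a model forward pass that returns
    next-token logits (a 1-D tensor over the vocabulary) for the current prefix.
    """
    prefix = []
    for t in range(max_new_tokens):
        logits = logits_fn(prefix)                       # [vocab_size]
        tau_t = ead_temperature(t, tau_max, tau_min, d)  # explore early, exploit late
        probs = torch.softmax(logits / tau_t, dim=-1)
        prefix.append(torch.multinomial(probs, 1).item())
    return prefix
```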

Global-Step-Aware Decay Rate

As training progresses and response lengths increase, the decay rate $d$ should grow with the training step so that the high-exploration phase keeps pace with longer responses. We adopt the following global-step-aware decay rate, where $s$ is the global training step and $d_0$ is the initial decay rate:

$$d_s = \min(d_0 + 5s, 40000)$$
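
As a sketch, the helper below follows this formula directly; the function name and the example $d_0$ are illustrative, not values prescribed by the method.

```python
def step_aware_decay(step: int, d0: float = 25.0) -> float:
    """Global-step-aware decay rate d_s = min(d_0 + 5s, 40000)."""
    return min(d0 + 5 * step, 40_000)

# At rollout time, the current decay rate feeds the per-token schedule, e.g.
# tau_t = ead_temperature(t, d=step_aware_decay(global_step))
```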

Ensuring Stability with Truncated Importance Sampling

With aggressive annealing schedules, sampling low-probability, long-tail tokens can cause the annealed policy to deviate significantly from the one being optimized. To mitigate this, we employ truncated importance sampling (TIS) to correct the objective, ensuring stable optimization even under highly exploratory schedules.
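
The snippet below sketches the TIS correction on per-token log-probabilities. The clipping threshold and the way the weights enter the loss are illustrative placeholders, since the exact objective follows the underlying RL algorithm.

```python
import torch

def tis_weights(logp_target: torch.Tensor, logp_behavior: torch.Tensor,
                clip_c: float = 2.0) -> torch.Tensor:
    """Truncated importance-sampling weights min(pi_target / pi_behavior, C).

    `logp_target` holds per-token log-probs under the policy being optimized,
    `logp_behavior` under the annealed rollout policy. The weights are detached
    so they rescale the gradient without being differentiated through.
    """
    ratio = torch.exp(logp_target - logp_behavior)
    return torch.clamp(ratio, max=clip_c).detach()

# Example: reweight a per-token policy-gradient term before averaging.
# loss = -(tis_weights(logp_target, logp_behavior) * advantages * logp_target).mean()
```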

Overall, this annealed decoding strategy offers a compelling combination of effectiveness and efficiency. As a plug-and-play modification to standard temperature sampling, it incurs negligible computational overhead and is fully compatible with existing RLVR pipelines.

EAD Improves RLVR Training

EAD Improves RL Exploration and Training Efficiency

Figure 2: Pass@16 and Worst@16 performance during RL training. EAD improves exploration of high-quality samples (even its worst samples outperform temperature sampling), but the gain diminishes over time; truncated importance sampling corrects the off-policy bias and sustains training.

As shown in the figure above, EAD significantly improves training efficiency. For Pass@16 accuracy, EAD (w/o TIS) consistently outperforms the baselines on the Llama and Qwen models, demonstrating more effective exploration. Under the stricter Worst@16 metric, the inclusion of TIS becomes essential for maintaining stable performance gains.

Figure 3: Pass@16 performance on Qwen-2.5-Math-7B. EAD enables better exploration than fixed-temperature sampling, yielding sustained gains in Pass@16 throughout training.

To verify that our method generalizes, we evaluated it on the larger Qwen-2.5-Math-7B model. The results confirm that the performance gains from EAD remain significant, demonstrating that our approach is effective on smaller models and scales to larger ones.

EAD Mitigates Entropy Collapse

Figure 4: Entropy dynamics during RL training. Under commonly used temperature sampling, RL training drives entropy monotonically downward, sharply shrinking the exploration space from the start. With EAD, the policy can escape local optima and explore when needed in the middle of RL training.

One major problem in RLVR training is entropy collapse, which shrinks the exploration space and constrains improvement during the "plateau stage". As shown in the figure, the entropy of EAD-empowered methods does not decrease monotonically from the beginning; instead, the policy gradually transitions out of local optima in a natural, continuous way.

Sample Efficiency of EAD

Figure 5: EAD brings further performance improvements as the number of rollouts increases, but the commonly used 4 or 8 rollouts already performs well.

Increasing the number of rollouts is a common but computationally expensive strategy to enhance exploration. We test the sample efficiency of EAD by varying the number of rollouts. As shown, while more rollouts can further improve performance, EAD achieves strong results with just 4 or 8 rollouts, highlighting the sample efficiency of our approach.

EAD Improves Inference-Time Scaling

Figure 6: Inference-time scaling evaluation for different decoding methods using off-the-shelf Qwen2.5 models. EAD improves over traditional temperature sampling. We set $\tau_{\mathrm{max}}=1.2, \tau_{\mathrm{min}}=0.1, d=25$ for EAD.

To understand whether the success of EAD in RL training is driven by its ability to generate high-quality samples, we conduct an evaluation at inference time. Using off-the-shelf Qwen-2.5 models without any fine-tuning, we compare EAD against fixed-temperature sampling. We use majority voting ($\text{Majority}@N$) to measure how performance scales with the number of samples $N$.
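
For reference, a minimal sketch of the Majority@N computation; answer extraction and tie-breaking are simplified assumptions here, not the exact evaluation code.

```python
from collections import Counter

def majority_at_n(answers: list[str], reference: str) -> bool:
    """Majority@N: is the most frequent of N sampled final answers correct?

    `answers` are the final answers extracted from N completions; ties are
    broken by whichever answer appears first, which is a simplification.
    """
    voted, _ = Counter(answers).most_common(1)[0]
    return voted == reference

# Example: three of four samples agree on the correct answer.
print(majority_at_n(["42", "42", "41", "42"], reference="42"))  # True
```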

As shown in the figure, EAD consistently improves over the baseline for most values of $N$. This result confirms that EAD's advantage stems from its inherent capacity to discover higher-quality solutions, even without any training.

EAD is Compatible with Various RL Algorithms

Figure 7: EAD is compatible with various RL algorithms and significantly improves model performance over the course of training.

To demonstrate that EAD is a general, plug-and-play exploration strategy, we evaluate its performance when integrated into two other prominent RL algorithms: GRPO and EntropyMech. These algorithms provide diverse testbeds: GRPO is more conservative, constraining policy updates with a KL divergence penalty, while EntropyMech uses a specialized token-clipping mechanism to mitigate entropy collapse.

As shown in the figure, EAD consistently outperforms fixed-temperature sampling in both frameworks. These results confirm the broad applicability of our method as an improved exploration strategy across different RL algorithms.