Exploratory Annealed Decoding (EAD) for Reinforcement Learning with Verifiable Rewards
Reinforcement learning with verifiable rewards (RLVR) is a powerful approach to enhance the capabilities of Large Language Models (LLMs) in domains such as mathematical reasoning and code generation. However, achieving effective exploration while preserving sample quality and ensuring training stability remains a fundamental challenge.
We propose Exploratory Annealed Decoding (EAD), a simple yet effective strategy that addresses this challenge by leveraging a key insight into sequential generation: exploration is not equally valuable at every step. The initial tokens shape a sequence's semantic direction and structure, making early exploration crucial for discovering diverse valid solutions. Later tokens, however, fill in details within the established context, where excessive exploration can harm coherence.
Our core strategy is simple: explore at the beginning, exploit at the end. EAD implements an intuitive temperature annealing schedule that:
- Starts with high temperature ($\tau > 1$) to encourage diverse exploration of solution paths
- Gradually cools to lower temperatures to ensure coherent, high-quality completions
- Maintains proximity to the target policy for stable off-policy learning
EAD is a plug-and-play enhancement that improves sample efficiency over fixed-temperature sampling, delivering robust gains across various RLVR algorithms including GRPO, DAPO, and EntropyMech on both small and larger models.
Our contributions include: (1) proposing EAD as a simple and effective exploration strategy for RLVR, (2) demonstrating its broad applicability across different RL algorithms and model sizes, and (3) showing that EAD can also enhance inference-time generation quality.
Method: Dynamic Temperature Annealing
To put the principle of "explore early, exploit late" into practice, we introduce Exploratory Annealed Decoding (EAD), which uses an annealed temperature schedule starting from a higher-than-standard initial temperature (i.e., $\tau>1$).
Exploratory Annealed Decoding
Instead of a fixed temperature, our method dynamically adjusts the temperature $\tau_t$ for each token $t$ in a rollout. The schedule starts at a high temperature $\tau_\mathrm{max} > 1$ and decreases progressively throughout the generation process.
$$\tau_t = \max\{1 + \tau_\mathrm{max} - e^{t/d}, \tau_\mathrm{min}\}$$
where the decay rate $d$ controls the annealing speed, i.e., how long the policy remains in a high-exploration state. A larger $d$ spreads exploration across more initial tokens, while a smaller $d$ transitions to exploitation more quickly.
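A minimal PyTorch sketch of the per-token schedule; the hyperparameter values for $\tau_\mathrm{max}$, $\tau_\mathrm{min}$, and $d$ below are illustrative placeholders rather than the settings used in our experiments:

```python
import math
import torch

def ead_temperature(t: int, tau_max: float = 1.5, tau_min: float = 0.7, d: float = 500.0) -> float:
    # tau_t = max(1 + tau_max - exp(t / d), tau_min); equals tau_max at t = 0
    # and anneals toward tau_min as generation proceeds.
    return max(1.0 + tau_max - math.exp(t / d), tau_min)

def sample_next_token(logits: torch.Tensor, t: int) -> torch.Tensor:
    # logits: [vocab_size] scores for the t-th position of the current rollout.
    tau_t = ead_temperature(t)
    probs = torch.softmax(logits / tau_t, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```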
Global-Step-Aware Decay Rate
As training progresses and response lengths increase, the decay rate $d$ should be adjusted in accordance with the training step. We adopt the following global-step-aware decay rate:
$$d_s = \min(d_0 + 5s, 40000)$$
where $s$ is the global training step and $d_0$ is the initial decay rate.
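A one-line sketch of this rule; the value of $d_0$ below is an illustrative placeholder:

```python
def decay_rate(step: int, d0: float = 1000.0) -> float:
    # d_s = min(d0 + 5 * step, 40000); d0 = 1000 is an illustrative placeholder.
    return min(d0 + 5.0 * step, 40000.0)

# Example: at step 2000, d_s = min(1000 + 10000, 40000) = 11000.
```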
Ensuring Stability with Truncated Importance Sampling
With aggressive annealing schedules, sampling low-probability, long-tail tokens can cause the annealed policy to deviate significantly from the one being optimized. To mitigate this, we employ truncated importance sampling (TIS) to correct the objective, ensuring stable optimization even under highly exploratory schedules.
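A minimal sketch of the correction, assuming per-token log-probabilities of the sampled tokens are available under both the annealed sampling policy and the policy being optimized; the truncation threshold below is an illustrative placeholder:

```python
import torch

def tis_weights(logp_target: torch.Tensor, logp_behavior: torch.Tensor, clip_c: float = 2.0) -> torch.Tensor:
    # Per-token importance ratio pi_theta(a_t|s_t) / mu(a_t|s_t), truncated at clip_c
    # and treated as a constant coefficient (stop-gradient).
    ratio = torch.exp(logp_target - logp_behavior)
    return torch.clamp(ratio, max=clip_c).detach()

def tis_corrected_loss(per_token_loss: torch.Tensor, logp_target: torch.Tensor,
                       logp_behavior: torch.Tensor) -> torch.Tensor:
    # Reweight the per-token RLVR objective by the truncated importance weights.
    return (tis_weights(logp_target, logp_behavior) * per_token_loss).mean()
```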
Overall, this annealed decoding strategy offers a compelling combination of effectiveness and efficiency. As a plug-and-play modification to standard temperature sampling, it incurs negligible computational overhead and is fully compatible with existing RLVR pipelines.
EAD Improves RLVR Training
EAD Improves RL Exploration and Training Efficiency
As shown in the figure above, EAD significantly improves training efficiency. For Pass@16 accuracy, EAD (w/o TIS) consistently outperforms the baselines on the Llama and Qwen models, demonstrating more effective exploration. Under the stricter Worst@16 metric, the inclusion of TIS becomes essential for maintaining stable performance gains.
To verify that our method generalizes, we evaluated it on the larger Qwen-2.5-Math-7B model. The results confirm that the performance gains from EAD remain significant, demonstrating that our approach is not only effective on smaller models but also scales successfully to larger ones.
EAD Mitigates Entropy Collapse
One major problem in RLVR training is entropy collapse, which shrinks the exploration space and limits improvement during the "plateau stage". As shown in the figure, the entropy of EAD-empowered methods does not decrease monotonically from the start; instead, the policy gradually moves out of local optima in a natural, continuous way.
Sample Efficiency of EAD
Increasing the number of rollouts is a common but computationally expensive strategy to enhance exploration. We test the sample efficiency of EAD by varying the number of rollouts. As shown, while more rollouts can further improve performance, EAD achieves strong results with just 4 or 8 rollouts, highlighting the sample efficiency of our approach.
EAD Improves Inference-Time Scaling
To understand whether the success of EAD in RL training is driven by its ability to generate high-quality samples, we conduct an evaluation at inference time. Using off-the-shelf Qwen-2.5 models without any fine-tuning, we compare EAD against fixed-temperature sampling. We use majority voting ($\text{Majority}@N$) to measure how performance scales with the number of samples $N$.
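For reference, a minimal sketch of $\text{Majority}@N$, assuming the final answers have already been extracted and normalized from each sampled completion:

```python
from collections import Counter

def majority_at_n(answers: list[str]) -> str:
    # Return the most frequent final answer among the N samples.
    return Counter(answers).most_common(1)[0][0]

# Example: majority_at_n(["42", "41", "42", "42"]) -> "42"
```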
As shown in the figure, EAD consistently improves over the baseline for most values of $N$. This result confirms that EAD's advantage stems from its inherent capacity to discover higher-quality solutions, even without any training.
EAD is Compatible with Various RL Algorithms
To demonstrate that EAD is a general, plug-and-play exploration strategy, we evaluate its performance when integrated into two other prominent RL algorithms: GRPO and EntropyMech. These algorithms provide diverse testbeds: GRPO is more conservative, constraining policy updates with a KL divergence penalty, while EntropyMech uses a specialized token-clipping mechanism to mitigate entropy collapse.
As shown in the figure, EAD consistently outperforms fixed-temperature sampling in both frameworks. These results confirm the broad applicability of our method as an improved exploration strategy across different RL algorithms.