Investigating LLM Probability Concentration via Branching Factor (BF)
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution.
To quantify this concentration, we introduce the Branching Factor (BF)--a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings:
- BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate
- Alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models
We illustrate the concept above. Building on this insight, we find that this stability has surprising implications for complex reasoning.
Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models) leverage this effect by generating longer reasoning chains, pushing generation into later, more deterministic (lower BF) stages, resulting in more stable outputs.
We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF.
Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs--shedding light on:
- How alignment reduces variability
- How CoT promotes stable generations
- How base models can be steered away from diversity
Preliminaries and Notation
Autoregressive Language Models
LLMs are typically trained to predict the next token and the probability of output $P\left(y_{1:N} | x; \theta \right)$ can be decomposed as: $P\left(y_{1:N} | x; \theta \right)=\prod_{t=1}^{N}P\left(y_t | [x, y_{1:t-1}]; \theta \right)$,
where $y_{1:t-1}$ is the output up to position $t-1$, $\theta$ is the model parameter, and $x$ is the prompt.
Each output sample is generated via token-by-token sampling, and generating multiple samples naturally forms a search tree.
Decoding Methods as Truncated Sampling
Though LLMs are trained with a large vocabulary size $|V|$, in many cases the plausible next tokens concentrate on a much smaller subset of $V$ under the distribution $P(y_{t} | x, y_{1:t-1}; \theta)$. Common decoding methods (e.g., temperature scaling and top-$p$/nucleus sampling) exploit this by truncating the distribution to a candidate set $V_t \subseteq V$ and renormalizing:
$$\tilde{P}\left(y_t | [x, y_{1:t-1}]; \theta \right) = \begin{cases} \frac{P\left(y_{t} | [x, y_{1:t-1}]; \theta\right)}{\sum_{y' \in V_t} P\left(y' | [x, y_{1:t-1}]; \theta\right) } & y_{t} \in V_t \\ 0 & \text{otherwise} \end{cases}$$
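A minimal sketch of this truncation-and-renormalization step, assuming temperature scaling followed by top-$p$ (nucleus) truncation over a vector of next-token logits; the function and variable names are illustrative, not the exact implementation used here:

```python
import numpy as np

def truncated_next_token_dist(logits, temperature=0.6, top_p=0.9):
    """Return the truncated distribution \tilde{P}: temperature-scaled, top-p truncated, renormalized."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()

    # Keep the smallest prefix of tokens (sorted by probability) whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1   # size of the kept set V_t

    truncated = np.zeros_like(probs)
    kept = order[:cutoff]
    truncated[kept] = probs[kept]
    return truncated / truncated.sum()                # renormalize over V_t

# Toy example: a sharply peaked distribution keeps very few candidates after truncation.
logits = np.array([8.0, 5.0, 1.0, 0.5, 0.1])
print(truncated_next_token_dist(logits))
```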
Token-wise Conditional Entropy
Since tokens are sampled from the truncated distribution $\tilde{P}$, we use $\tilde{P}$ to compute the token-level conditional entropy $\tilde{H}$ for a given prefix instance $y_{1:t-1}$:
$$\tilde{H}\left(Y_t | [x, y_{1:t-1}]; \theta \right) =-\sum_{y_t} \tilde{P}\left(y_t | [x, y_{1:t-1}]; \theta \right) \log \tilde{P}\left(y_t | [x, y_{1:t-1}]; \theta \right)$$
To generalize, we can compute the expected conditional entropy over the distribution of prefix sequences $Y_{1:t-1}$: $\tilde{H}\left(Y_t | [x, Y_{1:t-1}]; \theta \right)=\mathbb{E}_{y_{1:t-1}}\tilde{H}\left(Y_t | [x, y_{1:t-1}]; \theta \right)$.
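A small sketch of the token-level conditional entropy $\tilde{H}$, computed directly from a truncated next-token distribution such as the one returned above (names are illustrative):

```python
import numpy as np

def token_entropy(p_tilde, eps=1e-12):
    """Entropy (in nats) of a truncated next-token distribution \tilde{P}; 0*log(0) is treated as 0."""
    p = p_tilde[p_tilde > eps]
    return float(-(p * np.log(p)).sum())

# An almost-deterministic step has near-zero entropy ...
print(token_entropy(np.array([0.98, 0.02])))   # ~0.098 nats
# ... while a uniform choice over 4 candidates gives log(4) ~ 1.386 nats.
print(token_entropy(np.ones(4) / 4))
```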
Case Study: Is Decoding Method Still Crucial for Modern LLMs?
Many prevalent decoding methods were introduced before LLMs scaled to billions of parameters and underwent multiple training stages. Additionally, model developers adopt different decoding strategies when reporting LLM capabilities. This raises the question of how much the choice of decoding method still matters for modern LLMs; we revisit it on the STEM subset of MMLU (Table 1).
Models | Default ($T=0.6, p=0.9$) | $T=0.6, p=1.0$ | $T=1.0, p=0.9$ | Min ($T=1.0, p=1.0$) | $\frac{\text{Default}-\text{Min}}{\text{Default}}\%$ |
---|---|---|---|---|---|
Llama-3-70B-Instruct | 78.50 ($\pm$ 2.09) | 77.60 ($\pm$ 2.23) | 77.50 ($\pm$ 2.60) | 75.90 ($\pm$ 2.85) | 3.31 |
Llama-3-70B | 78.00 ($\pm$ 3.52) | 74.00 ($\pm$ 3.80) | 72.00 ($\pm$ 4.38) | 63.50 ($\pm$ 5.02) | 18.59 |
DeepSeek-R1-Distill-Llama-8B | 66.30 ($\pm$ 3.51) | 65.70 ($\pm$ 3.84) | 62.70 ($\pm$ 4.14) | 59.70 ($\pm$ 4.65) | 9.95 |
Llama-3.1-8B-Instruct | 63.00 ($\pm$ 4.01) | 61.50 ($\pm$ 4.37) | 57.50 ($\pm$ 4.92) | 50.50 ($\pm$ 5.34) | 19.84 |
Llama-3.1-8B | 54.00 ($\pm$ 4.61) | 53.50 ($\pm$ 4.92) | 47.00 ($\pm$ 5.21) | 37.00 ($\pm$ 5.48) | 31.48 |
Table 1: Experimental results across decoding configurations on the STEM subset of MMLU. We follow the common practice of 5-shot CoT prompting. $\frac{\text{Default}-\text{Min}}{\text{Default}}\%$ indicates the maximum relative performance drop when deviating from the default decoding configuration.
The results in Table 1 reveal that for aligned models, decoding configurations have a limited impact--typically around 10% (at most about 20%) relative performance change. Among the Llama-3.1-8B-based models, DeepSeek-R1-Distill-Llama-8B, which is trained to generate long CoT, exhibits the smallest relative performance change. In contrast, base models exhibit greater sensitivity, with performance varying by up to 31%. Additionally, lowering the temperature ($T$) generally improves performance across all models more than adjusting the truncation threshold ($p$), though excessive reduction (e.g., greedy decoding at $T=0$) may lead to repetition issues.
Based on these observations and findings in existing literature, we propose the following hypotheses:
Hypothesis 1:
Aligned models produce tokens with a more concentrated distribution than base models
Hypothesis 2:
Larger models have more concentrated distributions compared with smaller models
Hypothesis 3:
As LLMs generate more tokens, their next-token probability distribution becomes increasingly concentrated
Researchers often assess probability concentration using token-level metrics such as entropy or log-likelihood. However, these offer only a narrow lens on model behavior: they capture local properties but miss the global structure of the output space--how probability mass is distributed across plausible sequences. This motivates our proposal of the BF as a structural measure of generative breadth.
Measuring Branching Factor
The generative process of language models can be viewed as moving down a branching tree, with each token choice selecting a path forward. While the full tree spans $O(|V|^N)$ sequences for vocabulary size $|V|$ and sequence length $N$, LLMs concentrate probability mass on a far smaller subset. To capture how many options the model seriously considers at each step, we introduce the Branching Factor (BF). Given $|\mathcal{T}|$ high-probability leaf sequences, we approximate the tree as a balanced $B$-ary tree, where $B = |\mathcal{T}|^{1/N}$. In this section, we describe how to compute $|\mathcal{T}|$ and $B$ in practice.
Intuitive Perspective: Exponentiated Entropy (Perplexity) as Branches
We propose to use the exponentiated entropy (perplexity) to quantify $|\mathcal{T}|$: $|\mathcal{T}| \stackrel{\text{def}}{=} \exp\left({H}(Y_{1:N} | x; \theta)\right)$. This reflects the effective number of equally probable outcomes with the same total uncertainty.
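As a sanity check on this reading, consider a model that is exactly uniform over $k$ equally plausible continuations; the exponentiated entropy then recovers $k$:

$$H = -\sum_{i=1}^{k} \frac{1}{k} \log \frac{1}{k} = \log k \quad\Longrightarrow\quad \exp(H) = k.$$

For non-uniform distributions, $\exp(H)$ interpolates between such cases and can be read as the effective number of branches.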
For short outputs, where it's tractable to sample sufficiently many sequences to closely estimate the conditional entropy at each position, we can estimate the BF by computing the conditional entropy at each position and then aggregating as:
$$B(x; \theta) \approx \exp\left(\frac{1}{M} \sum_{i=1}^M \frac{\sum_{t=1}^{|y^{(i)}|} \tilde{H}(Y_t | [x, y_{1:t-1}^{(i)}]; \theta)}{|y^{(i)}|}\right)$$
where $\tilde{H}(Y_t | [x, y_{1:t-1}^{(i)}]; \theta)$ is the entropy of the distribution at position $t$ for sample $i$.
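A sketch of this short-output estimator, assuming per-position entropies $\tilde{H}(Y_t | [x, y^{(i)}_{1:t-1}])$ have already been collected for each of the $M$ sampled sequences (e.g., with a routine like the entropy function above); the helper name is illustrative:

```python
import numpy as np

def bf_from_entropies(per_token_entropies):
    """Short-output BF estimate: exponentiate the mean length-normalized entropy.

    per_token_entropies: list of 1-D arrays; entry i holds the token-level
    conditional entropy at every position t of sample i.
    """
    avg_entropy_per_sample = [np.mean(h) for h in per_token_entropies]  # sum_t H_t / |y^(i)|
    return float(np.exp(np.mean(avg_entropy_per_sample)))               # average over M samples, then exp

# Example: M = 2 short samples averaging ~0.18 nats/token -> BF ~ 1.2
samples = [np.array([0.30, 0.10, 0.05]), np.array([0.25, 0.20, 0.15])]
print(bf_from_entropies(samples))
```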
Practical BF Estimator via Asymptotic Equipartition Property
While the above approach works well for short outputs, it becomes challenging for longer sequences, where we can only sample a tiny fraction of the exponentially large output space. In such cases, we show that when LLMs generate sufficiently long outputs, the length-averaged log-probability of each sampled sequence is roughly the same and approximates the average output entropy well, following the Asymptotic Equipartition Property (AEP).
Theorem (AEP for LLMs):
Given $0 < \epsilon < 1$, and writing $\bar{H}\left(Y_{1:N} | x; \theta \right)$ for the length-averaged conditional entropy under $\tilde{P}$, we have:
$$\lim_{N \rightarrow \infty}{P\left( \left\lvert -\frac{1}{N}\log \tilde{P}\left(y_{1:N} | x; \theta \right) - {\bar{H}\left(Y_{1:N} | x; \theta \right)}\right\rvert < \epsilon \right) } = 1$$
This theorem is equivalent to the statement: for sufficiently large $N$, the probability of any length-$N$ high-probability output $y_{1:N}$ under $\tilde{P}$ can be approximated as $\exp\left(-N\bar{H}(Y_{1:N} | x; \theta)\right)$, rendering log-probability asymptotically ineffective for distinguishing among them.
As an empirical demonstration, we plot the standard deviation of the average negative log-likelihood of Llama-3-8B-Instruct over multiple datasets in Figure 2, where we can see that with the increased output length, the difference between length-averaged entropy and negative log-likelihood (NLL) is reduced, and the standard deviation of average NLL also quickly reduces within the first 50 output tokens.
Therefore, for long sequences, we can estimate BF using the NLL of sampled sequences as:
$$B(x; \theta) \approx \exp\left(-\frac{1}{M} \sum_{i=1}^M \frac{1}{|y^{(i)}|}\log \tilde{P}\left(y^{(i)}_{1:|y^{(i)}|} | x; \theta \right)\right)$$
This approach allows us to compute BF in a sample-efficient way. For task-wise BF, we simply compute it via averaging all instance-wise BF: $B(X; \theta) = \sum_{x} p(x) B(x; \theta)$.
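A sketch of the AEP-based estimator, assuming each sampled sequence comes with its total log-probability under the truncated sampling distribution $\tilde{P}$ (e.g., the sum of sampled-token log-probs returned by the decoder); names are illustrative:

```python
import numpy as np

def bf_from_nll(sequence_logprobs, sequence_lengths):
    """Long-output BF estimate from sampled sequences.

    sequence_logprobs[i]: log \tilde{P}(y^{(i)} | x), i.e. the summed token log-probs.
    sequence_lengths[i]:  |y^{(i)}|.
    """
    per_token_nll = [-lp / n for lp, n in zip(sequence_logprobs, sequence_lengths)]
    return float(np.exp(np.mean(per_token_nll)))        # exp of mean length-normalized NLL

def task_bf(instance_bfs, instance_weights=None):
    """Task-level BF as a (weighted) average of instance-level BFs."""
    return float(np.average(instance_bfs, weights=instance_weights))

# Example: M = 3 long samples at ~0.2 nats/token -> BF ~ 1.22
print(bf_from_nll([-40.0, -40.0, -42.0], [200, 205, 210]))
```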
Benchmarking and Attributing Branching Factors
In this section, we describe our experimental setup for computing BF, including the models, tasks, and the impact factors that may influence BF.
Models and Sampling
We run experiments on models from the Llama-2 and Llama-3 families, covering both base and aligned (Instruct) variants across multiple sizes.
We set $M=50$ sequences to estimate BF, which yields a reliable estimation across datasets in prior studies. For aligned models, we apply the official chat templates to prompts. In addition, we carefully control the lengths of all inputs plus outputs to be within the context window of the models.
Tasks
We consider a variety of tasks covering common application scenarios of LLM generation, including reasoning and open-ended generation: MMLU, Cognac, BBCLatestNews, and Story Generation.
Impact Factors (IFs)
We consider modulating these factors that may impact BF computations: Prompt Complexity ($C$), Alignment Tuning $(AT \in \{\text{Instruct},\text{Base}\})$, Model Size $(S \in \{8\text{B}/13\text{B},70\text{B}\})$, and Model Generation $(G \in \{2,3\})$. $C$ controls the informativeness of the input prompt $x$ (e.g., the number of banned words in Cognac, the number of in-context samples in MMLU). Intuitively, providing more information in $x$ should make the model more confident in its outputs, resulting in a lower BF. $AT, S, G$ represent model-wise variations to explore how different configurations of $\theta$ affect $B(X; \theta)$.
BF Dynamics over the Generation Process
Both BF and the output length $N$ are functions of the output $Y$, and the BF computation relies on $N$. To avoid confounding effects, we first analyze how BF varies with $N$ before intervening on the IFs. We demonstrate BF trajectories over different output positions by running Llama-3-70B and Llama-3-70B-Instruct on three representative tasks. Specifically, we compute BF over every five output tokens, conditioning on the prompt and all previously generated output tokens.
First, the average BF for the base model ($\approx 12$) is roughly ten times higher than that of the aligned model ($\approx 1.2$). Aligned models therefore leave very few candidate next tokens for decoding to truncate, which explains why decoding methods exert weaker effects on them. Second, in most cases BF drops smoothly as more output tokens are generated. Within the same task, when $C>0$, varying $C$ mainly changes the starting point and the rate of decrease, while the trajectories eventually converge to roughly the same value. When almost no knowledge is provided ($C=0$), the output ends much earlier than in the $C > 0$ cases. These findings suggest that future token generation becomes gradually more predictable and that the model may follow a certain generation plan, resonating with recent observations in interpretability research.
Pareto Analysis of BF
We perform a Pareto analysis to identify the relative influence of the IFs on BF. For each factor $D_i$, we define the unnormalized impact $\tilde{I}(D_i)$ as the average absolute pairwise difference in BF when varying $D_i$ while holding the other dimensions constant:
$$\tilde{I}(D_i) = \frac{ \sum_{d_i, d_j \in \text{Domain}(D_i),\, d_i \neq d_j} \left|\text{Avg}\left(\text{B}(\cdot \mid D_i=d_i)\right) - \text{Avg}\left(\text{B}(\cdot \mid D_i=d_j)\right)\right|}{|\text{Domain}(D_i)| \cdot \left(|\text{Domain}(D_i)| - 1\right)}$$
Then we normalize it as ${I}(D_i)=\frac{\tilde{I}(D_i)}{\sum \tilde{I}(D_i)}$.
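A sketch of this impact computation, assuming the per-configuration BF results are stored as a flat table of records; the data layout and helper names are assumptions for illustration:

```python
from itertools import combinations
import numpy as np

def unnormalized_impact(bf_table, factor):
    """Average absolute pairwise difference in mean BF across the factor's values.

    bf_table: list of dicts, e.g. {"AT": "Base", "S": "70B", "G": 3, "C": 1, "BF": 11.8}.
    """
    values = sorted({row[factor] for row in bf_table})
    means = {v: np.mean([row["BF"] for row in bf_table if row[factor] == v]) for v in values}
    pairs = list(combinations(values, 2))          # |a-b| is symmetric, so unordered pairs suffice
    if not pairs:
        return 0.0
    return sum(abs(means[a] - means[b]) for a, b in pairs) / len(pairs)

def normalized_impacts(bf_table, factors):
    """Normalize each factor's impact so the impacts sum to 1."""
    raw = {f: unnormalized_impact(bf_table, f) for f in factors}
    total = sum(raw.values())
    return {f: raw[f] / total for f in factors}
```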
The results indicate that alignment tuning is the most influential factor affecting BF, surpassing model size, model generation, and prompt complexity by a large margin. For tasks with richer inputs--such as MMLU (with more in-context examples) and BBCLatestNews (with more headlines)--prompt complexity $C$ and model size $S$ emerge as the next most impactful factors. In contrast, for open-ended tasks like Cognac and Story Generation, model generation $G$--particularly the improvement from Llama-2 to Llama-3--plays a more dominant role. This shift likely reflects gains from the use of larger, more diverse training datasets.
Curious Case of Prompt Complexity
Intuitively, greater prompt specificity (larger $C$) reduces BF by narrowing the model's output space through more informative context. However, our experimental results reveal task-varied effects. For the Cognac task, greater prompt complexity can increase BF--potentially due to the cognitive burden of processing negation or complex linguistic structures. In contrast, for tasks like News Generation, higher $C$ generally leads to lower BF, consistent with the expected narrowing of output diversity.
Application: Variance Reduction and Risks of Mid-Generation Forking
Building on our findings that BF declines over the generation process and is lower in aligned models, we derive a practical implication: aligned CoT models, by starting with a low BF and deferring decisive tokens to later, lower-BF stages, shrink the output space more aggressively and produce fewer high-probability variants. To test this, we evaluate output variability on MMLU-STEM using 200 samples per model, measuring the standard deviation of Majority@K accuracy for $K \in \{1, 3, 8, 16\}$ under temperature $T=0.6$ and truncation threshold $p=0.9$.
Model | Maj@1 Std | Maj@3 Std | Maj@8 Std | Maj@16 Std | BF@1 | BF |
---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Llama-70B | 14.34 | 8.29 | 4.99 | 3.21 | 1.77 | 1.23 |
Llama-3-70B-Instruct | 16.37 | 11.40 | 7.50 | 5.12 | 2.44 | 1.28 |
Llama-3-70B | 27.78 | 19.53 | 13.22 | 9.23 | 2.41 | 1.31 |
DeepSeek-R1-Distill-Llama-8B | 27.10 | 20.91 | 13.93 | 9.14 | 1.77 | 1.23 |
Llama-3.1-8B-Instruct | 31.54 | 24.64 | 17.30 | 12.90 | 2.73 | 1.31 |
Llama-3.1-8B | 36.41 | 29.78 | 20.43 | 14.05 | 2.53 | 1.35 |
Table 2: Majority Voting@K standard deviation on MMLU-STEM with 200 samples. We compute the standard deviation over 100 bootstrapping trials, each using 64 samples per instance. We set $T=0.6, p=0.9$ to match standard benchmarking settings. Lower temperature concentrates probability mass on fewer tokens, reducing BF and making direct comparisons more difficult. Still, BF remains a strong predictor of standard deviation.
As shown in Table 2, among models with similar capacity, those with lower BF--especially the aligned CoT model--exhibit markedly lower variance. This confirms that BF is a reliable predictor of sampling stability.
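For concreteness, a simplified sketch of a bootstrapped Majority@K standard deviation of the kind reported in Table 2, assuming a matrix of per-instance 0/1 correctness labels for the sampled generations (the exact resampling scheme used for Table 2 may differ; names are illustrative):

```python
import numpy as np

def majority_at_k_std(correct, k, trials=100, seed=0):
    """Bootstrapped std of Majority@K accuracy.

    correct: (num_instances, num_generations) array of 0/1 correctness labels.
    Each trial resamples k generations per instance (with replacement); an instance
    counts as correct when a strict majority of the k votes is correct.
    """
    rng = np.random.default_rng(seed)
    num_instances, num_generations = correct.shape
    accuracies = []
    for _ in range(trials):
        idx = rng.integers(0, num_generations, size=(num_instances, k))
        votes = np.take_along_axis(correct, idx, axis=1).sum(axis=1)
        accuracies.append((votes > k / 2).mean())
    return float(np.std(accuracies))

# Example with synthetic labels: 100 instances, 64 generations each, ~70% per-sample accuracy.
rng = np.random.default_rng(1)
labels = (rng.random((100, 64)) < 0.7).astype(int)
print(majority_at_k_std(labels, k=8))
```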
How does Alignment Tuning Impact BF?
Why does alignment tuning exert such a pronounced effect on BF? Building on the superficial alignment hypothesis, we conjecture that alignment tuning does not fundamentally change the base model's distribution; rather, it steers generation toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model.
To test this hypothesis, we reproduce the nudging experiments, in which an aligned "nudging" model contributes short stylistic token prefixes during a base model's generation, and we measure how this intervention changes BF.
The results indicate that most nudging interventions occur early in the generation process, suggesting that the prefixes supplied by the nudging model are of low probability under the base model; once such a prefix is in place, the base model continues along a low-entropy trajectory on its own. These observations collectively support our hypothesis. Considering that nudging not only reduces BF but also improves performance on these tasks, steering base models toward low-BF trajectories offers a lightweight way to obtain more stable, aligned-like outputs.