Video: Demonstration of Branching Factor concept in LLM generation

Investigating LLM Probability Concentration via Branching Factor (BF)

Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution.

To quantify this concentration, we introduce the Branching Factor (BF)--a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings:

  • BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate
  • Alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models
Figure: Illustration of the Branching Factor (BF) concept.

We illustrate the concept above. Building on this insight, we find this stability has surprising implications for complex reasoning.

Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models) leverage this effect by generating longer reasoning chains, pushing generation into later, more deterministic (lower BF) stages, resulting in more stable outputs.

We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF.

Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs -- shedding light on

  • How alignment reduces variability
  • How CoT promotes stable generations
  • How base models can be steered away from diversity

Preliminaries and Notation

Case Study: Is Decoding Method Still Crucial for Modern LLMs?

Many prevalent decoding methods were introduced before LLMs scaled to billions of parameters and underwent multiple training stages. Additionally, model developers adopt different decoding strategies when reporting LLM capabilities, raising questions about the significance of decoding choices for modern LLMs. To explore this, we benchmark various decoding methods on standard LLM reasoning tasks, extending prior work to the latest models, including DeepSeek-distilled models, which generate a long CoT before the final answer. Specifically, we evaluate model performance on MMLU-STEM under CoT prompting across different temperatures ($T=0.6 / 1.0$) in temperature sampling and truncation thresholds ($p=0.9 / 1.0$) in nucleus sampling.

| Models | Default ($T=0.6, p=0.9$) | $T=0.6, p=1.0$ | $T=1.0, p=0.9$ | Min ($T=1.0, p=1.0$) | $\frac{\text{Default}-\text{Min}}{\text{Default}}\%$ |
| --- | --- | --- | --- | --- | --- |
| Llama-3-70B-Instruct | 78.50 ($\pm$ 2.09) | 77.60 ($\pm$ 2.23) | 77.50 ($\pm$ 2.60) | 75.90 ($\pm$ 2.85) | 3.31 |
| Llama-3-70B | 78.00 ($\pm$ 3.52) | 74.00 ($\pm$ 3.80) | 72.00 ($\pm$ 4.38) | 63.50 ($\pm$ 5.02) | 18.59 |
| DeepSeek-R1-Distill-Llama-8B | 66.30 ($\pm$ 3.51) | 65.70 ($\pm$ 3.84) | 62.70 ($\pm$ 4.14) | 59.70 ($\pm$ 4.65) | 9.95 |
| Llama-3.1-8B-Instruct | 63.00 ($\pm$ 4.01) | 61.50 ($\pm$ 4.37) | 57.50 ($\pm$ 4.92) | 50.50 ($\pm$ 5.34) | 19.84 |
| Llama-3.1-8B | 54.00 ($\pm$ 4.61) | 53.50 ($\pm$ 4.92) | 47.00 ($\pm$ 5.21) | 37.00 ($\pm$ 5.48) | 31.48 |

Table 1: Experimental results across decoding methods on the STEM subset of MMLU. We follow the common practice of using 5-shot CoT prompting. $\frac{\text{Default}-\text{Min}}{\text{Default}}\%$ indicates the maximum relative performance drop when deviating from the default decoding configuration.

The results in Table 1 reveal that for aligned models, decoding configurations have a limited impact -- typically around 10% (up to 20%) relative performance change. Among the Llama-3.1-8B-based models, DeepSeek-distilled Llama-8B (based on Llama-3.1-8B), which is trained to generate long CoT, exhibits the smallest relative performance change. In contrast, base models exhibit greater sensitivity, with performance varying by up to 31%. Additionally, lowering the temperature ($T$) generally improves performance across all models more than adjusting the truncation threshold ($p$), though excessive reduction (e.g., greedy decoding at $T=0$) may lead to repetition issues.

Based on these observations and findings in existing literature, we propose the following hypotheses:

Hypothesis 1:

Aligned models produce tokens with a more concentrated distribution than base models.

Hypothesis 2:

Larger models have more concentrated distributions than smaller models, though the effect may vary by task.

Hypothesis 3:

As LLMs generate more tokens, their next-token prediction probability distribution becomes increasingly concentrated.

Researchers often assess probability concentration using token-level metrics such as entropy or log-likelihood. However, these offer only a narrow lens on model behavior: they capture local properties but miss the global structure of the output space--how probability mass is distributed across plausible sequences. This motivates our proposal of the BF as a structural measure of generative breadth.

Measuring Branching Factor

The generative process of language models can be viewed as moving down a branching tree, with each token choice selecting a path forward. While the full tree spans $O(|V|^N)$ sequences for vocabulary size $|V|$ and sequence length $N$, LLMs concentrate probability mass on a far smaller subset. To capture how many options the model seriously considers at each step, we introduce the Branching Factor (BF). Given $|\mathcal{T}|$ high-probability leaf sequences, we approximate the tree as a balanced $B$-ary tree, where $B = |\mathcal{T}|^{1/N}$. In this section, we describe how to compute $|\mathcal{T}|$ and $B$ in practice.

Intuitive Perspective: Exponentiated Entropy (Perplexity) as Branches

We propose to use the exponentiated entropy (perplexity) to quantify $|\mathcal{T}|$: $|\mathcal{T}| \stackrel{\text{def}}{=} \exp\left({H}(Y_{1:N} | x; \theta)\right)$. This reflects the effective number of equally probable outcomes carrying the same total uncertainty. Analogously, it is like sampling from a fair $|\mathcal{T}|$-sided die, whose entropy equals $-\sum \frac{1}{|\mathcal{T}|}\log {\frac{1}{|\mathcal{T}|}} = {H}(Y_{1:N} | x; \theta)$. Thus, $B(x;\theta) = \exp\left({\bar{H}(Y_{1:N} | x; \theta)}\right)$, where ${\bar{H}(Y_{1:N} | x; \theta)}=\frac{1}{N} \tilde{H}\left(Y_{1:N} | x; \theta \right)$ is the average entropy per output token up to position $N$. A larger $B(x; \theta)$ indicates a greater potential for diverse outputs.
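As a quick sanity check of the die analogy, consider a single position with a uniform distribution over four tokens:

$$H = -\sum_{k=1}^{4} \tfrac{1}{4}\log\tfrac{1}{4} = \log 4, \qquad \exp(H) = 4,$$

so the exponentiated entropy recovers exactly the four equally plausible branches, whereas a sharply peaked distribution over the same four tokens would yield a value close to 1.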

For short outputs, where it is tractable to sample enough sequences to closely estimate the conditional entropy at each position, we can estimate BF by computing these per-position conditional entropies and aggregating them as:

$$B(x; \theta) \approx \exp\left(\frac{1}{M} \sum_{i=1}^M \frac{\sum_{t=1}^{|y^{(i)}|} \tilde{H}(Y_t | [x, y_{1:t-1}^{(i)}]; \theta)}{|y^{(i)}|}\right)$$

where $\tilde{H}(Y_t | [x, y_{1:t-1}^{(i)}]; \theta)$ is the entropy of the distribution at position $t$ for sample $i$.
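As a reference point, the following is a minimal sketch of this short-output estimator using Hugging Face transformers; the model name, sampling settings ($p=0.9$, $T=1.0$), and sample count are placeholders chosen to mirror the setup described later, not a prescribed implementation.

```python
import torch
from torch.distributions import Categorical
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def branching_factor_short(prompt: str, num_samples: int = 50, max_new_tokens: int = 64) -> float:
    """Estimate BF = exp(average per-token conditional entropy) from sampled continuations."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    per_sample_avg_entropy = []
    for _ in range(num_samples):
        out = model.generate(
            **inputs,
            do_sample=True, top_p=0.9, temperature=1.0,
            max_new_tokens=max_new_tokens,
            return_dict_in_generate=True, output_scores=True,
        )
        # out.scores holds one (1, vocab_size) tensor of decoding-adjusted logits per
        # generated position, i.e. the distribution that was actually sampled from.
        step_entropies = [
            Categorical(logits=step_scores.float()).entropy().item()
            for step_scores in out.scores
        ]
        per_sample_avg_entropy.append(sum(step_entropies) / len(step_entropies))
    # Exponentiate the per-token entropy averaged over samples, as in the equation above.
    return float(torch.tensor(per_sample_avg_entropy).mean().exp())
```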

Practical BF Estimator via Asymptotic Equipartition Property

While the above approach works well for short outputs, it becomes challenging for longer sequences, since we can only sample a tiny fraction of the exponentially large output space. In such cases, we show that when LLMs generate sufficiently long outputs, the average log-probability of each output sequence is roughly the same and approximates the average output entropy well, following the Asymptotic Equipartition Property (AEP). The original AEP proof requires additional assumptions about the generation process, such as stationarity and ergodicity, which are often violated by LLMs. However, as noted in prior work, these assumptions are unnecessary if we do not require ${\bar{H}\left(Y_{1:N} | x; \theta \right)}$ to converge to a constant:

Theorem (AEP for LLMs):

Given $0 < \epsilon < 1$, we have:

$$\lim_{N \rightarrow \infty}{P\left( \left\lvert -\frac{1}{N}\log \tilde{P}\left(y_{1:N} | x; \theta \right) - {\bar{H}\left(Y_{1:N} | x; \theta \right)}\right\rvert < \epsilon \right) } = 1$$

This theorem is equivalent to the statement: for sufficiently large $N$, the probability of any length-$N$ high-probability output $y_{1:N}$ under $\tilde{P}$ can be approximated as $\exp\left(-N\bar{H}(Y_{1:N} | x; \theta)\right)$, rendering log-probability asymptotically ineffective for distinguishing among them.

As an empirical demonstration, we plot the length-averaged entropy and negative log-likelihood (NLL), together with the standard deviation of the average NLL, for Llama-3-8B-Instruct over multiple datasets in Figure 2. As output length increases, the gap between the length-averaged entropy and NLL shrinks, and the standard deviation of the average NLL also drops quickly within the first 50 output tokens.

Figure 2a: MMLU - Length-averaged NLL closely tracks length-averaged Entropy.
Figure 2b: BBC News - Length-averaged NLL closely tracks length-averaged Entropy.
Figure 2c: MMLU - Standard deviation of length-averaged NLL diminishes with output length.
Figure 2d: BBC News - Standard deviation of length-averaged NLL diminishes with output length.

Therefore, for long sequences, we can estimate BF using the NLL of sampled sequences as:

$$B(x; \theta) \approx \exp\left(-\frac{1}{M} \sum_{i=1}^M \frac{1}{|y^{(i)}|}\log \tilde{P}\left(y_{1: |y^{(i)}|} | x; \theta \right)\right)$$

This approach allows us to compute BF in a sample-efficient way. For task-wise BF, we simply average the instance-wise BF over prompts: $B(X; \theta) = \sum_{x} p(x) B(x; \theta)$.
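The long-sequence estimator is a small variation of the sketch above: score each sampled continuation under the same sampling distribution and exponentiate the mean length-normalized NLL. A minimal sketch under the same placeholder model and decoding settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder, as in the sketch above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def branching_factor_long(prompt: str, num_samples: int = 50, max_new_tokens: int = 512) -> float:
    """AEP-based BF estimate: exp(mean length-normalized NLL) of sampled continuations."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    per_sample_nll = []
    for _ in range(num_samples):
        out = model.generate(
            **inputs,
            do_sample=True, top_p=0.9, temperature=1.0,
            max_new_tokens=max_new_tokens,
            return_dict_in_generate=True, output_scores=True,
        )
        # Log-probability of each token that was actually sampled, one per output position.
        token_logprobs = model.compute_transition_scores(
            out.sequences, out.scores, normalize_logits=True
        )
        per_sample_nll.append(-token_logprobs[0].float().mean())
    # Average the length-normalized NLL over samples, then exponentiate.
    return float(torch.stack(per_sample_nll).mean().exp())

# Task-wise BF is then the prompt-weighted average of instance-wise BF:
# B_task = sum(p_x * branching_factor_long(x) for x, p_x in prompt_distribution)
```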

Benchmarking and Attributing Branching Factors

In this part, we introduce the settings for our BF computation experiments, including the models, tasks, and impact factors that influence BF.

Models and Sampling

We run experiments on models from the Llama-2 and Llama-3 families, as they are widely used open-weight model families. For each family, we include both base and aligned models to investigate how alignment tuning affects BF. We sample outputs with $p=0.9$ and $T=1.0$ to conform with the common setting used for most datasets.

We use $M=50$ sequences to estimate BF, which yields reliable estimates across datasets in prior studies. For aligned models, we apply the official chat templates to prompts. In addition, we carefully control the lengths of all inputs plus outputs to stay within the models' context windows.

Tasks

We consider a variety of tasks covering common application scenarios of LLM generation, including reasoning and open-ended generation: MMLU (Reasoning), Cognac (Controlled Generation), BBCLatestNews (News Generation), and Creative Story Generation (Creative Generation). To test subjective randomness bias, we also prepare a synthetic task, Random Strings, where the prompt consists of randomly sampled characters.

Impact Factors (IFs)

We modulate the following factors that may impact BF: Prompt Complexity ($C$), Alignment Tuning $(AT \in \{\text{Instruct},\text{Base}\})$, Model Size $(S \in \{8\text{B}/13\text{B},70\text{B}\})$, and Model Generation $(G \in \{2,3\})$. $C$ controls the informativeness of the input prompt $x$ (e.g., the number of banned words in Cognac, the number of in-context samples in MMLU). Intuitively, providing more information in $x$ should make the model more confident in its outputs, resulting in a lower BF. $AT$, $S$, and $G$ represent model-wise variations to explore how different configurations of $\theta$ affect $B(X; \theta)$.

BF Dynamics in the Generation Process

Both BF and the output length $N$ are functions of the output $Y$, and the BF computation relies on $N$. To avoid confounding effects, we first analyze how BF varies with $N$ before intervening on the IFs. We show BF trajectories over different output positions by running Llama-3-70B and Llama-3-70B-Instruct on three representative tasks. Specifically, we compute BF over every five output tokens, conditioning on the prompt and all previously generated output tokens.
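To make the windowed computation concrete, here is a small helper, assuming the per-token entropies (or NLLs) have already been collected as in the sketches above; it averages within non-overlapping windows of five output positions and exponentiates, yielding one BF value per window.

```python
import math
from typing import List

def bf_trajectory(per_token_values: List[float], window: int = 5) -> List[float]:
    """BF per window: exponentiate the mean per-token entropy (or NLL) within each
    block of `window` consecutive output positions."""
    trajectory = []
    for start in range(0, len(per_token_values), window):
        chunk = per_token_values[start:start + window]
        trajectory.append(math.exp(sum(chunk) / len(chunk)))
    return trajectory
```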

Figure 3a: Creative Story Generation - Llama-3-70B Base
Figure 3b: Random Strings - Llama-3-70B Base
Figure 3c: BBC News - Llama-3-70B Base
Figure 3d: Creative Story Generation - Llama-3-70B-Instruct
Figure 3e: Random Strings - Llama-3-70B-Instruct
Figure 3f: BBC News - Llama-3-70B-Instruct

As we can see, first, the average BF for the base model ($\approx 12$) is roughly ten times higher than for the aligned model ($\approx 1.2$). Consequently, for aligned models there are very few candidate next tokens left to truncate during decoding, which explains why decoding methods exert weaker effects on aligned models. Also, in most cases, BF drops smoothly as more output tokens are generated. Under the same task, when $C>0$, different values of $C$ mainly control the starting point and the rate of decrease, and the trajectories eventually converge to roughly the same point. When almost no knowledge is provided ($C=0$), the output ends much earlier than in the $C > 0$ cases. These findings also support the view that future token generation gradually becomes more predictable and that the model may follow a certain generation plan, resonating with recent observations in interpretability and inference acceleration.

Pareto Analysis of BF

We perform a Pareto analysis to identify the relative influence of the IFs on BF. For each factor $D_i$, we define the unnormalized impact $\tilde{I}(D_i)$ as the average absolute pairwise difference in BF when varying $D_i$ while holding the other dimensions constant:

$$\tilde{I}(D_i) = \frac{ \sum_{d, d' \in \text{Domain}(D_i),\, d \neq d'} \left|\text{Avg}\left(\text{B}(\cdot \mid D_i=d)\right) - \text{Avg}\left(\text{B}(\cdot \mid D_i=d')\right)\right|}{|\text{Domain}(D_i)| \left(|\text{Domain}(D_i)| - 1\right)}$$

Then we normalize it as ${I}(D_i)=\frac{\tilde{I}(D_i)}{\sum_j \tilde{I}(D_j)}$.
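As a direct transcription of these two definitions, the sketch below assumes a hypothetical mapping from each value of a factor to the average BF measured with the other dimensions held fixed; the example numbers are illustrative only (the Base/Instruct pair echoes the roughly 12 vs. 1.2 gap reported earlier, the rest are made up).

```python
from itertools import permutations
from typing import Dict

def unnormalized_impact(avg_bf_by_value: Dict[str, float]) -> float:
    """Average absolute pairwise difference of average BF across a factor's values."""
    pairs = list(permutations(avg_bf_by_value, 2))  # ordered pairs d != d'
    return sum(abs(avg_bf_by_value[a] - avg_bf_by_value[b]) for a, b in pairs) / len(pairs)

def normalized_impacts(raw: Dict[str, float]) -> Dict[str, float]:
    """Normalize the unnormalized impacts so they sum to one."""
    total = sum(raw.values())
    return {factor: value / total for factor, value in raw.items()}

# Illustrative usage with hypothetical average-BF values per factor value.
raw_impacts = {
    "AT": unnormalized_impact({"Base": 12.0, "Instruct": 1.2}),
    "S": unnormalized_impact({"8B": 3.0, "70B": 2.4}),
}
print(normalized_impacts(raw_impacts))
```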

Figure 4a: Cognac - Pareto Analysis
Figure 4b: MMLU - Pareto Analysis
Figure 4c: BBC News - Pareto Analysis
Figure 4d: Creative Story Generation - Pareto Analysis

The results indicate that alignment tuning is the most influential factor affecting BF, surpassing model size, model generation, and prompt complexity by a large margin. For tasks with richer inputs--such as MMLU (with more in-context examples) and BBCLatestNews (with more headlines)--prompt complexity $C$ and model size $S$ emerge as the next most impactful factors. In contrast, for open-ended tasks like Cognac and Story Generation, model generation $G$--particularly improvements from Llama-2 to Llama-3--plays a more dominant role. This shift likely reflects gains from the use of larger, more diverse training datasets.

Curious Case of Prompt Complexity

Intuitively, greater prompt specificity (larger $C$) should reduce BF by narrowing the model's output space through more informative context. However, our experimental results reveal effects that vary by task. For the Cognac task, greater prompt complexity can increase BF--potentially due to the cognitive burden of processing negation or complex linguistic structures. In contrast, for tasks like news generation, higher $C$ generally leads to lower BF, consistent with the expected narrowing of output diversity.

Figure A.3a: Cognac - BF vs Prompt Complexity
Figure A.3b: Creative Story Generation - BF vs Prompt Complexity
Figure A.3c: Random Strings - BF vs Prompt Complexity
Figure A.3d: BBC News - BF vs Prompt Complexity

Application: Variance Reduction and Risks of Mid-Generation Forking

Building on our findings that BF declines over the generation process and is lower in aligned models, we derive a practical implication: aligned CoT models, by starting with low BF and delaying decisive tokens, shrink the output space more aggressively and produce fewer high-probability variants. To test this, we evaluate output variability on MMLU-STEM using 200 samples per model, measuring the standard deviation of Majority@K accuracy for $K = 1, 3, 8, 16$ under temperature $T=0.6$ and truncation threshold $p=0.9$.

| Model | Maj@1 Std | Maj@3 Std | Maj@8 Std | Maj@16 Std | BF@1 | BF |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Llama-70B | 14.34 | 8.29 | 4.99 | 3.21 | 1.77 | 1.23 |
| Llama-3-70B-Instruct | 16.37 | 11.40 | 7.50 | 5.12 | 2.44 | 1.28 |
| Llama-3-70B | 27.78 | 19.53 | 13.22 | 9.23 | 2.41 | 1.31 |
| DeepSeek-R1-Distill-Llama-8B | 27.10 | 20.91 | 13.93 | 9.14 | 1.77 | 1.23 |
| Llama-3.1-8B-Instruct | 31.54 | 24.64 | 17.30 | 12.90 | 2.73 | 1.31 |
| Llama-3.1-8B | 36.41 | 29.78 | 20.43 | 14.05 | 2.53 | 1.35 |

Table 2: Majority Voting@K standard deviation on MMLU-STEM with 200 samples. We compute the standard deviation over 100 bootstrapping trials, each using 64 samples per instance. We set $T=0.6, p=0.9$ to match standard benchmarking settings. Lower temperature concentrates probability mass on fewer tokens, reducing BF and making direct comparisons more difficult. Still, BF remains a strong predictor of standard deviation.

As shown in Table 2, among models with similar capacity, those with lower BF--especially the aligned CoT model--exhibit markedly lower variance. This confirms that BF is a reliable predictor of sampling stability.
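For reference, the sketch below is one plausible reading of the bootstrapped Majority@K standard deviation described in the Table 2 caption; `answers[i]` is assumed to hold the 200 sampled answers for question $i$ and `gold[i]` the reference answer (both hypothetical names), with trial counts mirroring the caption.

```python
import random
from collections import Counter
from statistics import pstdev
from typing import List, Sequence

def majority_at_k(samples: Sequence[str], k: int) -> str:
    """Majority vote over k answers drawn without replacement from the pool."""
    draws = random.sample(list(samples), k)
    return Counter(draws).most_common(1)[0][0]

def maj_at_k_std(answers: List[Sequence[str]], gold: List[str], k: int,
                 trials: int = 100, samples_per_instance: int = 64) -> float:
    """Std of Majority@K accuracy over bootstrap trials, each subsampling per-instance answers."""
    accuracies = []
    for _ in range(trials):
        correct = 0
        for instance_answers, reference in zip(answers, gold):
            pool = random.sample(list(instance_answers), samples_per_instance)
            correct += majority_at_k(pool, k) == reference
        accuracies.append(100.0 * correct / len(gold))
    return pstdev(accuracies)
```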

Figure 5: Resampling from different output positions to assess the effect of interrupting BF reduction. We resample new continuations at the 25th and 200th output token of DeepSeek-Distilled Llama-8B MMLU outputs. Results show substantial performance drops at both positions.

How does Alignment Tuning Impact BF?

Why does alignment tuning exert such a pronounced effect on BF? Building on the superficial alignment hypothesis ("Alignment tuning might simply teach base LLMs to select a subdistribution of data formats for interacting with users.") and recent work on tuning-free alignment, we hypothesize that base models already encode low-entropy conditional distributions. In this view, alignment tuning does not reshape generation from scratch, but instead nudges the model toward stylistic tokens (e.g., "Sure"), thereby narrowing the conditional output distribution.

To test this hypothesis, we reproduce the nudging experiments over the Just-Eval-Instruct and MMLU datasets. We employ Llama-3-70B to draft most of the output. However, when the base model's top-1 probability is low, we apply nudging by switching to Llama-3-8B-Instruct to generate a single word. BF is computed as in the previous experiments.

The results show that nudging mostly occurs early in the generation process -- indicating that the prefix generated by the nudging model is of low probability under the base model. These observations collectively support our hypothesis. Considering that nudging not only reduces BF but also improves performance on these tasks, our results highlight the dual effect of alignment training: reducing BF while preserving or even enhancing task performance.

Figure A.4: Output Perplexity Dynamics in Nudging Experiments.
Figure A.5: Nudging Ratio Histogram.
Figure A.4a: Just-Eval-Instruct - BF Dynamics with Nudging
Figure A.4b: MMLU - BF Dynamics with Nudging
Figure A.4c: Just-Eval-Instruct - Model Frequency Histogram
Figure A.4d: MMLU - Model Frequency Histogram