Predicting Future Behaviors in Reasoning Models Enables Better Steering

Evgenii Kortukov1, Piotr Komorowski1, Florian Klein1, Paula Engl1, Gabriele Sarti2, Seong Joon Oh3, Sebastian Lapuschkin1, Wojciech Samek1

1Fraunhofer HHI 2Northeastern University 3KAIST

TL;DR: Extracting latent LLM features that encode future behaviors enables a novel, non-invasive form of text-level steering.

Interpretability tools that predict and control model behaviors typically rely on contrastive input pairs. This binary data hides the probabilistic nature of language model decision making. During reasoning, LLMs can keep track of multiple behavioral options before committing. We find that these future behavior distributions are reflected in their representations. We can steer the behavioral outcome by choosing the reasoning sentences that maximize the estimated probability of a future behavior.

Overview: detection vs. prediction features and the FPCG algorithm. Left, top: existing steering methods rely on contrastive input pairs and detect behaviors in text. Left, bottom: a distinct set of features enables predicting future behaviors. Right: the proposed FPCG algorithm samples candidate sentences and selects the one that maximizes the activation of a probe predicting future behavior likelihoods.

Probabilistic view of LRM decision-making

Analyzing a single roll-out of a reasoning model is not enough to understand its behavior. Instead, we resample the continuation after every reasoning step to estimate the probability that a given behavior occurs later in the response. This reveals the step at which the model commits to acting in a certain way.
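The resampling procedure can be sketched as a Monte Carlo estimate. This is a minimal illustration, not the authors' implementation: `sample_continuation` and `exhibits_behavior` are hypothetical callables standing in for the reasoning model and a behavior judge, and the toy stand-ins at the bottom exist only so the sketch runs without a language model.

```python
import random
from typing import Callable, List


def future_behavior_probabilities(
    steps: List[str],
    sample_continuation: Callable[[str], str],
    exhibits_behavior: Callable[[str], bool],
    num_samples: int = 20,
) -> List[float]:
    """Estimate, after each reasoning step, how often the behavior follows.

    Both callables are assumptions: the first samples a completion for a
    text prefix, the second labels a full response (e.g. refusal or not).
    """
    probabilities = []
    prefix = ""
    for step in steps:
        prefix += step
        hits = 0
        for _ in range(num_samples):
            completion = sample_continuation(prefix)
            if exhibits_behavior(prefix + completion):
                hits += 1
        # Fraction of resampled continuations that show the behavior.
        probabilities.append(hits / num_samples)
    return probabilities


def toy_sampler(prefix: str) -> str:
    # Toy "model": refuses more often once the prefix mentions risk.
    p_refuse = 0.9 if "risky" in prefix else 0.1
    if random.random() < p_refuse:
        return " I cannot help."
    return " Sure, here it is."


def toy_judge(response: str) -> bool:
    # Toy behavior label: did the response end in a refusal?
    return "cannot help" in response


random.seed(0)
steps = ["The user asks a question. ", "This seems risky. ", "I should decide. "]
probs = future_behavior_probabilities(steps, toy_sampler, toy_judge, num_samples=200)
```

In this toy setup the estimated refusal probability jumps after the second step, which is the kind of commitment point the per-step estimates are meant to expose.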

Behavior distribution dynamics for two example responses to the same prompt. SORRY-Bench refusal vs. comprehension evaluation of DeepSeek-R1-Distill-Llama-8B.

Predicting future behaviors ≠ detecting current behaviors

We train a probe to predict these future behavior probabilities. Typically, one can estimate the most likely behavior with high accuracy.
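One plausible way to fit such a probe is logistic regression with soft targets, where each training example pairs an activation vector with an estimated future behavior probability. The sketch below assumes that setup; the loss, optimizer, hyperparameters, and function names are illustrative choices rather than details confirmed by the source, and the data is synthetic.

```python
import math
import random
from typing import List, Sequence


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_soft_label_probe(
    activations: Sequence[Sequence[float]],
    target_probs: Sequence[float],
    lr: float = 0.5,
    epochs: int = 500,
) -> List[float]:
    """Fit a linear probe by full-batch gradient descent on cross-entropy
    against soft targets in [0, 1]. The last weight is the bias."""
    dim = len(activations[0])
    n = len(activations)
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, p in zip(activations, target_probs):
            # zip stops at len(x), so the bias is added separately.
            logit = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
            err = sigmoid(logit) - p  # d(cross-entropy)/d(logit)
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def probe_predict(weights: Sequence[float], x: Sequence[float]) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])


# Synthetic 2-d "activations" whose true probability is sigmoid(2*a - b).
random.seed(0)
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
y = [sigmoid(2.0 * a - 1.0 * b) for a, b in X]
probe = train_soft_label_probe(X, y)
```

Training on probabilities rather than binary labels lets the probe output a graded estimate, which is what a later selection step needs in order to rank candidate sentences.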

Binary classifiers trained on contrastive data, which are typically used for detection and steering, are poor predictors of future behaviors. Models represent behaviors that have already happened differently from behaviors that will happen later in the response.

Detection vs. prediction features. A linear probe trained on all response sentences (prediction features) substantially outperforms one trained only on the final answer activations (detection features) at predicting future behavior, across all behavioral datasets.

Steering the model towards a desired future behavior

Estimating the probability of a future behavior can be used to steer the model towards a desired outcome. We propose a simple algorithm, Future Probe Controlled Generation (FPCG), that samples multiple candidates for each sentence and selects the best one using an activation probe that predicts future behavior likelihoods.

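def future_probe_controlled_generation(
    model, prompt, future_probe,
    layer, num_candidates, direction,
):
    # finished, generate_sentence_candidates, and extract_activations are
    # helper functions assumed to be defined elsewhere.
    response = ""

    while not finished(response):
        # Sample several possible next sentences for the current context.
        candidates = generate_sentence_candidates(
            model, prompt + response, num_candidates,
        )

        # Score each candidate with the future-behavior probe
        # (assumed to return a single number per candidate).
        scores = []
        for candidate_sentence in candidates:
            acts = extract_activations(
                model,
                prompt + response + candidate_sentence,
                layer,
            )
            scores.append(future_probe(acts))

        # Keep the sentence with the highest score when steering towards
        # the behavior, and the lowest score when steering away from it.
        pick = max if direction == "positive" else min
        best_sentence, _ = pick(zip(candidates, scores), key=lambda pair: pair[1])
        response += best_sentence

    return response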
FPCG. For each reasoning step, the model generates several candidate sentences and chooses the one that maximizes (or, when steering away, minimizes) the activation of a prediction feature for the target behavior. Steering happens at the text level, without modifying activations.

We find that FPCG enables steering with less quality degradation than standard difference-in-means activation steering.
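For contrast, difference-in-means activation steering is commonly described as adding a scaled direction, computed from the mean activations of two contrastive sets, to the model's hidden state. The sketch below illustrates that general recipe on toy vectors; it is an assumption about the baseline's usual form, not the configuration used in the paper, and the function names and the scale `alpha` are hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    # Column-wise mean over a set of activation vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> List[float]:
    # Steering direction: mean(positive) minus mean(negative).
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(
    activation: Sequence[float],
    direction: Sequence[float],
    alpha: float,
) -> List[float]:
    # Shift the hidden state along the steering direction.
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations for inputs that do / do not show the behavior.
positive_acts = [[1.0, 0.0], [3.0, 2.0]]
negative_acts = [[0.0, 0.0], [0.0, 2.0]]
direction = difference_in_means(positive_acts, negative_acts)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

Unlike this intervention, FPCG leaves the hidden states untouched and only chooses among sentences the model itself sampled.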


Citation

If you build on this work, please cite our paper:

@inproceedings{kortukov2026predicting,
      title={Predicting Future Behaviors in Reasoning Models Enables Better Steering},
      author={Evgenii Kortukov and Piotr Komorowski and Florian Klein and Paula Engl and Gabriele Sarti and Seong Joon Oh and Sebastian Lapuschkin and Wojciech Samek},
      booktitle={Mechanistic Interpretability Workshop at ICML 2026},
      year={2026},
      url={https://openreview.net/forum?id=48NnVTsirb}
}