Rethinking Reflection in Pre-Training
Apr 23, 2025
In this post, we want to share some findings from our recent research at Essential AI that might challenge a commonly held belief: that reflection—the ability of language models to recognize and correct their own mistakes—only emerges during fine-tuning or reinforcement learning.
We set out to investigate whether this reflective ability might actually begin to take shape much earlier—during pre-training itself. To our surprise, the answer appears to be yes.
What We Mean by "Reflection"
In everyday use, reflection means thinking about how we think: noticing a mistake in our reasoning and correcting course when necessary. For language models, we define reflection similarly: can a model detect flaws in reasoning and revise its response?
This is more than a philosophical question. If we want language models that are accurate, adaptive, and capable of backtracking from their mistakes, reflection is a crucial building block. The earlier we can cultivate it, the more efficiently we can train models and the more useful they can become.
Our Approach: Measuring Reflection During Pre-Training
To test for reflection, we created a set of adversarial datasets across a variety of domains, including math, code, and logical reasoning. Each dataset contains misleading or incorrect "chains-of-thought": reasoning steps that look plausible on the surface but lead to the wrong answer.
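To make this concrete, here is a minimal sketch of what one such record might look like. The schema and the toy math problem are our illustration here, not the exact format of the released datasets:

```python
# A minimal sketch of one adversarial record: a task, a plausible-looking
# but flawed chain-of-thought, and the correct answer. Field names are
# illustrative, not the exact schema of the released datasets.
adversarial_record = {
    "domain": "math",
    "question": "A shirt costs $20 after a 20% discount. What was the original price?",
    # Flawed reasoning: adds 20% of the *discounted* price back.
    "adversarial_cot": "20% of $20 is $4, so the original price was $20 + $4 = $24.",
    # The discounted price is 80% of the original, so the original is $20 / 0.8.
    "correct_answer": "$25",
}
```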
We tested whether models could navigate past the misleading reasoning and arrive at the correct answer. In doing so, we studied two distinct forms of reflection:
Situational reflection: when a model overcomes incorrect prior reasoning from another source, such as a different model.
Self-reflection: when a model corrects its own previous reasoning.
To encourage this behavior, we appended the word "Wait," as a simple trigger, much as a person might notice something is off and pause to rethink.
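Putting the pieces together, a prompt in either setting pairs the task with a flawed chain-of-thought and then appends the trigger. The template below is an illustrative sketch; the exact wording used in our evaluation may differ:

```python
def build_prompt(question: str, flawed_cot: str, trigger: str = "Wait,") -> str:
    """Assemble an evaluation prompt: the task, a flawed chain-of-thought,
    then the trigger inviting the model to pause and reconsider.

    Situational reflection: `flawed_cot` comes from another source, such
    as a different model. Self-reflection: `flawed_cot` is the model's own
    earlier incorrect reasoning, collected in a first generation pass.
    """
    return f"{question}\n\n{flawed_cot}\n\n{trigger}"

# Same toy example as above, inlined so this snippet stands alone.
prompt = build_prompt(
    question="A shirt costs $20 after a 20% discount. What was the original price?",
    flawed_cot="20% of $20 is $4, so the original price was $20 + $4 = $24.",
)
print(prompt)
```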
What We Found
1. Reflection Starts Earlier Than We Expected
We found evidence of reflection even in relatively small models, early in their training. For example, an OLMo-2 model with 7 billion parameters and just 198 billion training tokens was able to recognize and correct errors in reasoning.
With additional training compute, the ability to reflect became stronger and more consistent.
Figure 1: As pre-training compute increases, models are more likely to solve adversarial tasks using explicit reflection. Log pre-training compute is calculated as log(6 × N × D), where N is the number of model parameters and D is the number of training tokens.
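As a worked example, here is that calculation for the OLMo-2 checkpoint mentioned above (a back-of-envelope sketch; we assume a base-10 logarithm for the axis):

```python
import math

# Back-of-envelope pre-training compute in FLOPs: 6 * N * D,
# where N = model parameters and D = training tokens.
N = 7e9    # OLMo-2 with 7 billion parameters
D = 198e9  # 198 billion training tokens

flops = 6 * N * D
print(f"compute ≈ {flops:.2e} FLOPs")               # ≈ 8.32e+21
print(f"log10(compute) ≈ {math.log10(flops):.2f}")  # ≈ 21.92
```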
2. A Simple Prompt Like "Wait," Makes a Big Difference
We found that appending a single word—"Wait,"—was often enough to prompt the model to reconsider and correct its answer. This significantly improved outcomes and increased the rate of explicit reflection.
This suggests reflection is a capability that can be elicited, not just learned, and that models may already have some latent capacity for it during training.
Figure 2: Models can reflect on and correct adversarial reasoning even without a trigger, but the "Wait," trigger boosts explicit reflection almost as much as heavier-handed prompts like "Wait, I made a mistake".
3. Self-Reflection Is Harder, but It Emerges Too
When we asked models to review their own earlier mistakes, essentially giving them a second chance on problems they initially got wrong, we saw a lower success rate. That makes sense: by definition these problems were harder, since the models had already failed to answer them correctly the first time.
Nonetheless, the presence of reflection increased noticeably with training. In some tasks, models began reflecting on their prior reasoning before they consistently corrected it. That suggests reflection may emerge before, or in tandem with, deeper reasoning capabilities.
Figure 3: On a code task, we see small but noticeable amounts of self-reflection followed by self-correction, growing with additional pre-training.
Why This Matters
If models can demonstrate reflection during pre-training, that changes how we might think about model development and evaluation.
Training strategy: Our measurements may help identify specific datasets and data curricula that improve reflection at scale.
Evaluation: Traditional benchmarks often miss subtle early signs of reflection that model creators might want to nurture during pre-training. Our framework introduces ways to detect and distinguish between implicit and explicit reflection; a toy sketch of this distinction follows this list.
Safety and alignment: A model that can spot and fix its own mistakes is a model that's easier to trust, especially in applications where reliability is critical.
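As promised above, here is a toy illustration of the implicit/explicit distinction. This keyword heuristic is a deliberate simplification for exposition, not the detection method used in our framework:

```python
# Keyword heuristic for *explicit* reflection; a deliberate simplification,
# not the detection method used in our framework.
EXPLICIT_CUES = ("wait", "i made a mistake", "let me reconsider")

def classify_reflection(output: str, answered_correctly: bool) -> str:
    """Label a model response on an adversarial task.

    "explicit": the output verbalizes reconsideration via a cue phrase.
    "implicit": no cue phrase, but the model still reaches the correct
    answer despite the misleading chain-of-thought.
    "none": the model neither reflects nor recovers.
    """
    if any(cue in output.lower() for cue in EXPLICIT_CUES):
        return "explicit"
    return "implicit" if answered_correctly else "none"

print(classify_reflection(
    "Wait, the discount applies to the original price, so it was $25.", True,
))  # -> explicit
```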
Looking Ahead
This work is a first step in a broader exploration. There's much more to learn about how and why reflection arises—what types of training data promote it, what kinds of reasoning are easiest to reflect on, and how reflection during pre-training connects to more advanced reasoning at inference time.
We hope the tools and datasets we've released can help others explore these questions too.
Resources