What Is Mechanistic Interpretability?
Modern AI models work astonishingly well and nobody fully knows how. Mechanistic interpretability is the effort to change that - to open the model up and read the machinery inside.
The black-box problem
A large language model is billions of numbers, sometimes hundreds of billions - the weights - arranged into layers. We didn't program those numbers. We set up a learning process, poured in trillions of words, and the numbers adjusted themselves until the model got good at predicting text. The result writes code, passes exams, and reasons through problems.
Here is the uncomfortable part: we can see every weight and still not know what the model is doing. The algorithm it learned isn't written down anywhere a human can read it. It's smeared across billions of parameters. That's the black box.
This matters more every month. We're handing these systems tools, file access, databases, and the ability to act on their own. “It usually works” is fine for autocomplete. It is not fine for a system that can email your customers or run code on your server.
The one-sentence version
Mechanistic interpretability is the science of reverse-engineering a neural network into human-understandable parts - figuring out which components do what, in what order, and by what algorithm. The analogy researchers use is decompiling a program with no source code: you have the binary (the weights) and you want the source back (the algorithm).
Interpretability vs. explainability vs. observability
These get used interchangeably and shouldn't be:
- Explainability studies the model from the outside - which inputs influenced an output. It describes correlations, not the mechanism.
- Observability is operational - what the system actually did in production, via logs and traces.
- Mechanistic interpretability goes inside and asks how the network computes the output - which components fire, how they connect, and what algorithm they implement.
So when Anthropic names its team simply “Interpretability,” that's not a land-grab - it's the broader umbrella. Mechanistic interpretability is the specific, ambitious subfield underneath it: not just “explain the behavior” but “reverse-engineer the circuit.”
Features
A feature is a concept the model represents internally - a direction in its activation space that lights up for some human-meaningful thing. There are features for “the Golden Gate Bridge,” for “Python code,” for “this text is in French,” for “the tone is angry.” If you could cleanly read a model's features as it ran, you'd have a live readout of what it's thinking about. The obstacle is the next idea.
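To make "a direction in activation space" concrete, here is a minimal sketch in plain Python. The 4-dimensional space, the `french_direction` vector, and both activation vectors are made up for illustration; real models have thousands of dimensions and the directions have to be discovered, not written by hand. The only idea shown is that a feature's strength can be read as a dot product between the activation and the feature's direction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical feature direction (say, "this text is in French")
# in a toy 4-dimensional activation space.
french_direction = normalize([1.0, 0.0, 1.0, 0.0])

def feature_activation(activation, direction):
    """How strongly the activation points along the feature direction."""
    return dot(activation, direction)

# Two made-up activation vectors: one aligned with the feature, one not.
act_french = [2.0, 0.1, 2.0, -0.1]
act_other = [0.1, 2.0, -0.1, 2.0]

strong = feature_activation(act_french, french_direction)
weak = feature_activation(act_other, french_direction)
print(round(strong, 3), round(weak, 3))
```

The first activation scores high on the feature and the second scores near zero, which is all "the feature lights up" means in this picture.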
Superposition & polysemanticity
You'd hope each neuron stood for one clean concept. It usually doesn't. A single neuron often fires for many unrelated things - that mixing is called polysemanticity. The leading explanation is superposition: the model has more concepts to represent than it has neurons, so it packs many features into overlapping, non-orthogonal (not perpendicular) directions in the same space. It's compression - and it's why you can't just point at a neuron and read its meaning.
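A toy version of superposition, as a sketch under stated assumptions: three hypothetical features are squeezed into a 2-dimensional space by spacing their directions 120 degrees apart. The geometry and the feature values are invented for illustration. The point is that with more features than dimensions the directions cannot all be perpendicular, so reading one feature picks up interference from the others, and that interference stays manageable only while few features are active at once.

```python
import math

# Three hypothetical features squeezed into a 2-dimensional space,
# spaced 120 degrees apart, so they cannot all be orthogonal.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embed(feature_values):
    """Superpose feature values into one 2-d activation vector."""
    x = sum(f * d[0] for f, d in zip(feature_values, directions))
    y = sum(f * d[1] for f, d in zip(feature_values, directions))
    return (x, y)

def read_out(activation):
    """Project the activation back onto each feature direction."""
    return [dot(activation, d) for d in directions]

# Only feature 0 is active: it reads back as about 1.0, and the other
# two show about -0.5 of interference (easy to threshold away).
sparse = read_out(embed([1.0, 0.0, 0.0]))

# Features 0 and 1 are both active: each reads back as only about 0.5,
# because they bleed into each other.
dense = read_out(embed([1.0, 1.0, 0.0]))

print([round(v, 2) for v in sparse])
print([round(v, 2) for v in dense])
```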
Sparse autoencoders
Superposition is the problem. Sparse autoencoders (SAEs) are the breakthrough that started cracking it. An SAE is a second, small network trained on the big model's activations. Its job: re-express a dense, superposed activation as a much larger but sparse set of features - thousands of dimensions or more, only a handful active at once - from which the original activation can still be reconstructed. Because it's overcomplete and sparse, it tends to un-mix superposition and surface features that are far more monosemantic - one feature, one concept.
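The shape of an SAE can be sketched in a few lines. This is a minimal illustration, not anyone's production implementation: the weights are set by hand rather than trained, the decoder is simply the transpose of the encoder (a tied-weights assumption), and the sizes (2-d activation, 3 features) and the `l1_coeff` value are arbitrary. What it does show is the usual recipe: a ReLU encoder into a wider feature space, a linear decoder back, and a loss that rewards accurate reconstruction while penalizing the number and size of active features.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical dictionary: 3 feature directions for a 2-d activation
# (more features than dimensions, i.e. overcomplete).
angles = [2 * math.pi * k / 3 for k in range(3)]
W_enc = [[math.cos(a), math.sin(a)] for a in angles]  # 3 x 2
b_enc = [0.0, 0.0, 0.0]
# Tied weights (an assumption): the decoder is the encoder transposed.
W_dec = [[W_enc[j][i] for j in range(3)] for i in range(2)]  # 2 x 3

def encode(activation):
    """Dense activation -> sparse, non-negative feature activations."""
    pre = [p + b for p, b in zip(matvec(W_enc, activation), b_enc)]
    return relu(pre)

def decode(features):
    """Sparse features -> reconstructed dense activation."""
    return matvec(W_dec, features)

def sae_loss(activation, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    features = encode(activation)
    recon = decode(features)
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

activation = [1.0, 0.0]
features = encode(activation)       # only the first feature is active
reconstruction = decode(features)   # recovers the original activation
print(features, reconstruction, round(sae_loss(activation), 3))
```

In a real SAE the encoder and decoder weights are learned by minimizing this kind of loss over a very large number of activations collected from the model.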
The famous demonstration: Anthropic researchers found a single feature in Claude 3 Sonnet meaning “the Golden Gate Bridge,” clamped it to a high value, and watched the model become obsessed with the bridge. That experiment was strong evidence that SAE features are real, causal handles on behavior - not just statistical curiosities.
Circuits
Features are the nouns; circuits are the verbs. A circuit is a connected subgraph of features and the connections between them that together implement a specific computation - the model's learned “code path” for one kind of task. Described in raw neurons, circuits are a mess; described in SAE features, they get clean enough to write down and verify.
Induction heads
The most famous circuit. An induction head implements a simple but powerful rule inside the attention layers: if the pattern A→B appeared earlier, and I just saw A again, predict B. It's pattern-completion, and it's a big part of how models do in-context learning - picking up a format from your prompt and continuing it without any retraining.
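The rule itself is simple enough to write as ordinary code. The sketch below is only the rule, not the mechanism: a real induction head is a learned attention pattern operating on vectors with soft matching, whereas this hypothetical `induction_predict` function does an exact lookup over a list of tokens.

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict whatever followed it. Returns None if the last
    token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the last one).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "A B" appeared earlier, and we just saw "A" again, so the rule predicts "B".
prediction = induction_predict(["A", "B", "C", "A"])
print(prediction)
```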
Activation steering
Once you can name a feature, you can push on it. Activation steering means adding or subtracting a feature's direction in the model's activations at runtime to change behavior - more cautious, more formal, more honest, more on-topic - with no retraining. This is where interpretability stops being purely diagnostic and becomes a control surface. It's also why it matters for safety: the same handle that steers a model toward “honest” can steer it away from its guardrails.
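Mechanically, steering is vector addition. Here is a minimal sketch, assuming you already have a feature direction (for example, from an SAE) and access to the model's activations at some layer. The 3-dimensional vectors, the `steer` helper, and the `strength` values are all hypothetical; in a real model this addition would happen inside a forward pass, typically through a hook on a chosen layer.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def steer(activation, direction, strength):
    """Add `strength` units of the (normalized) feature direction.
    Positive strength pushes toward the feature, negative pushes away."""
    unit = normalize(direction)
    return [a + strength * d for a, d in zip(activation, unit)]

# Made-up activation and feature direction.
activation = [0.5, -1.0, 2.0]
direction = [0.0, 3.0, 4.0]
unit = normalize(direction)

before = dot(activation, unit)                       # feature strength: 1.0
boosted = dot(steer(activation, direction, 4.0), unit)      # 1.0 + 4.0
suppressed = dot(steer(activation, direction, -1.0), unit)  # 1.0 - 1.0
print(round(before, 3), round(boosted, 3), round(suppressed, 3))
```

Because the direction is normalized, `strength` shifts the feature's readout by exactly that amount and leaves every perpendicular direction untouched.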
Why builders should care
You might be building on a closed API, not training models. Here's why it's still your problem:
- It's the vocabulary of trust. “Aligned,” “steerable,” “monitored” all cash out in interpretability work. Knowing the terms lets you ask sharper questions.
- The techniques are leaking into tooling. Feature monitoring and steering are becoming product features. The builders who understand them first will use them first.
- It reframes debugging. You start asking what the system is actually representing instead of just retrying the prompt - and that mindset makes agents reliable.
One honest line we hold everywhere on this site: if you're building on a closed API you can't do true mechanistic interpretability on it - you don't have the weights. What you can do is understand the system around it: tools, prompts, retrieval, permissions, and behavior. We cover both layers, and we always say which one you're getting.