24 Introduction to Mechanistic Interpretability
24.1 Overview
Mechanistic interpretability is the study of reverse-engineering neural networks to understand, in precise mechanistic terms, what computations they perform and why. For AI safety, it is a foundational discipline: if we can understand how a model implements its behaviors, we can identify whether those behaviors are safe, detect deception, and intervene more precisely than is possible through behavioral testing alone.
This chapter introduces the core concepts of mechanistic interpretability, the research program it represents, and its relationship to AI safety and red teaming. The treatment here is inspired by the ARENA (Alignment Research Engineer Accelerator) curriculum (ARENA Team 2024), which provides hands-on training in transformer internals and interpretability methods.
24.2 Learning Objectives
- Understand the motivation for mechanistic interpretability from an AI safety perspective.
- Define circuits, features, and representations in the context of neural networks.
- Describe the transformer architecture at the residual stream level.
- Understand superposition and its implications for interpretability.
- Apply activation patching to identify causal components in a model’s computation.
24.3 1. Why Mechanistic Interpretability for Safety?
Behavioral evaluation (including red teaming) tells us what a model does. Mechanistic interpretability tells us how and why. The safety-relevant gap between these is large:
- A model may pass all behavioral tests while harboring an internal representation that would produce dangerous behavior in a distribution-shifted context (see ?sec-deceptive-alignment).
- A model may have “learned” safety constraints as a surface pattern rather than a deep internalized value, making those constraints brittle under adversarial pressure.
- Understanding the internal mechanisms of refusal enables us to predict when refusal will fail — before it fails in deployment.
Core motivation. If we can identify the circuits responsible for safe behavior, we can verify that they are robust, causally active, and not subject to adversarial bypass. Behavioral evaluation alone cannot provide this guarantee.
24.3.1 1.1 The Interpretability Research Program
Olah et al. (2020) articulated a research vision: that neural networks are, in principle, fully reverse-engineerable — that every computation they perform can be understood in terms of human-interpretable algorithms implemented in the weights. Progress toward this goal includes:
- Feature identification: what concepts does the network represent?
- Circuit identification: which components implement which computations?
- Universality: do the same circuits appear across different models and tasks?
24.4 2. Transformers as Residual Streams
24.4.1 2.1 Residual Stream Architecture
The ARENA curriculum (ARENA Team 2024) frames transformer interpretation around the residual stream: the per-position vector that is initialized from the token embedding and carried through the model, with each layer adding its contribution.
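\[ \tilde{x}^{(l)} = x^{(l)} + \text{Attn}^{(l)}(x^{(l)}), \qquad x^{(l+1)} = \tilde{x}^{(l)} + \text{MLP}^{(l)}(\tilde{x}^{(l)}) \]
Here \(\tilde{x}^{(l)}\) is the stream after the attention sublayer has written to it; in standard sequential architectures the MLP reads this updated stream rather than \(x^{(l)}\). Layer normalization is omitted for clarity.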
Key interpretability insight: components read from and write to the residual stream. Understanding what each component reads and writes is the core task of mechanistic interpretability.
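The read/write picture can be made concrete with a minimal NumPy sketch of one transformer block. It assumes the sequential convention (the MLP reads the stream after attention has written to it), a single causal attention head, and no layer normalization; all names, sizes, and random weights here are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_mlp, seq_len = 8, 4, 32, 5  # illustrative sizes

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, W_q, W_k, W_v, W_o):
    """One causal head: reads the stream x, returns a vector to add back."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_head)
    # Mask out future positions so each token only attends backwards.
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ W_o

def mlp(x, W_in, W_out):
    """Position-wise MLP: reads the stream, returns a vector to add back."""
    return np.maximum(x @ W_in, 0.0) @ W_out

def block(x, params):
    attn_out = attention(x, *params["attn"])  # read x, compute the write
    x_mid = x + attn_out                      # attention writes to the stream
    mlp_out = mlp(x_mid, *params["mlp"])      # MLP reads the updated stream
    return x_mid + mlp_out, attn_out, mlp_out # MLP writes to the stream

params = {
    "attn": [rng.normal(scale=0.3, size=s)
             for s in [(d_model, d_head)] * 3 + [(d_head, d_model)]],
    "mlp": [rng.normal(scale=0.3, size=(d_model, d_mlp)),
            rng.normal(scale=0.3, size=(d_mlp, d_model))],
}
x0 = rng.normal(size=(seq_len, d_model))
x1, attn_out, mlp_out = block(x0, params)
```

Because every component only adds to the stream, the block output decomposes exactly as `x0 + attn_out + mlp_out`, which is what makes per-component attribution possible.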
24.4.2 2.2 Attention Heads as Edge Detectors
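Each attention head can be understood as performing a specific operation on the residual stream. Examples from the literature include:
- Induction heads (Elhage et al. 2021): continue patterns seen earlier in the context by attending to the token that followed a previous occurrence of the current token
- Name movers (Wang et al. 2022): copy the indirect object's name to the final token position for output
- Negative name movers (Wang et al. 2022): write in the opposite direction to the name movers, reducing confidence in the copied name
- S-inhibition heads (Wang et al. 2022): suppress attention to the repeated subject, so that name movers attend to the other name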
24.4.3 2.3 MLPs as Key-Value Memories
Geva et al. (2021) show that MLP layers function as key-value memories: each neuron activates on a “key” pattern in the input and contributes a “value” to the residual stream. This framing makes MLP layers interpretable as storing factual associations.
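A small sketch of the key-value reading, assuming a one-hidden-layer ReLU MLP without biases: the rows of the input matrix act as keys, the rows of the output matrix act as values, and the MLP output is a sum of values weighted by how strongly each key matched. The matrices below are random placeholders, not learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 6, 10  # illustrative sizes

keys = rng.normal(size=(d_mlp, d_model))    # row i: input pattern neuron i responds to
values = rng.normal(size=(d_mlp, d_model))  # row i: vector neuron i writes to the stream

def mlp_matrix_form(x):
    """The usual view: two matrix multiplies around a ReLU."""
    return np.maximum(keys @ x, 0.0) @ values

def mlp_memory_form(x):
    """The key-value view: each neuron contributes coeff_i * value_i."""
    out = np.zeros(d_model)
    for key, value in zip(keys, values):
        coeff = max(float(key @ x), 0.0)  # how strongly x matches this key
        out += coeff * value
    return out

x = rng.normal(size=d_model)
```

The two functions compute the same output; the second simply makes explicit which neurons fired and what each one wrote.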
24.5 3. Features and Superposition
24.5.1 3.1 Linear Representations
A central finding of mechanistic interpretability is that models represent concepts as linear directions in activation space. A “feature” is a direction in the residual stream that corresponds to a human-interpretable concept.
Examples from Templeton et al. (2024) (Anthropic’s work on Claude Sonnet):
- Directions corresponding to specific tokens, concepts, or emotional valences
- Directions that causally mediate model outputs when ablated or amplified
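Under the linear-representation view, reading a feature is a dot product and amplifying it is a vector addition. The sketch below illustrates both on random vectors; `direction` stands in for a feature direction that would, in practice, come from a probe or a sparse autoencoder, and the function names are hypothetical.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def feature_activation(x, direction):
    """Scalar readout: how far x extends along the feature direction."""
    return float(x @ unit(direction))

def amplify(x, direction, alpha):
    """Steer the activation by adding alpha units of the feature direction."""
    return x + alpha * unit(direction)

rng = np.random.default_rng(0)
x = rng.normal(size=16)          # placeholder residual-stream activation
direction = rng.normal(size=16)  # placeholder feature direction

before = feature_activation(x, direction)
after = feature_activation(amplify(x, direction, 4.0), direction)
```

Amplifying by `alpha` raises the readout by exactly `alpha`, while components of `x` orthogonal to the direction are left unchanged.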
24.5.2 3.2 Superposition
A major obstacle to interpretability is superposition (Elhage et al. 2022): neural networks represent more features than they have dimensions by packing multiple features into the same dimensional space. This is possible because:
- Features that rarely co-occur can be superposed with low interference
- Sparse, distributed coding in the brain is often cited as an analogous strategy
- It allows models to represent much richer feature sets than dimensionality would naively allow
Implication for interpretability: individual neurons do not correspond to individual features. Instead, neurons are polysemantic: each one responds to several unrelated concepts. This makes direct neuron-level interpretation unreliable.
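A toy illustration of superposition, assuming features are assigned random unit directions (a simplification; trained models learn their own geometry). With more features than dimensions the directions cannot all be orthogonal, so reading one feature picks up a small amount of every other feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 50, 20  # more features than dimensions

# One unit-norm direction per feature (random, for illustration only).
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def embed(feature_values):
    """Pack a feature vector into the lower-dimensional activation space."""
    return feature_values @ directions

def read_out(x):
    """Dot-product readout of every feature from the activation."""
    return directions @ x

# Interference: largest overlap between any two distinct feature directions.
gram = directions @ directions.T
interference = float(np.abs(gram - np.eye(n_features)).max())

# A sparse input with a single active feature is still recoverable.
f = np.zeros(n_features)
f[7] = 1.0
readout = read_out(embed(f))
```

With one active feature, the true feature reads out at 1 and every other readout is a smaller interference term. As more features become active at once, these terms add up, which is why sparsity matters.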
24.5.3 3.3 Sparse Autoencoders (SAEs)
To overcome superposition, Cunningham et al. (2023) and Bricken et al. (2023) trained sparse autoencoders on language-model activations: networks that learn a sparse, overcomplete basis in which the superposed features are disentangled.
\[ x \approx W_d \cdot \text{ReLU}(W_e x + b_e) + b_d \]
where \(W_e, b_e\) are the encoder weights and bias, \(W_d, b_d\) are the decoder weights and bias, and the hidden layer \(\text{ReLU}(W_e x + b_e)\) is constrained to be sparse. SAEs have revealed thousands of interpretable features in LLMs, including:
- Features for specific named entities, locations, time periods
- Features for emotional states, syntactic roles, and abstract concepts
- Features tied to safety-relevant behaviors (Anthropic’s work on Claude)
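The forward pass in the equation above can be sketched as follows. The L1 penalty on the hidden activations is one common way to encourage sparsity and is an assumption here, as are the sizes, the coefficient, and the random (untrained) weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # overcomplete: more hidden units than input dimensions

W_e = rng.normal(scale=0.1, size=(d_sae, d_model))
b_e = np.zeros(d_sae)
W_d = rng.normal(scale=0.1, size=(d_model, d_sae))
b_d = np.zeros(d_model)

def sae_forward(x):
    """Encode to non-negative feature activations, then reconstruct x."""
    f = np.maximum(W_e @ x + b_e, 0.0)
    x_hat = W_d @ f + b_d
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed form)."""
    reconstruction = float(np.sum((x - x_hat) ** 2))
    sparsity = l1_coeff * float(np.sum(np.abs(f)))
    return reconstruction + sparsity, reconstruction

x = rng.normal(size=d_model)  # placeholder activation vector
f, x_hat = sae_forward(x)
loss, reconstruction = sae_loss(x, f, x_hat)
```

Training would minimize `loss` over many activation vectors; each column of `W_d` then serves as a candidate feature direction.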
24.6 4. Circuits
24.6.1 4.1 Circuit Definition
A circuit is a subgraph of the neural network that implements a specific computation. Circuits consist of:
- A set of attention heads and MLP layers
- The residual stream connections between them
- A causal account of how they jointly implement a behavior
24.6.2 4.2 The Indirect Object Identification Circuit
Wang et al. (2022) identified the indirect object identification (IOI) circuit in GPT-2 small: a set of 26 attention heads responsible for correctly completing sentences like “When Mary and John went to the store, John gave a drink to ___” (→ “Mary”).
This work demonstrated that:
- Complex model behaviors can be largely explained by small subsets of components
- Circuits can be validated via ablation (knocking out components, e.g. by zeroing or mean-replacing their outputs) and activation patching
- The same circuit generalizes across prompts instantiating the same task
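Circuit validation needs a scalar metric and a way to knock components out. The sketch below shows a logit-difference metric in the spirit of the IOI analysis (correct name versus distractor name) and two ablation styles; the array layout `(prompts, heads, d_model)`, the token ids, and the function names are assumptions for illustration.

```python
import numpy as np

def logit_diff(logits, correct_id, distractor_id):
    """Positive when the model prefers the correct token over the distractor."""
    return float(logits[correct_id] - logits[distractor_id])

def zero_ablate(head_outputs, head):
    """Replace one head's output with zeros on every prompt."""
    ablated = head_outputs.copy()
    ablated[:, head] = 0.0
    return ablated

def mean_ablate(head_outputs, head):
    """Replace one head's output with its mean over the reference prompts."""
    ablated = head_outputs.copy()
    ablated[:, head] = head_outputs[:, head].mean(axis=0)
    return ablated

rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(32, 4, 8))  # placeholder (prompts, heads, d_model)
zeroed = zero_ablate(head_outputs, head=2)
averaged = mean_ablate(head_outputs, head=2)
```

Mean ablation removes the prompt-specific information a head carries while keeping its typical output, which tends to move the model less far from its normal operating regime than zeroing does.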
24.6.3 4.3 Safety-Relevant Circuits
For AI safety, the most important circuits are those implementing:
- Refusal: what components detect harmful requests and activate refusal behavior?
- Deception: are there circuits that suppress accurate self-report while maintaining a deceptive façade?
- Goal representation: how are the model’s objectives encoded and retrieved?
Arditi et al. (2024) identify a refusal direction in the residual stream of instruction-tuned models: a linear direction whose presence predicts refusal, and whose ablation causes the model to comply with harmful requests. This has direct implications for jailbreak attacks (see ?sec-whitebox-attacks).
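A minimal sketch of how such a direction can be estimated and removed, using a difference of mean activations between two prompt sets followed by projecting the direction out. The activations below are synthetic stand-ins (Gaussian noise plus a planted direction), not real model data, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompts = 16, 200

# Synthetic activations: "harmful" prompts carry an extra planted direction.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(n_prompts, d_model))
harmful = rng.normal(size=(n_prompts, d_model)) + 3.0 * true_dir

def difference_of_means(pos, neg):
    """Unit vector pointing from the mean of neg to the mean of pos."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, unit_dir):
    """Remove each activation's component along unit_dir."""
    return acts - np.outer(acts @ unit_dir, unit_dir)

r_hat = difference_of_means(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
```

After ablation every activation has zero component along `r_hat`. In a real model, the analogous intervention would be applied to the residual stream during the forward pass.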
24.7 5. Activation Patching
Activation patching (a variant of which Meng et al. 2022 call causal tracing) is the core experimental method for identifying circuits.
24.7.1 5.1 Method
- Run the model on a clean prompt → record activations at every position and component
- Run the model on a corrupted prompt (e.g., with a key entity replaced) → record activations
- For each component, run a patched forward pass: run the model on the corrupted prompt, but replace the target component's activation with the one recorded from the clean run
- Measure the effect on the output: if patching component \(c\) restores the clean output, then \(c\) carries information that is causally important for the behavior
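The procedure can be sketched on a toy residual model. This sketch uses the denoising convention: run on the corrupted input, patch in one clean activation, and report the fraction of the clean-versus-corrupted output gap that is recovered. The model, its component names, and the random inputs are all hypothetical stand-ins for a real transformer and real prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 3
layers = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
unembed = rng.normal(size=d_model)

def run(x, patch=None):
    """Forward pass returning a scalar output and a cache of activations.

    patch maps a component name to an activation that overrides the one
    the model would otherwise compute at that component.
    """
    patch = patch or {}
    cache = {}
    resid = patch.get("embed", x)
    cache["embed"] = resid
    for i, W in enumerate(layers):
        name = f"layer{i}"
        out = patch.get(name, np.tanh(W @ resid))  # component reads the stream
        cache[name] = out
        resid = resid + out                        # and writes back to it
    return float(unembed @ resid), cache

clean_x = rng.normal(size=d_model)
corrupt_x = rng.normal(size=d_model)
clean_out, clean_cache = run(clean_x)
corrupt_out, corrupt_cache = run(corrupt_x)

# Patch each component's clean activation into the corrupted run.
effects = {}
for name in clean_cache:
    patched_out, _ = run(corrupt_x, patch={name: clean_cache[name]})
    effects[name] = (patched_out - corrupt_out) / (clean_out - corrupt_out)
```

An effect near 1 means the component alone restores the clean behavior, and an effect near 0 means it carries little of the relevant information. Patching `"embed"` recovers the clean output exactly here, because everything downstream is recomputed from it.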
24.7.2 5.2 Activation Patching and Red Teaming
Activation patching can be used to:
- Identify which components are bypassed by successful jailbreaks
- Localize where safety training has modified model behavior
- Understand why adversarial suffixes (e.g., GCG) succeed, by identifying which circuit activations they disrupt
24.7.3 5.3 ROME and MEMIT
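Meng et al. (2022) use causal tracing to show that factual associations are localized in mid-layer MLPs at the final subject token, then introduce ROME (Rank-One Model Editing) to surgically modify those associations with a rank-one update to a single MLP weight matrix. MEMIT extends the same approach to edit many associations at once by spreading the update across several MLP layers. This is the interpretability → model editing pipeline.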
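The core idea of a rank-one edit can be sketched as follows: change a weight matrix so that one chosen key vector maps to a new value, while inputs orthogonal to that key are unaffected. This is a simplified version that treats all key directions equally; ROME as published additionally weights the update by statistics of previously seen keys, which is omitted here. The matrix and vectors are random placeholders.

```python
import numpy as np

def rank_one_edit(W, key, new_value):
    """Return W' with W' @ key == new_value, differing from W by a rank-one term."""
    error = new_value - W @ key              # what the current weights get wrong
    return W + np.outer(error, key) / (key @ key)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 10))      # placeholder MLP output matrix
key = rng.normal(size=10)         # placeholder representation of the subject
new_value = rng.normal(size=6)    # placeholder encoding of the new fact

W_edited = rank_one_edit(W, key, new_value)
```

After the edit, `W_edited @ key` equals `new_value`, and any input orthogonal to `key` produces the same output as before.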
24.8 6. Mechanistic Interpretability and the ARENA Curriculum
The ARENA course (ARENA Team 2024) provides hands-on implementation experience with:
| ARENA Module | Interpretability Topics |
|---|---|
| Module 1 | Transformer architecture from scratch, residual stream |
| Module 2 | Attention heads, induction heads, in-context learning |
| Module 3 | Sparse autoencoders, feature visualization |
| Module 4 | Safety evals, capability elicitation, interpretability for alignment |
ARENA’s approach is to build understanding from first principles — implementing transformers, attention, and interpretability tools from scratch before using pre-built libraries.
Recommended resources from ARENA (ARENA Team 2024):
- A Mathematical Framework for Transformer Circuits (Elhage et al. 2021)
- Toy Models of Superposition (Elhage et al. 2022)
- Interpretability in the Wild (Wang et al. 2022)
- Towards Monosemanticity (Bricken et al. 2023)
24.9 7. Relationship to AI Safety
Mechanistic interpretability contributes to AI safety along several axes:
| Safety Problem | Interpretability Contribution |
|---|---|
| Deceptive alignment | Detect circuits that suppress honest self-report |
| Goal misgeneralization | Identify goal representations; check for distribution-shifted behavior |
| Jailbreak robustness | Identify refusal circuits; harden them against targeted ablation |
| Oversight scalability | Enable automated circuit-level auditing of model behavior |
| Interpretable monitoring | Flag anomalous activation patterns in deployment |
24.10 Summary
- Mechanistic interpretability reverse-engineers neural networks to understand their internal computations.
- The residual stream framework decomposes transformer computation into readable, writable contributions from attention heads and MLPs.
- Features are linear directions in activation space; superposition makes direct neuron interpretation unreliable.
- Sparse autoencoders disentangle superposed features into interpretable monosemantic directions.
- Circuits are causal subgraphs implementing specific behaviors; activation patching identifies them.
- ARENA provides hands-on training in transformer internals and interpretability methods.
- For safety, the most important circuits are those implementing refusal, deception, and goal representation.
24.11 Further Reading
- Elhage et al. (2021) — A mathematical framework for transformer circuits
- Elhage et al. (2022) — Toy models of superposition
- Wang et al. (2022) — Interpretability in the wild (IOI circuit)
- Bricken et al. (2023) — Towards monosemanticity (SAEs)
- Templeton et al. (2024) — Scaling monosemanticity
- Arditi et al. (2024) — Refusal in LLMs is mediated by a single direction
- ARENA Team (2024) — ARENA course, Modules 1–4