24 Introduction to Mechanistic Interpretability

24.1 Overview

Mechanistic interpretability is the study of reverse-engineering neural networks to understand, in precise mechanistic terms, what computations they perform and why. For AI safety, it is a foundational discipline: if we can understand how a model implements its behaviors, we can identify whether those behaviors are safe, detect deception, and intervene more precisely than is possible through behavioral testing alone.

This chapter introduces the core concepts of mechanistic interpretability, the research program it represents, and its relationship to AI safety and red teaming. The treatment here is inspired by the ARENA (Alignment Research Engineer Accelerator) curriculum (ARENA Team 2024), which provides hands-on training in transformer internals and interpretability methods.

24.2 Learning Objectives

Understand the motivation for mechanistic interpretability from an AI safety perspective.
Define circuits, features, and representations in the context of neural networks.
Describe the transformer architecture at the residual stream level.
Understand superposition and its implications for interpretability.
Apply activation patching to identify causal components in a model’s computation.

24.3 1. Why Mechanistic Interpretability for Safety?

Behavioral evaluation (including red teaming) tells us what a model does. Mechanistic interpretability tells us how and why. The safety-relevant gap between these is large:

A model may pass all behavioral tests while harboring an internal representation that would produce dangerous behavior in a distribution-shifted context (see ?sec-deceptive-alignment).
A model may have “learned” safety constraints as a surface pattern rather than a deep internalized value, making those constraints brittle under adversarial pressure.
Understanding the internal mechanisms of refusal enables us to predict when refusal will fail — before it fails in deployment.

Note

Core motivation. If we can identify the circuits responsible for safe behavior, we can verify that they are robust, causally active, and not subject to adversarial bypass. Behavioral evaluation alone cannot provide this guarantee.

24.3.1 1.1 The Interpretability Research Program

Olah et al. (2020) articulated a research vision: that neural networks are, in principle, fully reverse-engineerable, so that every computation they perform can be understood in terms of human-interpretable algorithms implemented in the weights. The program is organized around three questions:

Feature identification: what concepts does the network represent?
Circuit identification: which components implement which computations?
Universality: do the same circuits appear across different models and tasks?

24.4 2. Transformers as Residual Streams

24.4.1 2.1 Residual Stream Architecture

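The ARENA curriculum (ARENA Team 2024) frames transformer interpretation around the residual stream: a vector at each token position, initialized from the token embeddings, that flows through the model, with each layer adding its contribution. In a standard sequential transformer block such as GPT-2's, the MLP reads the stream after the attention output has been added (layer normalization is omitted here for clarity):

\[ \tilde{x}^{(l)} = x^{(l)} + \text{Attn}^{(l)}(x^{(l)}), \qquad x^{(l+1)} = \tilde{x}^{(l)} + \text{MLP}^{(l)}(\tilde{x}^{(l)}) \]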

Key interpretability insight: components read from and write to the residual stream. Understanding what each component reads and writes is the core task of mechanistic interpretability.
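Because every component only adds to the stream, the final residual vector is exactly the embedding plus the sum of all component outputs. The following is a minimal NumPy sketch of that decomposition, not a real transformer: `toy_attn` and `toy_mlp` are hypothetical stand-ins (uniform causal averaging and a ReLU MLP), the weights are random, layer normalization is omitted, and the sequential ordering (MLP reads the stream after attention has written to it) is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 4, 8, 2

def toy_attn(x, W):
    # Stand-in for attention: uniform causal averaging over positions, then a linear map.
    mask = np.tril(np.ones((len(x), len(x))))
    pattern = mask / mask.sum(axis=1, keepdims=True)
    return pattern @ x @ W

def toy_mlp(x, W_in, W_out):
    # Stand-in for an MLP layer: linear, ReLU, linear.
    return np.maximum(x @ W_in, 0.0) @ W_out

embed = rng.normal(size=(n_tokens, d_model))
x = embed.copy()
contributions = [embed]  # everything that has been written to the stream

for _ in range(n_layers):
    W_attn = rng.normal(scale=0.1, size=(d_model, d_model))
    W_in = rng.normal(scale=0.1, size=(d_model, 4 * d_model))
    W_out = rng.normal(scale=0.1, size=(4 * d_model, d_model))

    attn_out = toy_attn(x, W_attn)      # reads the current stream
    x = x + attn_out                    # writes back by addition
    mlp_out = toy_mlp(x, W_in, W_out)   # reads the post-attention stream
    x = x + mlp_out
    contributions += [attn_out, mlp_out]

final_resid = x
# The stream is the sum of the embedding and every component's output.
print(np.allclose(final_resid, sum(contributions)))
```

This additive structure is what makes it meaningful to ask how much a single head or MLP layer contributed to a given output.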

24.4.2 2.2 Attention Heads as Edge Detectors

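Each attention head can be understood as performing a specific operation on the residual stream: it reads information at some token positions and writes it to others. Examples identified by Elhage et al. (2021) and Wang et al. (2022) include:

Induction heads: continue patterns seen earlier in the context (after [A][B] ... [A], predict [B])
Name movers: copy the indirect-object name to the final token position for output
Negative name movers: write against the name movers' output, lowering confidence in the predicted name
S-inhibition heads: suppress name-mover attention to the repeated subject token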
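The induction pattern can be stated as a plain algorithm, independent of how a model implements it in its weights. The sketch below is only that behavioral rule (find the most recent earlier occurrence of the current token and predict whatever followed it); the function name `induction_prediction` is hypothetical and nothing here models attention itself.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions only.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["Mr", "Dursley", "was", "Mr"]))  # Dursley
print(induction_prediction(["the", "cat", "sat"]))           # None
```

In a real model this behavior is typically attributed to a pair of heads working together across layers; the sketch captures only the input-output rule they jointly compute.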

24.4.3 2.3 MLPs as Key-Value Memories

Geva et al. (2021) show that MLP layers function as key-value memories: each neuron activates on a “key” pattern in the input and contributes a “value” to the residual stream. This framing makes MLP layers interpretable as storing factual associations.
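The key-value reading amounts to rewriting the usual MLP matrix product as a weighted sum of per-neuron value vectors. A minimal sketch, assuming a ReLU nonlinearity, no biases, and random weights (real models such as GPT-2 use GELU and learned parameters); the names `W_in`, `W_out`, and `coeffs` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 6, 10

W_in = rng.normal(size=(d_mlp, d_model))   # row i: the "key" of neuron i
W_out = rng.normal(size=(d_mlp, d_model))  # row i: the "value" of neuron i
x = rng.normal(size=d_model)               # residual stream vector at one position

# How strongly the input matches each key (the memory coefficients).
coeffs = np.maximum(W_in @ x, 0.0)

# Standard matrix form of the MLP output.
mlp_out = coeffs @ W_out

# The same output, written as a coefficient-weighted sum of value vectors.
mlp_out_as_memory = sum(c * v for c, v in zip(coeffs, W_out))

top_neuron = int(np.argmax(coeffs))  # the best-matching key for this input
print(np.allclose(mlp_out, mlp_out_as_memory), top_neuron)
```

Interpreting a neuron then means asking two questions: which inputs make its coefficient large, and what its value vector does when added to the residual stream.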

24.5 3. Features and Superposition

24.5.1 3.1 Linear Representations

A central working hypothesis of mechanistic interpretability, supported by a growing body of evidence, is that models represent many concepts as linear directions in activation space. A “feature” is a direction in the residual stream that corresponds to a human-interpretable concept.

Examples from Templeton et al. (2024) (Anthropic’s work on Claude Sonnet):

Directions corresponding to specific tokens, concepts, or emotional valences
Directions that causally mediate model outputs when ablated or amplified
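If a feature is a direction, then reading it is a dot product and amplifying it is vector addition. The sketch below illustrates both on random vectors; `feature_activation` and `steer` are hypothetical helper names, and the direction is random rather than taken from any model.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def feature_activation(x, direction):
    # Read the feature: project the activation onto the unit feature direction.
    return float(x @ unit(direction))

def steer(x, direction, alpha):
    # Amplify the feature: add alpha units of the direction to the activation.
    return x + alpha * unit(direction)

rng = np.random.default_rng(0)
direction = rng.normal(size=16)  # stand-in for a learned feature direction
x = rng.normal(size=16)          # stand-in for a residual stream vector

before = feature_activation(x, direction)
after = feature_activation(steer(x, direction, alpha=4.0), direction)
print(round(after - before, 6))  # the activation shifts by exactly alpha
```

Whether such an intervention changes model behavior is an empirical question that has to be tested by running the model with the modified activation.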

24.5.2 3.2 Superposition

A major obstacle to interpretability is superposition (Elhage et al. 2022): neural networks represent more features than they have dimensions by assigning features to directions that are not orthogonal to one another. Several observations explain why this works and why models do it:

Features that rarely co-occur can be superposed with low interference
Sparse, distributed coding in the brain offers a loose analogy
Superposition lets a model represent a much richer feature set than its dimensionality would naively allow

Implication for interpretability: individual neurons do not correspond to individual features. Instead, neurons are polysemantic (each responds to multiple unrelated concepts), which makes direct neuron-level interpretation unreliable.
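A small geometric example makes the trade-off concrete. The sketch below places five features in a two-dimensional space at the vertices of a regular pentagon, a configuration of the kind studied in Elhage et al. (2022); the specific arrangement, the helper names, and the inputs are illustrative assumptions, not a trained model.

```python
import numpy as np

n_features = 5
angles = 2 * np.pi * np.arange(n_features) / n_features
# Row i is the direction assigned to feature i in a 2-dimensional space.
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (5, 2)

def embed(feature_values):
    # Superpose the active features into a 2-dimensional activation.
    return feature_values @ W

def read_out(x):
    # Project the activation back onto every feature direction.
    return W @ x

sparse = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one feature active
dense = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # two features active together

sparse_readout = read_out(embed(sparse))
dense_readout = read_out(embed(dense))

# "Neuron 0" is the first coordinate of the activation; this column holds
# its weight on each of the five features.
neuron0_weights = W[:, 0]

print(np.round(sparse_readout, 3))   # feature 0 reads 1; the rest is interference
print(np.round(dense_readout, 3))    # features 0 and 1 now both read above 1
print(np.round(neuron0_weights, 3))  # nonzero for all five features
```

With a single active feature, the true feature reads exactly 1 and the others see only interference terms of cos(72°) and cos(144°), which a thresholding nonlinearity can filter out. When two features are active at once, the interference adds to the true signal, which is why superposition relies on sparsity. The last line shows polysemanticity directly: one coordinate carries weight from every feature.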

24.5.3 3.3 Sparse Autoencoders (SAEs)

To overcome superposition, Cunningham et al. (2023) and Bricken et al. (2023) introduced sparse autoencoders: networks trained to find a sparse, overcomplete basis that disentangles the superposed features.

\[ x \approx W_d \cdot \text{ReLU}(W_e x + b_e) + b_d \]

where \(W_e, b_e\) and \(W_d, b_d\) are the encoder and decoder weights and biases, and the hidden layer is encouraged to be sparse, typically through an L1 penalty on its activations. SAEs have revealed thousands of interpretable features in LLMs, including:

Features for specific named entities, locations, and time periods
Features for emotional states, syntactic roles, and abstract concepts
Features linked to safety-relevant behaviors (Anthropic’s work on Claude)
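The reconstruction formula and the sparsity pressure can be written out as a forward pass and a loss. This is a minimal sketch with random, untrained weights: the dimensions, the `l1_coeff` value, and the function names are assumptions for illustration, and published SAEs add details (such as decoder normalization) that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # overcomplete: more hidden units than input dimensions

W_e = rng.normal(scale=0.1, size=(d_sae, d_model))  # encoder weights
b_e = np.zeros(d_sae)
W_d = rng.normal(scale=0.1, size=(d_model, d_sae))  # decoder weights (the dictionary)
b_d = np.zeros(d_model)

def sae_forward(x):
    f = np.maximum(W_e @ x + b_e, 0.0)  # feature activations, ReLU(W_e x + b_e)
    x_hat = W_d @ f + b_d               # reconstruction of the input
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    f, x_hat = sae_forward(x)
    reconstruction = np.sum((x - x_hat) ** 2)  # how well x is reconstructed
    sparsity = np.sum(np.abs(f))               # L1 penalty pushing f toward sparsity
    return reconstruction + l1_coeff * sparsity, reconstruction

x = rng.normal(size=d_model)  # stand-in for a residual stream activation
f, x_hat = sae_forward(x)
loss, reconstruction = sae_loss(x)
print(f.shape, x_hat.shape, loss >= reconstruction)
```

After training, each column of `W_d` is a candidate feature direction, and the corresponding entry of `f` says how strongly that feature is present in a given activation.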

24.6 4. Circuits

24.6.1 4.1 Circuit Definition

A circuit is a subgraph of the neural network that implements a specific computation. Circuits consist of:

A set of attention heads and MLP layers
The residual stream connections between them
A causal account of how they jointly implement a behavior

24.6.2 4.2 The Indirect Object Identification Circuit

Wang et al. (2022) identified the indirect object identification (IOI) circuit in GPT-2 small: a set of 26 attention heads, grouped into seven functional classes, responsible for correctly completing sentences like “When Mary and John went to the store, John gave a drink to ___” (→ “Mary”).

This work demonstrated that:

A complex model behavior can be largely explained by a small subset of components
Circuits can be validated via ablation (knocking out components, for example by replacing their outputs with mean activations) and activation patching
The same circuit generalizes across prompts instantiating the same task
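Validation experiments of this kind need a scalar measure of task performance. Wang et al. (2022) use the logit difference between the indirect object and the subject; the sketch below computes it on made-up logits over a three-word toy vocabulary. All numbers and the `vocab` mapping are hypothetical and chosen only to show the calculation.

```python
import numpy as np

def logit_diff(logits, io_token, s_token):
    # Positive values mean the model prefers the indirect object over the subject.
    return float(logits[io_token] - logits[s_token])

vocab = {"Mary": 0, "John": 1, "store": 2}  # toy vocabulary (hypothetical)

# Made-up final-position logits for "... John gave a drink to ___".
clean_logits = np.array([5.0, 2.0, 0.5])    # full model
ablated_logits = np.array([3.0, 2.5, 0.5])  # after knocking out some heads

clean_ld = logit_diff(clean_logits, vocab["Mary"], vocab["John"])
ablated_ld = logit_diff(ablated_logits, vocab["Mary"], vocab["John"])

# Fraction of the original preference that survives the ablation.
retained = ablated_ld / clean_ld
print(clean_ld, ablated_ld, round(retained, 3))
```

A large drop in logit difference after ablating a set of heads is evidence that those heads matter for the task; a small drop suggests they do not, or that other components compensate.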

24.6.3 4.3 Safety-Relevant Circuits

For AI safety, the most important circuits are those implementing:

Refusal: what components detect harmful requests and activate refusal behavior?
Deception: are there circuits that suppress accurate self-report while maintaining a deceptive façade?
Goal representation: how are the model’s objectives encoded and retrieved?

Arditi et al. (2024) identify a refusal direction in the residual stream of instruction-tuned models: a linear direction whose presence predicts refusal, and whose ablation causes the model to comply with harmful requests. This has direct implications for jailbreak attacks (see ?sec-whitebox-attacks).
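Arditi et al. (2024) estimate the direction as a difference of mean activations between harmful and harmless prompts, and ablate it by projecting it out of the residual stream. The sketch below reproduces those two steps on synthetic data only: the activations are random vectors with an artificial offset along a planted direction (`planted_dir`), so it illustrates the linear algebra rather than any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompts = 16, 50

# Synthetic stand-ins for residual stream activations (not from a real model).
planted_dir = np.zeros(d_model)
planted_dir[0] = 1.0
harmful_acts = rng.normal(size=(n_prompts, d_model)) + 3.0 * planted_dir
harmless_acts = rng.normal(size=(n_prompts, d_model))

# Step 1: difference-of-means estimate of the direction, normalized to unit length.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r_hat = diff / np.linalg.norm(diff)

def ablate_direction(acts, r_hat):
    # Step 2: remove each activation's component along r_hat: x - (x . r) r
    return acts - np.outer(acts @ r_hat, r_hat)

ablated = ablate_direction(harmful_acts, r_hat)

print(round(float(r_hat @ planted_dir), 3))  # close to 1: the planted direction is recovered
print(np.allclose(ablated @ r_hat, 0.0))     # nothing is left along r_hat
```

The same projection is what a defender would monitor: a sharp drop in the component along the estimated direction is a signal that the refusal mechanism may have been bypassed.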

24.7 5. Activation Patching

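Activation patching is the core experimental method for identifying circuits. Causal tracing (Meng et al. 2022) is a closely related variant in which the corrupted run is produced by adding noise to the subject token embeddings.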

24.7.1 5.1 Method

Run the model on a clean prompt → record activations at every position and component
Run the model on a corrupted prompt (e.g., with a key entity replaced) → record the output as a baseline
For each component, run a patched forward pass on the corrupted prompt: corrupted activations everywhere except at the target component, where the corrupted activation is replaced with the clean one
Measure the effect on the output: if patching component \(c\) restores the clean output, then \(c\) carries information that is causally important for the behavior
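One common variant of the procedure patches clean activations into a corrupted run and reports the fraction of the clean output that is recovered. The sketch below applies it to a hypothetical three-component toy model written directly in NumPy; the components, the `run` function, and the inputs are invented for illustration and stand in for the hooks a real interpretability library would provide.

```python
import numpy as np

READOUT = np.array([0.0, 0.0, 0.0, 1.0])  # the output reads coordinate 3

# Hypothetical components: each reads the residual stream and returns a vector to add.
COMPONENTS = [
    ("head_a", lambda r: np.array([0.0, 0.0, 0.0, 2.0 * r[0]])),     # moves entity info 0 -> 3
    ("head_b", lambda r: np.array([0.0, 0.0, r[1], 0.0])),           # never reaches the output
    ("mlp",    lambda r: np.array([0.0, 0.0, 0.0, max(r[3], 0.0)])), # amplifies coordinate 3
]

def run(x, patch=None):
    """Forward pass; `patch` maps a component name to an activation that
    overrides that component's own output."""
    patch = patch or {}
    resid, cache = x.astype(float).copy(), {}
    for name, fn in COMPONENTS:
        out = patch[name] if name in patch else fn(resid)
        cache[name] = out
        resid = resid + out
    return float(resid @ READOUT), cache

clean_x = np.array([1.0, 1.0, 0.0, 0.0])     # coordinate 0 encodes the key entity
corrupt_x = np.array([-1.0, 1.0, 0.0, 0.0])  # same prompt with the entity replaced

clean_out, clean_cache = run(clean_x)   # record clean activations
corrupt_out, _ = run(corrupt_x)         # corrupted baseline

effects = {}
for name, _ in COMPONENTS:
    patched_out, _ = run(corrupt_x, patch={name: clean_cache[name]})
    # 1.0 means fully restored to the clean output, 0.0 means no effect.
    effects[name] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print(effects)
```

In this toy, patching `head_a` restores the clean output completely, `head_b` has no effect, and `mlp` restores one third of the gap, because its clean output helps but the upstream damage from `head_a` remains. Graded results like this are typical: patching localizes where information matters, not just whether a component is "in" or "out" of a circuit.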

24.7.2 5.2 Activation Patching and Red Teaming

Activation patching can be used to:

Identify which components are bypassed by successful jailbreaks
Localize where safety training has modified model behavior
Understand why adversarial suffixes, such as those produced by GCG (Greedy Coordinate Gradient), succeed, for example by showing which circuit activations they disrupt

24.7.3 5.3 ROME and MEMIT

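Meng et al. (2022) use causal tracing to show that factual associations are localized in mid-layer MLP modules at the last subject token, then introduce ROME (Rank-One Model Editing) to surgically modify those associations with a rank-one update to a single MLP weight matrix. MEMIT extends the same idea to edit many associations at once by spreading the update across several MLP layers. Together these illustrate a pipeline from interpretability to model editing: localize a mechanism, then modify it.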
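The core of a rank-one edit is a small piece of linear algebra: change a weight matrix so that one chosen key maps to a new value, while inputs orthogonal to that key are unaffected. The sketch below shows a simplified version that assumes an identity key covariance; ROME itself weights the update by an estimated covariance of keys, which is omitted here. The names `k_star` and `v_star` and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 5

W = rng.normal(size=(d_out, d_in))  # stand-in for an MLP output projection
k_star = rng.normal(size=d_in)      # key: representation of the subject
v_star = rng.normal(size=d_out)     # value: encodes the new fact to store

def rank_one_edit(W, k, v):
    # Simplified update (identity covariance assumed):
    #   W' = W + (v - W k) k^T / (k^T k)
    residual = v - W @ k
    return W + np.outer(residual, k) / (k @ k)

W_new = rank_one_edit(W, k_star, v_star)

# A key orthogonal to k_star, to check that unrelated inputs are untouched.
other = rng.normal(size=d_in)
k_other = other - (other @ k_star) / (k_star @ k_star) * k_star

print(np.allclose(W_new @ k_star, v_star))          # the edited key now maps to v_star
print(np.linalg.matrix_rank(W_new - W))             # the change has rank 1
print(np.allclose(W_new @ k_other, W @ k_other))    # orthogonal keys are unchanged
```

In practice keys are not orthogonal to one another, which is one reason the full method accounts for key statistics and why edits can have side effects on related facts.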

24.8 6. Mechanistic Interpretability and the ARENA Curriculum

The ARENA course (ARENA Team 2024) provides hands-on implementation experience with:

ARENA Module	Interpretability Topics
Module 1	Transformer architecture from scratch, residual stream
Module 2	Attention heads, induction heads, in-context learning
Module 3	Sparse autoencoders, feature visualization
Module 4	Safety evals, capability elicitation, interpretability for alignment

ARENA’s approach is to build understanding from first principles — implementing transformers, attention, and interpretability tools from scratch before using pre-built libraries.

Recommended resources from ARENA (ARENA Team 2024): - A Mathematical Framework for Transformer Circuits (Elhage et al. 2021) - Toy Models of Superposition (Elhage et al. 2022) - Interpretability in the Wild (Wang et al. 2022) - Towards Monosemanticity (Bricken et al. 2023)

24.9 7. Relationship to AI Safety

Mechanistic interpretability contributes to AI safety along several axes:

Safety Problem	Interpretability Contribution
Deceptive alignment	Detect circuits that suppress honest self-report
Goal misgeneralization	Identify goal representations; check for distribution-shifted behavior
Jailbreak robustness	Identify refusal circuits; harden them against targeted ablation
Oversight scalability	Enable automated circuit-level auditing of model behavior
Interpretable monitoring	Flag anomalous activation patterns in deployment

24.10 Summary

Mechanistic interpretability reverse-engineers neural networks to understand their internal computations.
The residual stream framework decomposes transformer computation into readable, writable contributions from attention heads and MLPs.
Features are linear directions in activation space; superposition makes direct neuron interpretation unreliable.
Sparse autoencoders disentangle superposed features into interpretable monosemantic directions.
Circuits are causal subgraphs implementing specific behaviors; activation patching identifies them.
ARENA provides hands-on training in transformer internals and interpretability methods.
For safety, the most important circuits are those implementing refusal, deception, and goal representation.

24.11 Further Reading

Elhage et al. (2021) — A mathematical framework for transformer circuits
Elhage et al. (2022) — Toy models of superposition
Wang et al. (2022) — Interpretability in the wild (IOI circuit)
Bricken et al. (2023) — Towards monosemanticity (SAEs)
Templeton et al. (2024) — Scaling monosemanticity
Arditi et al. (2024) — Refusal in LLMs is mediated by a single direction
ARENA Team (2024) — ARENA course, Modules 1–4

Arditi A, Obeso O, Syed A, et al (2024) Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717. https://arxiv.org/abs/2406.11717

ARENA Team (2024) ARENA: Alignment research engineer accelerator. Open-source curriculum on transformer internals, mechanistic interpretability, reinforcement learning, and AI safety evals, https://www.arena.education

Bricken T, Templeton A, Batson J, et al (2023) Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html

Cunningham H, Ewart A, Riggs L, et al (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. https://arxiv.org/abs/2309.08600

Elhage N, Hume T, Olsson C, et al (2022) Toy models of superposition. Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html

Elhage N, Nanda N, Olsson C, et al (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html

Geva M, Schuster R, Berant J, Levy O (2021) Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913. https://arxiv.org/abs/2012.14913

Meng K, Bau D, Andonian A, Belinkov Y (2022) Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35. https://arxiv.org/abs/2202.05262

Olah C, Cammarata N, Schubert L, et al (2020) Zoom in: An introduction to circuits. Distill. https://doi.org/10.23915/distill.00024.001

Templeton A, Conerly T, Marcus J, et al (2024) Scaling monosemanticity: Extracting interpretable features from Claude Sonnet. Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Wang KR, Variengien A, Conmy A, et al (2022) Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593. https://arxiv.org/abs/2211.00593