Introduction to Red Teaming AI Systems
Overview
Red teaming is a structured adversarial evaluation process in which a dedicated team probes an AI system for vulnerabilities, harmful behaviors, and failure modes before deployment. Originally borrowed from military and cybersecurity practice, red teaming has become a cornerstone methodology in responsible AI development [@Ganguli:etal:2022; @Perez:etal:2022].
This chapter introduces the conceptual foundations of AI red teaming: what it is, why it matters, how it differs from traditional security testing, and how it fits within a broader AI safety and alignment workflow.
Learning Objectives
By the end of this chapter, you should be able to:
- Define AI red teaming and distinguish it from standard software security testing.
- Describe the threat modeling process for LLM-based systems.
- Identify the major categories of harm that red teaming aims to surface.
- Understand the relationship between red teaming, alignment, and deployment policy.
- Situate red teaming within established frameworks such as MITRE ATLAS [@MITRE:ATLAS:2023] and NIST AI RMF.
1. What Is AI Red Teaming?
Red teaming an AI system means deliberately attempting to cause it to produce harmful, unintended, or policy-violating outputs. The goal is not to break the system for malicious purposes, but to discover failures before adversaries or users do.
Definition. An AI red team is a group of individuals tasked with adversarially probing a model or system to identify harms, failures, and exploitable behaviors across a range of threat scenarios [@Anthropic:RedTeaming:2023].
Key distinctions from traditional software security:
| Dimension | Traditional Security | AI Red Teaming |
|---|---|---|
| Attack surface | Code, APIs, protocols | Natural language, embeddings, prompts |
| Failure modes | Crashes, privilege escalation | Toxic output, deception, goal misalignment |
| Evaluation metric | CVE severity | Harm taxonomy, ASR (Attack Success Rate) |
| Tools | Fuzzing, pen testing | Prompt crafting, automated adversarial generation |
| Defender | Patches, firewalls | RLHF, constitutional AI, monitoring |
2. Why Red Team AI Systems?
Modern LLMs and agentic AI systems present fundamentally new attack surfaces:
- Emergent capabilities — models can exhibit behaviors not present in training data or anticipated by designers.
- Natural language interface — any user can probe the system without specialized technical skills.
- Agentic action — tool-using agents can take real-world actions (send emails, execute code, browse the web) if manipulated.
- Broad deployment — a single model may be used in millions of contexts with varying safety requirements.
The stakes of not red teaming are high: models have been shown to produce bioweapon synthesis routes, execute social engineering, leak sensitive training data, and be manipulated into disabling their own safety filters [@Zou:etal:2023; @Wei:etal:2023].
3. Threat Modeling for LLM Systems
Effective red teaming begins with a threat model: a structured analysis of who might attack the system, how, and to what end.
3.1 Adversary Taxonomy
| Adversary Type | Motivation | Capability |
|---|---|---|
| Casual user | Curiosity, entertainment | Low — natural language only |
| Motivated misuser | Personal harm, fraud | Medium — prompt engineering |
| Organized actor | Disinformation, radicalization | High — automated + coordinated |
| Insider threat | Data exfiltration, sabotage | High — model/API access |
| Researcher | Disclosure, publication | Variable |
3.2 STRIDE Applied to LLMs
The STRIDE threat modeling framework [@Shostack:2014] maps cleanly onto LLM threat categories:
- Spoofing — impersonating a trusted role (system prompt injection)
- Tampering — poisoning training data or RAG knowledge bases
- Repudiation — generating deniable disinformation
- Information disclosure — extracting system prompts or training data
- Denial of service — prompt flooding, context stuffing
- Escalation of privilege — bypassing safety filters to gain system-level capabilities
3.3 MITRE ATLAS
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) [@MITRE:ATLAS:2023] provides a comprehensive taxonomy of ML-specific tactics and techniques, directly analogous to MITRE ATT&CK for traditional systems. Key tactic categories include:
- ML Model Access
- ML Attack Staging
- ML Attack Execution (Evasion, Extraction, Inference, Poisoning)
- Impact (Integrity violation, privacy violation, availability disruption)
4. Harm Taxonomy
Red teams must have a shared vocabulary for the harms they are looking for. Anthropic’s harm taxonomy [@Anthropic:RedTeaming:2023] includes:
- Harms to the world — physical, psychological, financial, societal
- Harms to Anthropic (or deploying organization) — reputational, legal, political
- Harms by category — CBRN, CSAM, violence, deception, privacy violation, cyberoffense
The severity of a harm is evaluated along multiple axes: - Probability of harm occurring - Counterfactual impact (would the user have found this anyway?) - Severity and reversibility - Breadth (number of people affected) - Proximate vs. distal causation
5. Red Teaming in the AI Development Lifecycle
Red teaming is most effective when integrated throughout development, not just at deployment:
Pre-training → [Data auditing, poisoning checks]
Fine-tuning → [Alignment evaluation, capability elicitation]
RLHF/CAI → [Reward hacking detection, goodharting checks]
Evaluation → [Structured red teaming, automated probing]
Deployment → [Monitoring, incident response, ongoing red teaming]
Red teaming is not a one-time gate. As models are updated and deployment contexts change, the attack surface evolves. Continuous red teaming is essential.
6. Manual vs. Automated Red Teaming
Two complementary approaches:
Manual red teaming relies on human creativity and domain knowledge. Teams write prompts targeting specific harm categories. Advantages: captures nuanced, contextual attacks; identifies novel failure modes. Disadvantages: slow, expensive, limited coverage.
Automated red teaming uses LLMs or optimization algorithms to generate adversarial prompts at scale. Examples include [@Perez:etal:2022; @Zou:etal:2023]: - Using a red-team LLM to generate attacks on a target LLM - Gradient-based suffix optimization (GCG) - Genetic algorithm-based prompt search - Tree-of-attacks-with-pruning (TAP)
Chapters ?@sec-jailbreak-automation and ?@sec-red-teaming-methods cover these techniques in depth.
7. Relationship to AI Alignment
Red teaming and alignment are complementary:
- Alignment asks: how do we train models to be safe and beneficial?
- Red teaming asks: have we succeeded, and where did we fail?
Red team findings feed directly back into alignment work: failures inform RLHF reward models, constitutional AI principles, and fine-tuning datasets. See ?@sec-alignment-intro and ?@sec-rlhf-constitutional-ai.
8. The ARENA Course and Red Teaming
The ARENA (Alignment Research Engineer Accelerator) course [@ARENA:2024] provides a technical foundation for understanding the internals of transformer-based models — a prerequisite for understanding why certain attacks succeed and how to build more robust defenses. ARENA’s mechanistic interpretability curriculum (see ?@sec-mech-interp-intro) is directly relevant to:
- Understanding how safety filters are implemented in residual stream features
- Identifying circuits responsible for refusal behaviors
- Detecting when a model is in a “jailbroken” internal state
Summary
- Red teaming is structured adversarial evaluation aimed at discovering harmful behaviors before deployment.
- A threat model (adversary type, attack vector, harm category) is essential before red teaming begins.
- MITRE ATLAS and STRIDE provide structured frameworks for enumerating threats.
- Red teaming is most valuable when integrated throughout the development lifecycle, not applied only at the end.
- Manual and automated red teaming are complementary; neither alone provides sufficient coverage.
Further Reading
- @Ganguli:etal:2022 — Anthropic’s large-scale red teaming study
- @Perez:etal:2022 — Red teaming LLMs with LLMs
- @MITRE:ATLAS:2023 — MITRE ATLAS framework
- @NIST:AIRGMF:2023 — NIST AI Risk Management Framework
- @ARENA:2024 — ARENA course, Module 4: Safety and Evals