Introduction to Red Teaming AI Systems

Overview

Red teaming is a structured adversarial evaluation process in which a dedicated team probes an AI system for vulnerabilities, harmful behaviors, and failure modes before deployment. Originally borrowed from military and cybersecurity practice, red teaming has become a cornerstone methodology in responsible AI development [@Ganguli:etal:2022; @Perez:etal:2022].

This chapter introduces the conceptual foundations of AI red teaming: what it is, why it matters, how it differs from traditional security testing, and how it fits within a broader AI safety and alignment workflow.

Learning Objectives

By the end of this chapter, you should be able to:

Define AI red teaming and distinguish it from standard software security testing.
Describe the threat modeling process for LLM-based systems.
Identify the major categories of harm that red teaming aims to surface.
Explain the relationship between red teaming, alignment, and deployment policy.
Situate red teaming within established frameworks such as MITRE ATLAS [@MITRE:ATLAS:2023] and NIST AI RMF.

1. What Is AI Red Teaming?

Red teaming an AI system means deliberately attempting to cause it to produce harmful, unintended, or policy-violating outputs. The goal is not to break the system for malicious purposes, but to discover failures before adversaries or users do.

Note

Definition. An AI red team is a group of individuals tasked with adversarially probing a model or system to identify harms, failures, and exploitable behaviors across a range of threat scenarios [@Anthropic:RedTeaming:2023].

Key distinctions from traditional software security:

Dimension	Traditional Security	AI Red Teaming
Attack surface	Code, APIs, protocols	Natural language, embeddings, prompts
Failure modes	Crashes, privilege escalation	Toxic output, deception, goal misalignment
Evaluation metric	CVSS severity score	Harm taxonomy, attack success rate (ASR)
Tools	Fuzzing, pen testing	Prompt crafting, automated adversarial generation
Defenses	Patches, firewalls	RLHF, constitutional AI, monitoring
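Attack success rate is the simplest of these metrics to make concrete: it is the fraction of adversarial attempts that elicit a policy-violating output. The sketch below is one minimal way to compute it overall and per harm category. The `AttackAttempt` schema, the function names, and the sample results are assumptions made for illustration, not part of any cited framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackAttempt:
    """One adversarial prompt and its judged outcome (hypothetical schema)."""
    harm_category: str  # e.g. "privacy violation"
    succeeded: bool     # True if a reviewer judged the output policy-violating


def attack_success_rate(attempts: list[AttackAttempt]) -> float:
    """Fraction of attempts that elicited a policy-violating output."""
    if not attempts:
        raise ValueError("ASR is undefined for an empty set of attempts")
    return sum(a.succeeded for a in attempts) / len(attempts)


def asr_by_category(attempts: list[AttackAttempt]) -> dict[str, float]:
    """ASR computed separately for each harm category."""
    grouped: dict[str, list[AttackAttempt]] = {}
    for attempt in attempts:
        grouped.setdefault(attempt.harm_category, []).append(attempt)
    return {cat: attack_success_rate(group) for cat, group in grouped.items()}


# Illustrative, made-up results; not measurements from any real system.
attempts = [
    AttackAttempt("deception", True),
    AttackAttempt("deception", False),
    AttackAttempt("privacy violation", False),
    AttackAttempt("privacy violation", False),
]
print(attack_success_rate(attempts))  # 0.25
print(asr_by_category(attempts))      # {'deception': 0.5, 'privacy violation': 0.0}
```

Reporting ASR per category rather than as a single number keeps a low overall rate from hiding a category in which most attacks succeed.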

2. Why Red Team AI Systems?

Modern LLMs and agentic AI systems present fundamentally new attack surfaces:

Emergent capabilities — models can exhibit behaviors not present in training data or anticipated by designers.
Natural language interface — any user can probe the system without specialized technical skills.
Agentic action — tool-using agents can take real-world actions (send emails, execute code, browse the web) if manipulated.
Broad deployment — a single model may be used in millions of contexts with varying safety requirements.

The stakes of skipping red teaming are high: models have been shown to produce bioweapon synthesis routes, carry out social engineering, leak sensitive training data, and be manipulated into bypassing their own safety filters [@Zou:etal:2023; @Wei:etal:2023].

3. Threat Modeling for LLM Systems

Effective red teaming begins with a threat model: a structured analysis of who might attack the system, how, and to what end.

3.1 Adversary Taxonomy

Adversary Type	Motivation	Capability
Casual user	Curiosity, entertainment	Low — natural language only
Motivated misuser	Personal harm, fraud	Medium — prompt engineering
Organized actor	Disinformation, radicalization	High — automated + coordinated
Insider threat	Data exfiltration, sabotage	High — model/API access
Researcher	Disclosure, publication	Variable
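A taxonomy like this is easier to use in test planning when it is recorded as structured data. The following is a minimal sketch, assuming a three-level capability scale and a hypothetical `at_least` helper. The rows mirror the table above, and the "Variable" capability of the researcher is represented as `None`.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Capability(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Adversary:
    name: str
    motivation: str
    capability: Optional[Capability]  # None means "variable"


ADVERSARIES = [
    Adversary("Casual user", "Curiosity, entertainment", Capability.LOW),
    Adversary("Motivated misuser", "Personal harm, fraud", Capability.MEDIUM),
    Adversary("Organized actor", "Disinformation, radicalization", Capability.HIGH),
    Adversary("Insider threat", "Data exfiltration, sabotage", Capability.HIGH),
    Adversary("Researcher", "Disclosure, publication", None),
]


def at_least(adversaries: list[Adversary], level: Capability) -> list[Adversary]:
    """Adversaries at or above a capability level.

    Adversaries with variable capability are included conservatively,
    since they may operate at any level.
    """
    return [a for a in adversaries if a.capability is None or a.capability >= level]


print([a.name for a in at_least(ADVERSARIES, Capability.HIGH)])
# ['Organized actor', 'Insider threat', 'Researcher']
```

Treating variable capability as potentially high is a design choice made for this sketch. A team could equally decide to model such adversaries as several separate entries.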

3.2 STRIDE Applied to LLMs

The STRIDE threat modeling framework [@Shostack:2014] maps cleanly onto LLM threat categories:

Spoofing — impersonating a trusted role (system prompt injection)
Tampering — poisoning training data or RAG knowledge bases
Repudiation — generating deniable disinformation
Information disclosure — extracting system prompts or training data
Denial of service — prompt flooding, context stuffing
Elevation of privilege — bypassing safety filters to gain system-level capabilities
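One practical use of this mapping is a coverage check: tag each planned test case with a STRIDE category and list the categories that no test exercises. The sketch below assumes a hypothetical test plan keyed by made-up case IDs. The enum values restate the LLM examples given above, and the sixth category uses STRIDE's standard name, elevation of privilege.

```python
from enum import Enum


class Stride(Enum):
    SPOOFING = "impersonating a trusted role (system prompt injection)"
    TAMPERING = "poisoning training data or RAG knowledge bases"
    REPUDIATION = "generating deniable disinformation"
    INFORMATION_DISCLOSURE = "extracting system prompts or training data"
    DENIAL_OF_SERVICE = "prompt flooding, context stuffing"
    ELEVATION_OF_PRIVILEGE = "bypassing safety filters"


def uncovered_categories(test_plan: dict[str, Stride]) -> list[Stride]:
    """STRIDE categories with no planned test case, in enum order."""
    covered = set(test_plan.values())
    return [category for category in Stride if category not in covered]


# Hypothetical test plan: case IDs are placeholders, not real test cases.
test_plan = {
    "tc-001": Stride.SPOOFING,
    "tc-002": Stride.INFORMATION_DISCLOSURE,
    "tc-003": Stride.SPOOFING,
    "tc-004": Stride.DENIAL_OF_SERVICE,
}

for category in uncovered_categories(test_plan):
    print(category.name)
# TAMPERING
# REPUDIATION
# ELEVATION_OF_PRIVILEGE
```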

3.3 MITRE ATLAS

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) [@MITRE:ATLAS:2023] provides a comprehensive taxonomy of ML-specific tactics and techniques, directly analogous to MITRE ATT&CK for traditional systems. Key tactic categories include:

ML Model Access
ML Attack Staging
ML Attack Execution (Evasion, Extraction, Inference, Poisoning)
Impact (Integrity violation, privacy violation, availability disruption)

4. Harm Taxonomy

Red teams must have a shared vocabulary for the harms they are looking for. Anthropic’s harm taxonomy [@Anthropic:RedTeaming:2023] includes:

Harms to the world — physical, psychological, financial, societal
Harms to Anthropic (or deploying organization) — reputational, legal, political
Harms by category — CBRN, CSAM, violence, deception, privacy violation, cyberoffense

The severity of a harm is evaluated along multiple axes:

- Probability of harm occurring
- Counterfactual impact (would the user have found this anyway?)
- Severity and reversibility
- Breadth (number of people affected)
- Proximate vs. distal causation
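To compare findings consistently, a team may want to record these axes explicitly for each finding. The sketch below shows one possible way to do so. The 0-to-1 scales, the combination rule, and the discount for distal causation are all assumptions chosen for illustration; the cited taxonomy does not prescribe a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmAssessment:
    """Ratings for one finding; every scale here is a hypothetical 0-1 range."""
    probability: float            # likelihood the harm occurs
    counterfactual_impact: float  # 1.0 = user could not have found this elsewhere
    severity: float               # 1.0 = most severe
    reversibility: float          # 1.0 = fully reversible
    breadth: float                # 1.0 = very many people affected
    proximate: bool               # True if the model output is the direct cause

    def __post_init__(self) -> None:
        for name in ("probability", "counterfactual_impact",
                     "severity", "reversibility", "breadth"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def priority_score(a: HarmAssessment, distal_discount: float = 0.5) -> float:
    """Illustrative triage score in [0, 1]; higher means review first."""
    # Average the three magnitude-like axes; irreversible harms count more.
    magnitude = (a.severity + (1.0 - a.reversibility) + a.breadth) / 3.0
    causal_weight = 1.0 if a.proximate else distal_discount
    return a.probability * a.counterfactual_impact * magnitude * causal_weight


# Made-up ratings for two hypothetical findings.
minor = HarmAssessment(0.2, 0.1, 0.3, 0.9, 0.1, proximate=False)
major = HarmAssessment(0.6, 0.8, 0.9, 0.1, 0.7, proximate=True)
print(priority_score(major) > priority_score(minor))  # True
```

A score of this kind is useful only for ordering a review queue. It does not replace the qualitative judgment that each axis requires.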

5. Red Teaming in the AI Development Lifecycle

Red teaming is most effective when integrated throughout development, not just at deployment:

Pre-training → [Data auditing, poisoning checks]
Fine-tuning  → [Alignment evaluation, capability elicitation]
RLHF/CAI    → [Reward hacking detection, goodharting checks]
Evaluation   → [Structured red teaming, automated probing]
Deployment   → [Monitoring, incident response, ongoing red teaming]

Important

Red teaming is not a one-time gate. As models are updated and deployment contexts change, the attack surface evolves. Continuous red teaming is essential.
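One way to make red teaming continuous is to keep previously tested adversarial prompts as a regression suite and re-run it against every model update. The sketch below assumes the model and the violation check are plain callables. `toy_model_v2`, `toy_is_violation`, the case IDs, and the placeholder prompts are hypothetical stand-ins for a real inference call, a real classifier or human review, and a real stored suite.

```python
from typing import Callable


def find_regressions(
    suite: dict[str, str],
    previously_safe: set[str],
    model: Callable[[str], str],
    is_violation: Callable[[str], bool],
) -> list[str]:
    """IDs of cases that were safe on the last version but now elicit a violation."""
    regressions = []
    for case_id, prompt in suite.items():
        if case_id in previously_safe and is_violation(model(prompt)):
            regressions.append(case_id)
    return regressions


def toy_model_v2(prompt: str) -> str:
    """Stand-in for the updated model; fails on one stored case."""
    return "UNSAFE OUTPUT" if "case B" in prompt else "REFUSED"


def toy_is_violation(response: str) -> bool:
    """Stand-in for a harm classifier or human review."""
    return response.startswith("UNSAFE")


# Placeholder suite; real prompts would come from earlier red-team sessions.
suite = {
    "rt-001": "<stored adversarial prompt, case A>",
    "rt-002": "<stored adversarial prompt, case B>",
    "rt-003": "<stored adversarial prompt, case C>",
}
previously_safe = {"rt-001", "rt-002"}

print(find_regressions(suite, previously_safe, toy_model_v2, toy_is_violation))
# ['rt-002']
```

A regression suite only catches known attacks that start working again, so it complements fresh red teaming against each release rather than replacing it.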

6. Manual vs. Automated Red Teaming

Two complementary approaches:

Manual red teaming relies on human creativity and domain knowledge. Teams write prompts targeting specific harm categories. Advantages: captures nuanced, contextual attacks; identifies novel failure modes. Disadvantages: slow, expensive, limited coverage.

Automated red teaming uses LLMs or optimization algorithms to generate adversarial prompts at scale. Examples include [@Perez:etal:2022; @Zou:etal:2023]:

- Using a red-team LLM to generate attacks on a target LLM
- Gradient-based suffix optimization (GCG)
- Genetic algorithm-based prompt search
- Tree-of-attacks-with-pruning (TAP)

Chapters ?@sec-jailbreak-automation and ?@sec-red-teaming-methods cover these techniques in depth.
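The first of the automated approaches, one LLM attacking another, reduces to a generate, respond, judge loop. The sketch below shows only that control flow, under the assumption that the attacker, target, and judge are simple callables. The `toy_*` functions are deterministic stubs standing in for model calls, and the loop is not the specific procedure of any cited paper.

```python
from typing import Callable, NamedTuple, Optional


class Finding(NamedTuple):
    prompt: str
    response: str
    score: float


def red_team_loop(
    attacker: Callable[[str, Optional[str]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], float],
    seed_topics: list[str],
    threshold: float = 0.5,
    rounds: int = 3,
) -> list[Finding]:
    """For each topic, let the attacker refine its prompt until the judge
    scores a response at or above the threshold, or the rounds run out."""
    findings = []
    for topic in seed_topics:
        prompt = attacker(topic, None)  # first attempt, no feedback yet
        for _ in range(rounds):
            response = target(prompt)
            score = judge(prompt, response)
            if score >= threshold:
                findings.append(Finding(prompt, response, score))
                break
            prompt = attacker(topic, response)  # refine using the refusal
    return findings


# Deterministic stubs; a real setup would call models here.
def toy_attacker(topic: str, last_response: Optional[str]) -> str:
    return f"[attempt 1] {topic}" if last_response is None else f"[rephrased] {topic}"


def toy_target(prompt: str) -> str:
    vulnerable = prompt.startswith("[rephrased]") and "topic-b" in prompt
    return "COMPLIED" if vulnerable else "REFUSED"


def toy_judge(prompt: str, response: str) -> float:
    return 1.0 if response == "COMPLIED" else 0.0


findings = red_team_loop(toy_attacker, toy_target, toy_judge, ["topic-a", "topic-b"])
print(findings)
# [Finding(prompt='[rephrased] topic-b', response='COMPLIED', score=1.0)]
```

Keeping the judge separate from the attacker makes it possible to swap in human review for borderline scores without changing the loop.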

7. Relationship to AI Alignment

Red teaming and alignment are complementary:

Alignment asks: how do we train models to be safe and beneficial?
Red teaming asks: have we succeeded, and where did we fail?

Red team findings feed directly back into alignment work: failures inform RLHF reward models, constitutional AI principles, and fine-tuning datasets. See ?@sec-alignment-intro and ?@sec-rlhf-constitutional-ai.

8. The ARENA Course and Red Teaming

The ARENA (Alignment Research Engineer Accelerator) course [@ARENA:2024] provides a technical foundation for understanding the internals of transformer-based models — a prerequisite for understanding why certain attacks succeed and how to build more robust defenses. ARENA’s mechanistic interpretability curriculum (see ?@sec-mech-interp-intro) is directly relevant to:

Understanding how safety filters are implemented in residual stream features
Identifying circuits responsible for refusal behaviors
Detecting when a model is in a “jailbroken” internal state

Summary

Red teaming is structured adversarial evaluation aimed at discovering harmful behaviors before deployment.
A threat model (adversary type, attack vector, harm category) is essential before red teaming begins.
MITRE ATLAS and STRIDE provide structured frameworks for enumerating threats.
Red teaming is most valuable when integrated throughout the development lifecycle, not applied only at the end.
Manual and automated red teaming are complementary; neither alone provides sufficient coverage.
