Introduction to Red Teaming AI Systems

Overview

Red teaming is a structured adversarial evaluation process in which a dedicated team probes an AI system for vulnerabilities, harmful behaviors, and failure modes before deployment. Originally borrowed from military and cybersecurity practice, red teaming has become a cornerstone methodology in responsible AI development [@Ganguli:etal:2022; @Perez:etal:2022].

This chapter introduces the conceptual foundations of AI red teaming: what it is, why it matters, how it differs from traditional security testing, and how it fits within a broader AI safety and alignment workflow.

Learning Objectives

By the end of this chapter, you should be able to:

  1. Define AI red teaming and distinguish it from standard software security testing.
  2. Describe the threat modeling process for LLM-based systems.
  3. Identify the major categories of harm that red teaming aims to surface.
  4. Understand the relationship between red teaming, alignment, and deployment policy.
  5. Situate red teaming within established frameworks such as MITRE ATLAS [@MITRE:ATLAS:2023] and NIST AI RMF.

1. What Is AI Red Teaming?

Red teaming an AI system means deliberately attempting to cause it to produce harmful, unintended, or policy-violating outputs. The goal is not to break the system for malicious purposes, but to discover failures before adversaries or users do.

Note

Definition. An AI red team is a group of individuals tasked with adversarially probing a model or system to identify harms, failures, and exploitable behaviors across a range of threat scenarios [@Anthropic:RedTeaming:2023].

Key distinctions from traditional software security:

Dimension Traditional Security AI Red Teaming
Attack surface Code, APIs, protocols Natural language, embeddings, prompts
Failure modes Crashes, privilege escalation Toxic output, deception, goal misalignment
Evaluation metric CVE severity Harm taxonomy, ASR (Attack Success Rate)
Tools Fuzzing, pen testing Prompt crafting, automated adversarial generation
Defender Patches, firewalls RLHF, constitutional AI, monitoring

2. Why Red Team AI Systems?

Modern LLMs and agentic AI systems present fundamentally new attack surfaces:

  • Emergent capabilities — models can exhibit behaviors not present in training data or anticipated by designers.
  • Natural language interface — any user can probe the system without specialized technical skills.
  • Agentic action — tool-using agents can take real-world actions (send emails, execute code, browse the web) if manipulated.
  • Broad deployment — a single model may be used in millions of contexts with varying safety requirements.

The stakes of not red teaming are high: models have been shown to produce bioweapon synthesis routes, execute social engineering, leak sensitive training data, and be manipulated into disabling their own safety filters [@Zou:etal:2023; @Wei:etal:2023].


3. Threat Modeling for LLM Systems

Effective red teaming begins with a threat model: a structured analysis of who might attack the system, how, and to what end.

3.1 Adversary Taxonomy

Adversary Type Motivation Capability
Casual user Curiosity, entertainment Low — natural language only
Motivated misuser Personal harm, fraud Medium — prompt engineering
Organized actor Disinformation, radicalization High — automated + coordinated
Insider threat Data exfiltration, sabotage High — model/API access
Researcher Disclosure, publication Variable

3.2 STRIDE Applied to LLMs

The STRIDE threat modeling framework [@Shostack:2014] maps cleanly onto LLM threat categories:

  • Spoofing — impersonating a trusted role (system prompt injection)
  • Tampering — poisoning training data or RAG knowledge bases
  • Repudiation — generating deniable disinformation
  • Information disclosure — extracting system prompts or training data
  • Denial of service — prompt flooding, context stuffing
  • Escalation of privilege — bypassing safety filters to gain system-level capabilities

3.3 MITRE ATLAS

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) [@MITRE:ATLAS:2023] provides a comprehensive taxonomy of ML-specific tactics and techniques, directly analogous to MITRE ATT&CK for traditional systems. Key tactic categories include:

  • ML Model Access
  • ML Attack Staging
  • ML Attack Execution (Evasion, Extraction, Inference, Poisoning)
  • Impact (Integrity violation, privacy violation, availability disruption)

4. Harm Taxonomy

Red teams must have a shared vocabulary for the harms they are looking for. Anthropic’s harm taxonomy [@Anthropic:RedTeaming:2023] includes:

  1. Harms to the world — physical, psychological, financial, societal
  2. Harms to Anthropic (or deploying organization) — reputational, legal, political
  3. Harms by category — CBRN, CSAM, violence, deception, privacy violation, cyberoffense

The severity of a harm is evaluated along multiple axes: - Probability of harm occurring - Counterfactual impact (would the user have found this anyway?) - Severity and reversibility - Breadth (number of people affected) - Proximate vs. distal causation


5. Red Teaming in the AI Development Lifecycle

Red teaming is most effective when integrated throughout development, not just at deployment:

Pre-training → [Data auditing, poisoning checks]
Fine-tuning  → [Alignment evaluation, capability elicitation]
RLHF/CAI    → [Reward hacking detection, goodharting checks]
Evaluation   → [Structured red teaming, automated probing]
Deployment   → [Monitoring, incident response, ongoing red teaming]
Important

Red teaming is not a one-time gate. As models are updated and deployment contexts change, the attack surface evolves. Continuous red teaming is essential.


6. Manual vs. Automated Red Teaming

Two complementary approaches:

Manual red teaming relies on human creativity and domain knowledge. Teams write prompts targeting specific harm categories. Advantages: captures nuanced, contextual attacks; identifies novel failure modes. Disadvantages: slow, expensive, limited coverage.

Automated red teaming uses LLMs or optimization algorithms to generate adversarial prompts at scale. Examples include [@Perez:etal:2022; @Zou:etal:2023]: - Using a red-team LLM to generate attacks on a target LLM - Gradient-based suffix optimization (GCG) - Genetic algorithm-based prompt search - Tree-of-attacks-with-pruning (TAP)

Chapters ?@sec-jailbreak-automation and ?@sec-red-teaming-methods cover these techniques in depth.


7. Relationship to AI Alignment

Red teaming and alignment are complementary:

  • Alignment asks: how do we train models to be safe and beneficial?
  • Red teaming asks: have we succeeded, and where did we fail?

Red team findings feed directly back into alignment work: failures inform RLHF reward models, constitutional AI principles, and fine-tuning datasets. See ?@sec-alignment-intro and ?@sec-rlhf-constitutional-ai.


8. The ARENA Course and Red Teaming

The ARENA (Alignment Research Engineer Accelerator) course [@ARENA:2024] provides a technical foundation for understanding the internals of transformer-based models — a prerequisite for understanding why certain attacks succeed and how to build more robust defenses. ARENA’s mechanistic interpretability curriculum (see ?@sec-mech-interp-intro) is directly relevant to:

  • Understanding how safety filters are implemented in residual stream features
  • Identifying circuits responsible for refusal behaviors
  • Detecting when a model is in a “jailbroken” internal state

Summary

  • Red teaming is structured adversarial evaluation aimed at discovering harmful behaviors before deployment.
  • A threat model (adversary type, attack vector, harm category) is essential before red teaming begins.
  • MITRE ATLAS and STRIDE provide structured frameworks for enumerating threats.
  • Red teaming is most valuable when integrated throughout the development lifecycle, not applied only at the end.
  • Manual and automated red teaming are complementary; neither alone provides sufficient coverage.

Further Reading

  • @Ganguli:etal:2022 — Anthropic’s large-scale red teaming study
  • @Perez:etal:2022 — Red teaming LLMs with LLMs
  • @MITRE:ATLAS:2023 — MITRE ATLAS framework
  • @NIST:AIRGMF:2023 — NIST AI Risk Management Framework
  • @ARENA:2024 — ARENA course, Module 4: Safety and Evals