This article defines the structure of vendor-model red teaming, distinguishes it from the broader concept of red teaming applied to systems the deployer builds itself, anchors the practice to current standards, and explains the operational realities of doing it well within the time and access constraints that procured models impose.
Why Vendor-Supplied Evidence Is Insufficient
A vendor’s evaluation cannot anticipate the deployer’s threat model. Three asymmetries explain this.
The first is distribution shift. Public benchmarks and provider-internal evaluations sample distributions that are unlikely to match the deployer’s domain. A model that scores 92 percent on a generic question-answering benchmark may score 60 percent on the deployer’s actual customer queries.
The second is prompt-template specificity. Real deployments wrap models in system prompts, retrieval pipelines, and tool integrations. The model’s behaviour inside that wrapper is not the model’s behaviour in isolation. The U.S. National Institute of Standards and Technology (NIST) AI Risk Management Framework (AI RMF) at https://www.nist.gov/itl/ai-risk-management-framework MEASURE function explicitly anchors evaluation to deployment context, not to model isolation.
The third is adversary specificity. The vendor’s red team probes for general failure modes. The deployer’s adversaries — fraudsters, social engineers, prompt injectors, regulatory probers, journalists — have specific motivations the vendor cannot anticipate. The European Union (EU) AI Act, accessible at https://artificialintelligenceact.eu/, requires deployers of high-risk systems to perform their own monitoring under Article 26; vendor evaluation does not substitute.
The Eight Test Categories
A defensible vendor-model red team covers eight categories, each with its own protocol, evidence type, and acceptance threshold.
1. Domain Accuracy
How does the model perform on the deployer’s actual workload? The test set is curated from real (suitably anonymised) examples. Acceptance is defined relative to a clearly stated baseline — current process, alternative model, or human performance.
2. Refusal Behaviour
Does the model refuse to answer questions it should answer, and answer questions it should refuse? A deployer-specific refusal taxonomy captures the boundary between acceptable and unacceptable. The Stanford Foundation Model Transparency Index at https://crfm.stanford.edu/fmti/ documents that refusal behaviour varies enormously across providers and across versions of the same provider.
3. Harmful-Output Tests
Toxicity, defamation, dangerous instructions, and self-harm content. Standardised harmful-content corpora exist; the deployer adds use-case-specific harms (false medical advice for a healthcare deployment, false legal advice for a legal deployment, false financial recommendations for a financial deployment).
4. Bias and Fairness
Demographic disparities in accuracy, refusal, sentiment, and recommendation. NIST AI RMF MEASURE-2.11 anchors the requirement; the deployer constructs probes that exercise the protected attributes relevant to the use case and the applicable legal regime.
5. Prompt Injection and Jailbreak Resistance
Direct and indirect prompt injection through user input, retrieved documents, tool outputs, and system messages. The Cloud Security Alliance at https://cloudsecurityalliance.org/ has published prompt-injection threat-model materials that anchor a structured probe set.
6. Data Exfiltration and Privacy
Will the model emit personally identifiable information from its training data? From the prompt or retrieval context? From other tenants’ data through cross-tenant attacks? Probes use known canary strings and structured extraction techniques.
7. Tool, Agent, and Function-Call Misuse
For models that invoke tools, can the model be induced to take actions outside policy? Can it be manipulated into producing function calls with elevated privileges, into making external network calls, or into recursive agent invocations?
8. Operational Drift Susceptibility
Does the model’s behaviour change between identical prompts run minutes, hours, or days apart? If so, by how much, and is the variation within the deployer’s tolerance? Probes are run repeatedly across time to measure drift.
Standards That Anchor the Practice
Three normative anchors define what defensible red teaming looks like.
The U.S. NIST AI Risk Management Framework at https://www.nist.gov/itl/ai-risk-management-framework defines the GOVERN, MAP, MEASURE, and MANAGE functions. Red teaming sits in MEASURE-2 (evaluating risks) and MANAGE-2.3 (responding to identified risks). GOVERN-6 establishes the third-party governance umbrella.
The International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) 42001:2023 standard at https://www.iso.org/standard/81230.html includes Annex A controls that require evaluation of AI systems against intended use, including pre-deployment testing of procured components.
The EU AI Act Article 55 requires General-Purpose AI providers of systemic-risk models to perform “state-of-the-art” model evaluations including adversarial testing. The deployer benefits from this upstream work but, under Article 25, must still discharge its own deployer-side evaluation obligations. Article 73 incident-notification timelines apply to the post-production residue of inadequate pre-production testing.
The U.S. Cybersecurity and Infrastructure Security Agency (CISA) Software Bill of Materials programme at https://www.cisa.gov/sbom and the Supply-chain Levels for Software Artifacts (SLSA) framework at https://slsa.dev/ together require that the model artefact tested is the same artefact deployed — a non-trivial requirement when models are served through APIs that may serve different versions to different callers.
Operational Realities
Pre-production red teaming has three structural challenges.
The first is access. Some vendors restrict access to evaluation tooling. Some rate-limit aggressive testing. Some forbid certain probe categories under terms of service. Establishing red-team access rights belongs in the contract (Article 4 of this module).
The second is reproducibility. Models are often non-deterministic. Probes must be run many times per condition; results must be summarised statistically rather than reported as single trials.
The third is scope creep into systems testing. Red teaming should focus on the model layer; system-level red teaming (the integration, the prompts, the retrieval pipeline, the tools) is a separate exercise that consumes the vendor red-team output as input. Mixing the two produces results that cannot be acted upon.
Maturity Indicators
| Maturity | What vendor-model red teaming looks like |
|---|---|
| Foundational (1) | Vendor models are accepted on the basis of vendor-supplied evidence; no deployer-side testing occurs. |
| Developing (2) | Ad hoc testing is performed for high-profile deployments; coverage and methodology are inconsistent. |
| Defined (3) | All eight categories are exercised for every model above the standard tier; results are scored and gate production approval. |
| Advanced (4) | Red teaming runs in continuous-integration pipelines; model updates trigger automatic re-testing; deviations halt deployment. |
| Transformational (5) | The organization contributes red-team probe sets to industry consortia and influences vendor pre-release evaluation practice. |
Practical Application
A logistics company evaluating two vendor models for an automated customer-communication assistant should not select the higher-scoring public-benchmark model by default. It should construct a 500-to-1000-example evaluation set covering the eight categories with use-case-specific probes, run the set against both models under representative system prompts, score the results, and produce a pre-production decision memo. The memo records what was tested, what was found, what residual risk was accepted, and who accepted it. Where one model wins on accuracy but loses on jailbreak resistance, the trade-off is named explicitly and approved at the appropriate authority. The artefact that justifies production is the memo, not the vendor’s marketing.
The next article (Article 10) addresses the post-production complement to pre-production red teaming: continuous monitoring of vendor model behaviour as the model, the world, and the use of the system evolve.