This article describes the model card concept, the canonical structure, the variations that have emerged for specific contexts (foundation models, medical AI, large language models), and the practices that determine whether the cards actually get used.
Origins and Adoption
The model card concept was introduced in the 2019 paper Model Cards for Model Reporting by Mitchell et al. at Google, available at https://arxiv.org/abs/1810.03993. The paper proposed a structured documentation format that addresses the gap left by traditional model documentation, which tended toward either pure technical specifications (incomprehensible to outsiders) or marketing material (insufficient for technical decisions).
Adoption has been broad and rapid. Google’s Model Card Toolkit, the Hugging Face Hub model card template at https://huggingface.co/docs/hub/model-cards, IBM’s AI FactSheets, and Microsoft’s Responsible AI Toolbox have all converged on substantially similar structures. Article 11 of the European Union AI Act on technical documentation, at https://artificialintelligenceact.eu/article/11/, together with Annex IV, codifies documentation requirements for high-risk systems that map closely to the model card structure.
The Canonical Structure
A complete model card answers nine questions.
1. Model Details
The card opens with identity: the model name, the version, the date, the organisation responsible, the type (classifier, regressor, sequence-to-sequence, language model, vision encoder), and the contact for questions or issues.
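In machine-readable form, the identity block might look like the following sketch. The field names are illustrative rather than a standard schema, and the values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelDetails:
    """Identity block of a model card. Field names are illustrative,
    not a standardised schema."""
    name: str
    version: str
    date: str           # release date, ISO 8601
    organisation: str
    model_type: str     # e.g. "classifier", "language model"
    contact: str        # channel for questions or issues

details = ModelDetails(
    name="retail-fraud-detector",
    version="2.3.0",
    date="2025-06-01",
    organisation="Example Corp",
    model_type="binary classifier",
    contact="ml-governance@example.com",
)
```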
2. Intended Use
This section names the scenarios the model was designed for, explicitly distinguishes them from out-of-scope uses, and identifies the primary user populations. The section should be specific: “fraud detection on retail credit card transactions in the United States” rather than “fraud detection.” Specificity here prevents the most common misuse pattern — applying a model to a population it was not designed for.
3. Factors
The factors section names the relevant subgroups, conditions, and instrumentation that influence model behaviour. For a face-recognition model: skin tone, age, gender, lighting conditions, camera type. For a credit decision model: protected demographic groups, income bands, geographies. The factors should match the subgroups the evaluation will report on; without this alignment the card cannot demonstrate fitness for purpose across populations.
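One way to keep the card and the evaluation aligned is to declare the factors once and validate the evaluation report against them. The sketch below assumes a face-recognition model; the factor names, levels, and report shape are all illustrative.

```python
# Factors declared once; the evaluation report must cover exactly these
# keys and levels. Names and levels are illustrative.
FACTORS = {
    "skin_tone": ["I-II", "III-IV", "V-VI"],   # e.g. grouped Fitzpatrick types
    "lighting": ["daylight", "indoor", "low_light"],
    "camera": ["webcam", "dslr", "phone"],
}

def validate_subgroup_coverage(report: dict) -> None:
    """Raise if the evaluation report omits any declared factor or level."""
    for factor, levels in FACTORS.items():
        reported = report.get(factor, {})
        missing = [level for level in levels if level not in reported]
        if missing:
            raise ValueError(f"factor {factor!r} missing levels: {missing}")

try:
    validate_subgroup_coverage({
        "skin_tone": {"I-II": 0.97, "III-IV": 0.96, "V-VI": 0.91},
        "lighting": {"daylight": 0.97, "indoor": 0.95},   # low_light missing
        "camera": {"webcam": 0.94, "dslr": 0.97, "phone": 0.95},
    })
except ValueError as err:
    print(err)   # factor 'lighting' missing levels: ['low_light']
```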
4. Metrics
Performance metrics with explicit definitions. For a binary classifier: precision, recall, F1, AUC, calibration. For a generative model: BLEU, ROUGE, perplexity, human-rated quality. Crucially, metrics should be reported on the subgroups identified in the factors section, not just on the aggregate population. The Algorithmic Justice League’s body of fairness measurement work at https://www.ajl.org/ informs which subgroup breakdowns matter most.
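As a sketch of subgroup-level reporting, the helper below computes precision, recall, and F1 for each subgroup alongside the aggregate, using scikit-learn; the function name and report shape are assumptions, not a standard API.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def metrics_by_subgroup(y_true, y_pred, subgroup):
    """Report precision/recall/F1 per subgroup, alongside the aggregate."""
    y_true, y_pred, subgroup = map(np.asarray, (y_true, y_pred, subgroup))
    slices = {"overall": (y_true, y_pred)}
    for g in np.unique(subgroup):
        mask = subgroup == g
        slices[str(g)] = (y_true[mask], y_pred[mask])
    return {
        name: {
            "precision": precision_score(t, p, zero_division=0),
            "recall": recall_score(t, p, zero_division=0),
            "f1": f1_score(t, p, zero_division=0),
            "n": len(t),
        }
        for name, (t, p) in slices.items()
    }
```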
5. Evaluation Data
The datasets used for evaluation, their composition, their source, and any relevant preprocessing. This is the section that allows an external reviewer to judge whether the evaluation was conducted on a representative sample. Datasheets for Datasets (covered in the next article) provide the supporting structure.
6. Training Data
The datasets used for training, their composition, their source, and any preprocessing or sampling decisions. For privacy-sensitive contexts the section may need to abstract specific records while still describing the population. The Stanford Center for Research on Foundation Models has shown through the Foundation Model Transparency Index at https://crfm.stanford.edu/fmti/ how much variance exists between providers in training data disclosure quality.
7. Quantitative Analyses
The hard numbers: confusion matrices, calibration plots, fairness metrics, error analysis by subgroup. This section should support the claims made in earlier sections rather than restating them in narrative form.
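A calibration plot is often summarised numerically as expected calibration error (ECE). The sketch below assumes a binary classifier that reports the probability of the positive class; the equal-width binning and bin count are conventional choices, not requirements.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier: the |observed positive rate - mean
    predicted probability| gap per bin, weighted by bin occupancy."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

# Toy check: well-calibrated predictions yield a small ECE.
rng = np.random.default_rng(0)
p = rng.uniform(size=10_000)
y = rng.uniform(size=10_000) < p          # labels drawn at the stated rates
print(f"ECE = {expected_calibration_error(p, y):.4f}")   # roughly 0.01
```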
8. Ethical Considerations
This section names the ways the model could cause harm if misused or if it underperforms. It distinguishes between known risks and risks the developers consider plausible but have not measured. It should also identify any populations the model has been observed to underserve, even if that gap is judged acceptable for the deployment context.
9. Caveats and Recommendations
The closing section lists known limitations, conditions under which the model should not be used, and recommendations for downstream developers and operators (for example, “always combine with human review for transactions over $10,000”).
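A downstream operator might enforce such a recommendation directly in the serving path. The sketch below is hypothetical: the $10,000 threshold comes from the example above, and the 0.9 score cut-off is an assumed policy value, not from any card.

```python
REVIEW_THRESHOLD_USD = 10_000   # from the card's recommendation above
FRAUD_SCORE_CUTOFF = 0.9        # assumed policy value

def route(fraud_score: float, amount_usd: float) -> str:
    """Apply the card's caveat: high-value transactions always receive
    human review, regardless of how confident the model is."""
    if amount_usd > REVIEW_THRESHOLD_USD:
        return "human_review"
    return "decline" if fraud_score >= FRAUD_SCORE_CUTOFF else "approve"

print(route(fraud_score=0.2, amount_usd=25_000))   # human_review
print(route(fraud_score=0.95, amount_usd=120))     # decline
```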
Variations for Specific Contexts
Foundation models require larger and more elaborate cards because the model is intended for a wide range of unspecified downstream uses. The Hugging Face card structure includes additional sections on environmental impact, computational requirements, and safe-use guidance. The Llama 3 model card at https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/MODEL_CARD.md is an industry reference point.
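For Hugging Face-hosted models, the card’s YAML front matter can be produced programmatically. A minimal sketch using the huggingface_hub library’s card utilities follows; the repository id and field values are illustrative.

```python
from huggingface_hub import ModelCard, ModelCardData

# YAML front matter fields the Hub indexes; the values are illustrative.
card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    tags=["text-classification", "fraud-detection"],
)

# Render the stock card template with this metadata. The repository id
# is hypothetical; extra keyword arguments fill template placeholders.
card = ModelCard.from_template(
    card_data,
    model_id="example-org/retail-fraud-detector",
)
print(card.content[:200])   # "---\nlanguage: en\n..." then the markdown body
```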
Medical AI models require additional sections on clinical context, patient population characteristics, and the specific clinical workflow integration. The U.S. Food and Drug Administration draft guidance on Predetermined Change Control Plans for Machine Learning-enabled Device Software Functions at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/predetermined-change-control-plans-machine-learning-enabled-medical-devices indicates expectations consistent with model-card-style documentation.
Large language models typically include sections on prompt sensitivity, safety alignment methods, jailbreak resistance testing, and known failure modes such as hallucination, often reported as rates broken down by topic.
Reinforcement learning agents require sections on reward function definition, policy stability, safe-exploration boundaries, and the environments in which the agent has and has not been validated.
Operational Practices
A model card that lives in a wiki and is updated once is a marketing document. A card that lives in version control alongside the model code and updates with every model release is a governance artefact.
Version-locked cards. Each model version should produce a corresponding card version. Tooling in the Hugging Face ecosystem, for example, can generate card content from training metadata and refresh it on each release.
Canonical location. The card should live in a known place — typically the model registry — that is referenced in the audit trail (Module 1.21) and discoverable through the data catalogue (Module 1.22).
Pre-deployment gate. The card must pass review before the model is permitted to deploy. The review checks completeness, factual accuracy against training and evaluation logs, and sufficient subgroup analysis; a minimal completeness check is sketched after these practices.
Customer-facing variant. For models exposed externally, a customer-facing card distilled from the internal card communicates the relevant information without exposing operational secrets.
Multilingual versions. Where the model is deployed across markets, the card may need to be available in the relevant languages. The card can supply the “meaningful information” about automated decision-making that the General Data Protection Regulation (GDPR) requires under Articles 13–15 for processing covered by Article 22.
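The pre-deployment gate mentioned above can start as a script that scans the card for the nine canonical sections and the current model version. The file name, heading convention, and version check below are assumptions about a particular pipeline, not a standard.

```python
import re
import sys

REQUIRED_SECTIONS = [          # the nine canonical questions
    "Model Details", "Intended Use", "Factors", "Metrics",
    "Evaluation Data", "Training Data", "Quantitative Analyses",
    "Ethical Considerations", "Caveats and Recommendations",
]

def gate(card_path: str, model_version: str) -> list[str]:
    """Return gate failures; an empty list means the card may ship."""
    text = open(card_path, encoding="utf-8").read()
    failures = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if not re.search(rf"^#+.*{re.escape(name)}", text, re.MULTILINE)
    ]
    if model_version not in text:
        failures.append(f"card never mentions version {model_version}")
    return failures

if __name__ == "__main__":
    problems = gate("MODEL_CARD.md", model_version=sys.argv[1])
    for p in problems:
        print("GATE FAIL:", p)
    sys.exit(1 if problems else 0)
```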
Common Failure Modes
The first is card decay — the card was written for the first model version and never updated. Counter by tying card updates to the model release pipeline.
The second is aspirational metrics — the card reports performance on the development distribution rather than the deployment distribution. Counter by requiring evaluation data that resembles deployment data.
The third is hidden subgroup gaps — overall performance is strong but performance for specific subgroups is materially weaker. Counter with mandatory subgroup reporting and review of any subgroup where performance falls below an absolute floor; a sketch of such a floor check follows the fourth failure mode.
The fourth is lawyer-driven minimisation — the card includes only the legally required disclosures and avoids any voluntary transparency. Counter by treating cards as product documentation, not legal exhibits.
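An absolute-floor check can be a few lines over the card’s subgroup metrics. In the sketch below, the 0.85 recall floor and the metric values are invented for illustration; the metrics dictionary has the shape a helper like metrics_by_subgroup (sketched earlier) would produce.

```python
RECALL_FLOOR = 0.85   # assumed absolute floor, set by governance policy

# Subgroup metrics as they might appear in the card's quantitative analyses.
subgroup_metrics = {
    "overall":      {"recall": 0.93, "n": 50_000},
    "age_65+":      {"recall": 0.81, "n": 2_400},
    "age_under_65": {"recall": 0.94, "n": 47_600},
}

flagged = {
    group: m for group, m in subgroup_metrics.items()
    if group != "overall" and m["recall"] < RECALL_FLOOR
}
for group, m in flagged.items():
    print(f"review required: {group} recall={m['recall']:.2f} (n={m['n']:,})")
# review required: age_65+ recall=0.81 (n=2,400)
```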
Looking Forward
The next article in Module 1.23 turns to datasheets for datasets — the upstream cousin of the model card that documents the data itself with comparable structure.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.