This article presents the canonical datasheet structure, the evidentiary value of completed datasheets, and the operational practices that make datasheets a living artefact rather than a one-time documentation deliverable.
Why Datasheets Matter Independently of Models
Three properties make the datasheet useful independent of any specific model.
First, upstream reuse. A given dataset typically supports many models over its life. Documenting the dataset once, comprehensively, prevents the same investigation from being re-run by every team that touches it.
Second, bias and representation reasoning. Most fairness problems originate in the data. A datasheet that documents which populations are represented, which are under-represented, and which are absent altogether allows downstream consumers to reason about fitness before training begins, not after deployment fails.
Third, legal and regulatory defensibility. The General Data Protection Regulation (GDPR) Article 30 requires controllers to maintain records of processing activities. Sectoral rules — the U.S. Equal Credit Opportunity Act in lending, the U.S. Fair Housing Act in housing — expose organisations to liability when data used in automated decisions encodes prohibited biases, and demonstrable evidence about that data is the practical defence. The datasheet is the natural home for that evidence.
The European Union AI Act Annex IV at https://artificialintelligenceact.eu/annex/4/ explicitly requires high-risk AI providers to document the datasets used, including their composition and provenance — a requirement that maps directly to the datasheet structure.
The Canonical Structure
The Gebru et al. proposal organises datasheet questions into seven sections. Mature programs adapt the questions to local context but preserve the structure.
1. Motivation
Why was the dataset created? Who created it? Who funded it? The motivation section establishes the purposes the dataset was designed for, distinguishing those from the purposes it is now being used for. Mismatch between original purpose and current use is one of the most reliable predictors of dataset misuse.
2. Composition
What is each instance? How many instances are there? Does the dataset contain all possible instances, or a sample? If a sample, how was the sampling done? What labels or targets does each instance carry, and how were they produced (human annotation, automated extraction, derivation from other fields)? Are there explicit subgroups, and how are they distributed? Are there missing data, and is the missingness random or systematic?
The composition section is where the bulk of documentation effort lives. The Algorithmic Justice League's Gender Shades study and follow-up research at https://www.ajl.org/gender-shades show how composition imbalances propagate into model performance disparities — and how composition documentation could have surfaced the issue at design time.
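Much of the composition section can be generated rather than hand-written. The sketch below is a minimal example, assuming the dataset is loadable as a pandas DataFrame; the column names are hypothetical placeholders.

```python
# A minimal composition-profiling sketch, assuming a pandas DataFrame.
# Column names ("age_band", "region", "label") are hypothetical placeholders.
import pandas as pd

def profile_composition(df: pd.DataFrame, subgroup_cols: list[str]) -> dict:
    """Summarise instance counts, per-column missingness, and subgroup distributions."""
    return {
        "n_instances": len(df),
        # Fraction of missing values per column; systematic missingness needs prose, too.
        "missingness": df.isna().mean().round(4).to_dict(),
        # Relative frequency of each subgroup, keeping NaN as an explicit category.
        "subgroups": {
            col: df[col].value_counts(normalize=True, dropna=False).round(4).to_dict()
            for col in subgroup_cols
        },
    }

df = pd.DataFrame({
    "age_band": ["18-29", "30-44", "30-44", None, "45-64"],
    "region": ["EU", "EU", "US", "US", "EU"],
    "label": [1, 0, 1, 1, 0],
})
print(profile_composition(df, subgroup_cols=["age_band", "region"]))
```

Numbers generated this way can be pasted into the composition section and regenerated on every dataset update, which also sets up the drift detection discussed under Common Failure Modes.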
3. Collection Process
How was the data gathered? Was it observational (collected from existing systems), elicited (surveys, interviews), or generated (synthetic, simulated)? Over what time period? In what geographies? What quality controls were applied during collection?
For data sourced from third parties, the section should document the chain — the immediate source, the upstream source, and any intermediaries. The Data Provenance Initiative has published research at https://www.dataprovenance.org/ on the limited transparency of training data for popular foundation models, illustrating what disclosure looks like when it is done thoroughly.
4. Preprocessing, Cleaning, Labelling
What transformations have been applied to the raw data on its way into the documented dataset? Are the raw data also retained, and accessible? Was labelling done by humans, by other models, or by automated rules? What was the inter-rater agreement among labellers? Were specific records removed, and if so, why?
This section is critical because preprocessing decisions are often where invisible bias enters. A “balanced” dataset might be balanced across demographic groups yet unbalanced in the prevalence of edge cases within each group.
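Inter-rater agreement is one of the few datasheet answers that is a number rather than prose. Cohen's kappa, which corrects observed agreement for the agreement expected by chance, is a common choice. A minimal sketch using scikit-learn, with hypothetical annotator labels:

```python
# A minimal inter-rater agreement sketch using Cohen's kappa via scikit-learn.
# The two annotators' label lists are hypothetical; in practice they would be
# the labels each annotator assigned to the same set of instances.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham",  "spam", "ham", "spam"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # 1.0 = perfect agreement, 0.0 = chance level
```

Recording the agreement figure alongside the labelling protocol lets downstream users judge label quality without re-auditing the annotation effort.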
5. Uses
What purposes has the dataset already been used for? What purposes might it reasonably be used for? What purposes should it not be used for? The “should not” subsection requires honesty: it asks the dataset creator to imagine the misuses they can foresee and warn downstream users explicitly.
6. Distribution
Will the dataset be distributed externally? Under what licence? Are there third-party rights (intellectual property, privacy interests of subjects) that affect redistribution? Is there a recommended citation format?
The licence question is often more complex than it appears. Many image datasets aggregate content with mixed licences; many text datasets include material whose copyright status is contested. The Linux Foundation’s SPDX licence identifier list at https://spdx.org/licenses/ provides standardised identifiers that downstream consumers can reason about.
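Recording the licence as an SPDX identifier makes it mechanically checkable. The sketch below is an illustration rather than a recommendation: it assumes SPDX's machine-readable index at https://spdx.org/licenses/licenses.json and its licenseId field, both worth verifying against the current SPDX documentation, and a pinned local copy of the list is preferable in CI.

```python
# A minimal licence-validation sketch. The JSON endpoint and its field names
# are assumptions to verify against the SPDX site; pin a local copy for CI.
import requests

SPDX_LICENSES_JSON = "https://spdx.org/licenses/licenses.json"

def is_valid_spdx_id(licence_id: str) -> bool:
    data = requests.get(SPDX_LICENSES_JSON, timeout=10).json()
    known_ids = {entry["licenseId"] for entry in data["licenses"]}
    return licence_id in known_ids

print(is_valid_spdx_id("CC-BY-4.0"))         # True: a real SPDX identifier
print(is_valid_spdx_id("Creative Commons"))  # False: prose, not an identifier
```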
7. Maintenance
Who maintains the dataset? Is there a planned update cadence? How are errors reported and corrected? Will the dataset be retained indefinitely, or retired? The maintenance section is what distinguishes a snapshot from an asset.
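Taken together, the seven sections map naturally onto a machine-readable skeleton, which also sets up the datasheet-as-code practice described under Operational Practices below. A minimal sketch, assuming a Python dataclass with hypothetical field names rather than any published standard schema:

```python
# A minimal machine-readable datasheet skeleton mirroring the seven sections.
# Field names are illustrative assumptions, not a published standard schema.
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""
    version: str = "0.1.0"

    def incomplete_sections(self, min_words: int = 40) -> list[str]:
        """Flag sections that fall below a minimum-length expectation."""
        sections = {k: v for k, v in asdict(self).items() if k != "version"}
        return [name for name, text in sections.items()
                if len(text.split()) < min_words]

ds = Datasheet(motivation="Collected to evaluate credit-risk models ...")
print(ds.incomplete_sections())  # lists every thin section by name
```

The incomplete_sections check anticipates the narrative thinness failure mode discussed later: a section that satisfies the form in a single sentence is flagged by name.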
Variations for Specific Contexts
Foundation-model training corpora require expanded sections on the web-crawl methodology, the deduplication strategy, the safety filtering applied, and the inclusion or exclusion of specific source domains. The C4 dataset documentation and the RedPajama dataset documentation are useful templates for organisations building their own foundation-model training pipelines.
Medical and clinical datasets require sections on Institutional Review Board (IRB) approval, patient consent provenance, de-identification methodology, and any clinical conditions or sites that are over- or under-represented. The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule de-identification guidance from the U.S. Department of Health and Human Services at https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html shapes what the datasheet must cover.
Sensitive demographic datasets require explicit treatment of how protected attributes were captured (self-reported, inferred, third-party tagged) and how the data steward handles requests to update or correct them.
Synthetic datasets require an additional section linking back to the seed data, the generator, the validation results, and any privacy parameters — as discussed in the synthetic data article in Module 1.22.
Operational Practices
A datasheet, like a model card, gains its value from being kept current and accessible.
Datasheet-as-code. The datasheet should live in version control alongside the dataset definition (the dbt model, the Spark job, the data contract). Updates to the dataset propose updates to the datasheet in the same pull request; a CI sketch enforcing this appears after these practices.
Catalogue integration. The data catalogue (Module 1.22) should expose the datasheet as a first-class field on every governed dataset.
Pre-training gate. Programs with mature governance require datasheet sign-off before any new dataset is used to train a production-bound model.
Subject-rights integration. The datasheet should reference the operational mechanism for handling data subject access, correction, and erasure requests, so that downstream users can confirm the dataset they consume is compatible with the obligations they take on.
Public-facing variant. For widely shared datasets, a public datasheet variant communicates the relevant information without exposing operational secrets.
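The CI sketch referenced under datasheet-as-code might look like the following. The paths, the diff base, and the single-file datasheet layout are all assumptions to adapt to the repository; the point is only that the gate is mechanical.

```python
# A minimal CI-gate sketch enforcing datasheet-as-code: if files under the
# dataset definition directory changed, the datasheet must change too.
# Paths and the diff base are hypothetical and repository-specific.
import subprocess
import sys

DATASET_PATHS = ("datasets/loans/",)            # hypothetical dbt/Spark definitions
DATASHEET_PATH = "datasets/loans/DATASHEET.md"  # hypothetical datasheet location

def changed_files(base: str = "origin/main") -> list[str]:
    """List files changed between the base branch and the current commit."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

files = changed_files()
dataset_changed = any(f.startswith(DATASET_PATHS) for f in files)
datasheet_changed = DATASHEET_PATH in files

if dataset_changed and not datasheet_changed:
    sys.exit("Dataset definition changed without a datasheet update.")
print("Datasheet gate passed.")
```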
Common Failure Modes
The first is narrative thinness — sections completed with a single sentence that satisfies the form but communicates nothing. Counter with templates that include example sentences and minimum-length expectations.
The second is outdated composition — the dataset has been growing for two years but the composition section reflects the original snapshot. Counter with automated metrics that surface composition drift and require datasheet updates when drift exceeds defined thresholds; a drift-check sketch follows these failure modes.
The third is legal-only motivation — the motivation section reads like a contract recital. Counter by requiring the answer to “what real-world question does this dataset help answer?” in plain language.
The fourth is missing maintenance — the dataset has no named owner. Counter with quarterly ownership re-attestation, the same discipline applied to lineage in Module 1.22.
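The drift check called for by the second failure mode can be as simple as a Population Stability Index over documented versus current subgroup counts. A minimal sketch, assuming hypothetical counts; the 0.2 threshold is a common rule of thumb, not a standard.

```python
# A minimal composition-drift sketch using the Population Stability Index:
# PSI = sum over bins of (p_current - p_documented) * ln(p_current / p_documented).
import numpy as np

def psi(documented: np.ndarray, current: np.ndarray, eps: float = 1e-6) -> float:
    """PSI between the documented and current subgroup distributions."""
    p = documented / documented.sum() + eps  # eps guards against empty bins
    q = current / current.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

# Hypothetical subgroup counts: at documentation time vs. today.
documented = np.array([5000, 3000, 2000])   # e.g. regions EU / US / APAC
current    = np.array([12000, 2000, 1000])

drift = psi(documented, current)
print(f"PSI = {drift:.3f}")
if drift > 0.2:  # a widely used rule of thumb, not a universal threshold
    print("Composition drift exceeds threshold: datasheet update required.")
```

Run on a schedule against the live dataset, a check like this turns the composition section from a decaying snapshot into a monitored claim.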
Looking Forward
The next article in Module 1.23 examines knowledge management — the broader infrastructure that holds model cards, datasheets, decision records, and other documented artefacts together. Documentation that exists but cannot be found is documentation that does not exist.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.