This article explains what belongs in an AI decision audit trail, the storage and retention patterns that make it tamper-evident, and the integration points that make it useful in production.
What Distinguishes an AI Audit Trail
Traditional Information Technology (IT) audit logs record events: a user logged in, a record was updated. AI audit trails must record reasoning: what input arrived, what context was retrieved, what prompt was constructed, what model was invoked, what output was produced, what confidence was attached, what downstream decision followed.
The European Union AI Act Article 12 at https://artificialintelligenceact.eu/article/12/ codifies this expectation explicitly for high-risk AI systems: providers must ensure automatic logging of events sufficient to identify situations giving rise to risk, monitor operation, and enable post-market monitoring. The Act further requires that these logs be kept for a defined minimum period.
The Office of the Comptroller of the Currency Bulletin 2021-39 at https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-39.html applies a similar standard to AI in banking: model decisions must be reproducible from the logged context.
What Belongs in the Trail
A defensible AI decision audit trail captures the following per decision (a schema sketch follows the list):
- Decision identifier — a unique, immutable identifier that travels with the decision through downstream systems.
- Timestamp with timezone, ideally to millisecond precision.
- Subject identifier — the person, account, transaction, or asset the decision concerns.
- Inputs — the raw inputs to the decision, including upstream system data and any retrieval results.
- Model identifier and version — the exact model artefact, including any fine-tuning lineage. The Model Card framework at https://modelcards.withgoogle.com/about provides a useful schema.
- Configuration and parameters — temperature, top-p, max tokens, retrieval thresholds.
- Prompt and context for Generative AI — the full prompt as constructed.
- Output — the model’s raw response, plus any post-processing.
- Confidence and uncertainty signals — model-reported probability, ensemble disagreement.
- Downstream action — the business decision taken on the basis of the output.
- Human review state — whether a human reviewed, approved, or overrode the decision.
- Outcome — when ultimately observable, the actual outcome for retrospective evaluation.
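As a minimal sketch, the record below models these fields as a single immutable structure. The field names and types are illustrative assumptions rather than a prescribed standard; most teams will extend or rename them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional
import uuid

@dataclass(frozen=True)
class AIDecisionRecord:
    """One audit-trail entry per AI decision (illustrative field names)."""
    decision_id: str                # unique, immutable identifier
    timestamp: datetime             # timezone-aware, millisecond precision
    subject_id: str                 # person, account, transaction, or asset
    inputs: dict[str, Any]          # raw inputs, including retrieval results
    model_id: str                   # exact model artefact, version, fine-tuning lineage
    parameters: dict[str, Any]      # temperature, top-p, max tokens, retrieval thresholds
    prompt: Optional[str]           # full constructed prompt (generative AI)
    output: str                     # raw model response plus post-processing notes
    confidence: Optional[float]     # model-reported probability or ensemble signal
    downstream_action: str          # business decision taken on the output
    human_review_state: str         # e.g. "none", "reviewed", "overridden"
    outcome: Optional[str] = None   # observed outcome, filled in retrospectively

def new_decision_id() -> str:
    """Generate a globally unique decision identifier."""
    return str(uuid.uuid4())
```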
Storage and Tamper-Evidence
Audit trails are only as credible as the assurance that they have not been altered. Three patterns dominate.
The first is append-only logging to a write-once-read-many (WORM) store. Cloud providers offer this natively: Amazon Web Services S3 Object Lock, Azure Blob Storage immutable storage, and Google Cloud Storage Bucket Lock at https://cloud.google.com/storage/docs/bucket-lock all provide retention enforcement.
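A minimal sketch of the first pattern, assuming an S3 bucket created with Object Lock enabled; the bucket name, key layout, and one-year retention period are placeholder assumptions.

```python
# Write one audit record as an object that cannot be deleted or overwritten
# before its retention date. Requires a bucket with Object Lock enabled.
from datetime import datetime, timedelta, timezone
import json
import boto3

s3 = boto3.client("s3")

def write_immutable_record(record: dict, bucket: str = "ai-audit-trail") -> None:
    key = f"decisions/{record['decision_id']}.json"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(record, default=str).encode("utf-8"),
        ObjectLockMode="COMPLIANCE",  # retention cannot be shortened, even by admins
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )
```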
The second is cryptographic chaining — each log entry includes a hash of the prior entry, creating a chain that exposes any post-hoc deletion or modification. The Linux Foundation’s sigstore project at https://www.sigstore.dev/ implements this pattern.
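The chaining itself is simple to sketch; the entry layout and genesis value below are illustrative assumptions.

```python
# Each entry stores the SHA-256 hash of its predecessor, so editing or
# deleting any earlier record breaks the chain.
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def chain_entry(record: dict, prev_hash: str) -> dict:
    """Return the record extended with a hash linking it to its predecessor."""
    payload = json.dumps(record, sort_keys=True, default=str)
    entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    return {**record, "prev_hash": prev_hash, "entry_hash": entry_hash}

def verify_chain(entries: list[dict]) -> bool:
    """Recompute every link; any tampering changes a hash and fails the check."""
    prev = GENESIS
    for e in entries:
        body = {k: v for k, v in e.items() if k not in ("prev_hash", "entry_hash")}
        payload = json.dumps(body, sort_keys=True, default=str)
        expected = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        if e["prev_hash"] != prev or e["entry_hash"] != expected:
            return False
        prev = e["entry_hash"]
    return True
```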
The third is independent witness — periodically publishing a Merkle root of recent log entries to an external system so that any later tampering would be detectable.
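The core computation of the witness pattern is a Merkle root over a batch of entry hashes; the sketch below assumes hex-encoded SHA-256 leaves and leaves the publication target (a ticketing system, a public transparency log, another business unit) to the reader.

```python
import hashlib

def merkle_root(leaf_hashes: list[str]) -> str:
    """Reduce a list of hex-encoded leaf hashes to a single Merkle root."""
    if not leaf_hashes:
        return hashlib.sha256(b"").hexdigest()
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```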
Mature programs combine all three.
Retention
Retention periods should be set by data type and regulation. The EU AI Act sets a minimum six-month retention for high-risk system logs. The Health Insurance Portability and Accountability Act (HIPAA) in the United States can require six years. The Basel Committee on Banking Supervision principles for risk data aggregation (BCBS 239) at https://www.bis.org/publ/bcbs239.htm imply multi-year retention for credit and risk decisions.
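One way to make these periods operational is a retention policy keyed by decision domain. The sketch below simply mirrors the figures cited above; the domain names and the exact BCBS 239 period are assumptions that a real policy would pin down with counsel.

```python
from datetime import datetime, timedelta

# Illustrative retention periods per decision domain; not legal advice.
RETENTION_POLICY = {
    "eu_ai_act_high_risk": timedelta(days=183),      # at least six months
    "hipaa_covered": timedelta(days=6 * 365),        # six years
    "credit_risk_bcbs239": timedelta(days=7 * 365),  # multi-year; exact period set by policy
}

def retain_until(created: datetime, domain: str) -> datetime:
    """Earliest date an audit record in this domain may be deleted."""
    return created + RETENTION_POLICY[domain]
```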
Performance and Cost
Decision-level logging at scale generates large volumes. A system processing 10 million decisions per day with 5 KB of context per decision produces 50 GB per day, or roughly 18 TB per year.
Common patterns include tiered storage (hot for recent, cold for older), selective sampling (full logging for high-risk, statistical sampling for low-risk), reference logging (storing pointers rather than embedded content), and compression with columnar formats.
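Selective sampling, for instance, can be a one-line routing decision at write time; the risk tiers and the five-percent rate below are illustrative assumptions.

```python
import random

LOW_RISK_SAMPLE_RATE = 0.05  # illustrative; tune per risk appetite

def should_log_full_context(risk_tier: str) -> bool:
    """Full logging for high-risk decisions, statistical sampling for the rest."""
    if risk_tier == "high":
        return True
    return random.random() < LOW_RISK_SAMPLE_RATE
```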
The OpenTelemetry specification at https://opentelemetry.io/docs/specs/otel/ provides patterns for combining traces, metrics, and logs that translate well to AI audit trails.
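A minimal sketch of that translation, using the OpenTelemetry Python API to attach decision identifiers to a span so audit records can be joined with ordinary observability data; the attribute names here are illustrative, not an official semantic convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("ai.audit")

def record_decision_span(decision_id: str, model_id: str, confidence: float) -> None:
    """Emit a span carrying the identifiers needed to join traces with the audit trail."""
    with tracer.start_as_current_span("ai.decision") as span:
        span.set_attribute("ai.decision.id", decision_id)
        span.set_attribute("ai.model.id", model_id)
        span.set_attribute("ai.decision.confidence", confidence)
```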
Integration With Investigations
Investigative queries fall into three classes (a reconstruction sketch follows the list):
- Single-subject reconstruction: reproduce every AI decision affecting a specific customer over a defined window.
- Pattern detection: identify decisions sharing unusual characteristics, such as high confidence paired with a bad outcome.
- Cohort analysis: compare decisions across protected characteristics, geographic regions, or model versions.
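The first class is the simplest to sketch. The function below assumes records shaped like the schema earlier in this article, with the storage backend abstracted to an iterable of dictionaries.

```python
from datetime import datetime
from typing import Iterable

def reconstruct_subject(
    records: Iterable[dict], subject_id: str, start: datetime, end: datetime
) -> list[dict]:
    """Return every decision affecting one subject in a window, ordered for replay."""
    hits = [
        r for r in records
        if r["subject_id"] == subject_id and start <= r["timestamp"] <= end
    ]
    return sorted(hits, key=lambda r: r["timestamp"])
```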
Mature programs invest in unified observability platforms — Splunk, Datadog, Grafana with Loki, or custom data warehouses — that combine audit trails with model performance metrics.
Privacy and Access Control
The audit trail itself is sensitive data. Common controls include just-in-time access for investigators, field-level encryption for PII, audit-of-the-audit-trail (every access is logged), and automated redaction in lower-trust environments.
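Field-level encryption, for example, can be applied as records are written. The sketch below uses Fernet from the cryptography package, with key handling deliberately simplified; a real deployment would hold the key in a KMS or HSM and grant investigators just-in-time access to it.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in production: fetched from a key manager, never generated inline
fernet = Fernet(key)

def protect_pii(record: dict, pii_fields: tuple[str, ...] = ("subject_id",)) -> dict:
    """Encrypt designated fields before the record reaches the audit store."""
    out = dict(record)
    for f in pii_fields:
        if f in out:
            out[f] = fernet.encrypt(str(out[f]).encode("utf-8")).decode("ascii")
    return out
```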
The European Data Protection Board guidance on data protection by design and by default at https://edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-42019-article-25-data-protection-design_en applies directly.
Common Failure Modes
The first is log fatigue: capturing so much data that no one reviews it. Counter with sampling-based proactive reviews and clear runbooks.
The second is clock skew: timestamps from different systems disagree, making sequence reconstruction impossible. Counter with synchronised time sources.
The third is partial logging: capturing the model output but not the prompt, or the prompt but not the retrieved context. Counter with mandatory schema validation.
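A blunt but effective guard is to reject records that fail validation before they are written; the required-field list below mirrors the schema sketched earlier and is illustrative.

```python
REQUIRED_FIELDS = (
    "decision_id", "timestamp", "subject_id", "inputs",
    "model_id", "prompt", "output", "downstream_action",
)

def validate_record(record: dict) -> None:
    """Raise if any mandatory field is absent, so partial records never enter the trail."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        raise ValueError(f"Audit record rejected; missing fields: {missing}")
```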
The fourth is vendor opacity: third-party AI services that do not expose the logging hooks needed. Counter through procurement: require log export commitments in vendor contracts.
Looking Forward
A robust audit trail closes the loop opened by the heat map, risk acceptance, and exception management workflows. The next module turns to data lineage and provenance — the upstream cousin of decision-level logging.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.