AITF M1.9-Art01 v1.0 Reviewed 2026-04-06 Open Access

The Carbon Footprint of AI: Training, Inference, and Hidden Cost Drivers


Article 1 of 15

This article opens Module 1.9 by establishing a shared vocabulary for AI emissions accounting. It defines the four emission categories, surveys the orders of magnitude involved, and identifies the cost drivers that frequently surprise the program lead who is doing the accounting for the first time.

The four emission categories

An AI system generates emissions in four categories.

Training emissions are the energy consumed by the accelerator cluster during the training run, multiplied by the grid emission factor at the data center where the run occurred. Training is the most visible category because it is concentrated in time — a few weeks of intense compute that produces a single, measurable energy draw on the cluster’s power meters. The Schwartz et al. “Green AI” paper in Communications of the ACM drew attention to the rapid growth of training compute, citing estimates that a single large-language-model training run could emit the equivalent of multiple cars’ lifetime emissions, and argued that the trend toward larger models was producing super-linear growth in training emissions.1
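
The definition above reduces to a simple product: cluster energy times grid emission factor. A minimal sketch, assuming a hypothetical 10 MW cluster, a three-week run, and a grid factor of 0.4 kgCO2e/kWh (all figures are illustrative, not measured values):

```python
def training_emissions_tco2e(avg_power_kw: float, hours: float,
                             grid_factor_kgco2e_per_kwh: float) -> float:
    """Energy drawn by the cluster, times the grid emission factor,
    converted from kilograms to metric tons."""
    energy_kwh = avg_power_kw * hours
    return energy_kwh * grid_factor_kgco2e_per_kwh / 1000.0

# Illustrative: a 10 MW cluster running for 3 weeks on a 0.4 kgCO2e/kWh grid.
print(training_emissions_tco2e(10_000, 21 * 24, 0.4))  # ~2,000 tCO2e
```

A result in the low thousands of tCO2e is consistent with the frontier-scale figures discussed below.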

Inference emissions are the energy consumed by the accelerator cluster (or in some cases, the CPU cluster) during every prediction, classification, or generation that the deployed model serves. Inference is the most underestimated category because it is distributed in time — millions of small queries per day, each of which is individually small but which collectively exceed the training emissions within months for any high-traffic system. McKinsey’s State of AI survey has documented the order-of-magnitude growth in enterprise inference workloads as generative AI moved from pilot to production.2

Data and infrastructure emissions are the energy consumed by the data preparation pipelines, the storage systems that hold training and inference data, the networking that moves data into and out of the cluster, and the cooling and power-conversion overhead that the data center adds to every kilowatt-hour delivered to the accelerators. The Power Usage Effectiveness (PUE) ratio of a data center — typically between 1.1 and 1.6 — is the multiplier that turns IT-equipment energy into facility-energy and then into emissions.

Embodied emissions are the emissions produced during the manufacturing, transport, and end-of-life processing of the accelerator hardware, the servers, the networking equipment, and the data center facility itself. Embodied carbon is amortized over the operating life of the equipment but is increasingly recognized as a material category as the operational emissions decline through renewable-energy procurement.

The orders of magnitude

The absolute numbers matter because they determine which category an organization should measure first.

For a frontier-scale training run — a model with hundreds of billions of parameters trained on trillions of tokens — the training-emissions figure is in the hundreds to low thousands of metric tons of carbon dioxide equivalent (tCO2e). For a mid-scale model — tens of billions of parameters, hundreds of billions of tokens — the figure is in the tens of tCO2e. For a fine-tuning run on a pre-trained model, the figure is typically two or three orders of magnitude smaller than the original pre-training run.

For inference, the per-query emissions are typically measured in grams of CO2e — but a generative-AI service handling tens of millions of queries per day produces hundreds of tCO2e per year, which exceeds the training emissions of the underlying model within the first year of deployment. The International Energy Agency (IEA) Electricity 2024 report projected that data-center electricity consumption — driven heavily by AI workloads — would more than double between 2022 and 2026, with AI inference being the dominant driver of the growth.3
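
The per-query arithmetic is worth making concrete. A hedged sketch, assuming an illustrative 0.1 g CO2e per query and 10 million queries per day (both hypothetical figures, chosen only to show the scaling):

```python
def annual_inference_tco2e(g_per_query: float, queries_per_day: float) -> float:
    """Per-query grams times daily volume, annualized and converted to metric tons."""
    return g_per_query * queries_per_day * 365 / 1_000_000.0

# Illustrative: 0.1 g per query at 10 million queries/day.
print(annual_inference_tco2e(0.1, 10_000_000))  # ~365 tCO2e per year
```

Even fractions of a gram per query land in the hundreds of tCO2e per year at this traffic level.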

For embodied carbon, the per-accelerator manufacturing footprint is in the hundreds of kilograms of CO2e for a high-end Graphics Processing Unit (GPU), but a single training cluster of ten thousand GPUs carries embodied emissions in the thousands of tCO2e — comparable to the operational emissions of the same cluster over several years.
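
The cluster-level embodied arithmetic, together with the amortization described earlier, can be sketched as follows; the 300 kgCO2e per GPU and the four-year operating life are assumed figures for illustration:

```python
def cluster_embodied_tco2e(per_gpu_kgco2e: float, gpu_count: int) -> float:
    """Total embodied carbon of the cluster's accelerators, in metric tons."""
    return per_gpu_kgco2e * gpu_count / 1000.0

def annual_amortized_tco2e(embodied_tco2e: float, life_years: float) -> float:
    """Embodied carbon spread evenly over the equipment's operating life."""
    return embodied_tco2e / life_years

total = cluster_embodied_tco2e(300, 10_000)   # assumed 300 kg per GPU
print(total)                                  # 3,000 tCO2e for the cluster
print(annual_amortized_tco2e(total, 4))       # 750 tCO2e per operating year
```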

The hidden cost drivers

The headline categories conceal a set of cost drivers that frequently surprise the practitioner.

The first hidden driver is idle and warm-pool capacity. The accelerator cluster that serves inference cannot be sized only for the average load; it must be sized for the peak. The provisioned-but-idle capacity consumes a fraction of the peak energy even when no queries are being served, and the cumulative idle-energy emissions can be 30% to 50% of the active-inference emissions for a typical service.

The second hidden driver is data movement. Moving terabytes of training data from object storage into the accelerator cluster, and moving inference responses across continents to satisfy data-residency requirements, consumes meaningful energy in the network fabric and at the cross-region transit points. The Green Software Foundation has documented that data-movement energy is frequently 5% to 15% of compute energy for distributed AI workloads.4

The third hidden driver is failed and re-run experiments. The training run that produces the production model is the visible one; the dozens of failed runs that preceded it are not. A model-development program that does not track the cumulative training emissions of all runs — successful and failed — is systematically under-counting.

The fourth hidden driver is retraining cadence. Models that are retrained weekly or monthly on rolling data windows accumulate training emissions at a rate that quickly exceeds the single, original training run. Retraining cadence is a governance choice, not a technical given.

The fifth hidden driver is the embodied carbon refresh cycle. Accelerator hardware is replaced every three to five years. A program that increases its accelerator footprint by 50% per year is producing embodied-carbon emissions at a rate that the operational-emissions accounting does not capture.
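
The compounding effect of a growing fleet can be illustrated with a short sketch; the initial fleet size, the 50% growth rate, and the 300 kgCO2e per accelerator are all assumptions, and replacement purchases from the refresh cycle itself are omitted for simplicity:

```python
def annual_embodied_additions(initial_fleet: int, growth: float,
                              per_unit_kgco2e: float, years: int) -> list[float]:
    """Embodied tCO2e added each year by growth purchases alone
    (excludes like-for-like refresh replacements)."""
    fleet, additions = initial_fleet, []
    for _ in range(years):
        new_units = int(fleet * growth)
        fleet += new_units
        additions.append(new_units * per_unit_kgco2e / 1000.0)
    return additions

# Illustrative: 1,000 accelerators growing 50% per year.
print(annual_embodied_additions(1_000, 0.5, 300.0, 3))
```

The additions grow 50% per year as well, which is precisely the stream that operational-emissions accounting does not capture.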

Maturity Indicators

A foundational practitioner reading the COMPEL D19 maturity rubric will recognize that the first two levels of maturity correspond directly to whether the organization has identified and measured these four categories.5 At Level 1 (Foundational), no measurement of energy consumed by AI training or inference exists. At Level 2 (Developing), at least the largest training runs are measured and a first carbon-footprint estimate has been produced. At Level 3 (Defined), all production AI systems have per-system energy and CO2e metrics tracked continuously. The threshold for Level 3 is the moment at which the organization stops doing one-off audits and starts doing continuous measurement — typically by integrating carbon-tracking instrumentation directly into the Machine Learning Operations (MLOps) platform.

The Stanford Foundation Model Transparency Index (FMTI) has begun to measure providers’ disclosure of training-compute, training-energy, and training-emissions figures. The FMTI compute-layer scores are published publicly and have become a de facto benchmark that procurement teams use when comparing foundation-model vendors.6 An organization that publishes its own AI carbon-footprint figures with comparable transparency is already demonstrating Level 4 (Advanced) maturity.

Practical Application

A foundational practitioner who is asked to scope an AI carbon-accounting program for the first time should produce four artifacts.

First, a system inventory that lists every production AI system, every training pipeline, and every fine-tuning workflow currently in operation. The inventory is the denominator against which all subsequent measurement will be reported.

Second, a measurement plan that identifies, for each system in the inventory, which of the four categories will be measured, what instrumentation will produce the measurement, and what cadence the measurement will be reported on. The Greenhouse Gas Protocol Scope 2 and Scope 3 categories provide the accounting boundary that the measurement plan must satisfy.7

Third, a first-cut estimate that uses the cloud provider’s sustainability dashboard, the public emission factors of the relevant grids, and conservative assumptions about idle and warm-pool capacity to produce a baseline carbon-footprint figure. The first-cut estimate will be wrong — typically under-counting by a factor of two — but it establishes the order of magnitude that the program is dealing with.
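
One way to structure the first-cut estimate is a simple roll-up across the four categories, with the idle/warm-pool uplift applied to active inference; every input value below is an illustrative assumption that a real program would replace with dashboard exports and measured figures:

```python
def first_cut_tco2e(training: float, inference_active: float,
                    data_infra: float, embodied_amortized: float,
                    idle_fraction: float = 0.4) -> float:
    """Baseline roll-up in tCO2e. Idle capacity is modeled conservatively
    as a 40% uplift on active-inference emissions (30-50% range)."""
    inference = inference_active * (1 + idle_fraction)
    return training + inference + data_infra + embodied_amortized

# All inputs are illustrative placeholder figures.
print(first_cut_tco2e(training=50.0, inference_active=365.0,
                      data_infra=40.0, embodied_amortized=750.0))
```

Even a roll-up this crude makes the relative weight of the categories visible, which is the point of the first-cut estimate.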

Fourth, a gap analysis that identifies which of the four categories the organization currently has the weakest visibility into, and what investment is required to bring measurement of that category to parity with the others. The gap analysis is the input to the Year-1 measurement-program roadmap.

The Organisation for Economic Co-operation and Development (OECD) AI Principles include sustainability as a value-based principle that AI actors should respect across the AI lifecycle, providing a high-level framing that the carbon-accounting program operationalizes.8

Summary

The carbon footprint of AI spans training emissions, inference emissions, data and infrastructure emissions, and embodied emissions. The orders of magnitude are large — frontier training runs in hundreds to thousands of tCO2e; high-traffic inference services in hundreds of tCO2e per year; per-cluster embodied carbon in thousands of tCO2e amortized over a refresh cycle. The hidden cost drivers — idle capacity, data movement, failed runs, retraining cadence, embodied refresh — frequently double the headline figures. The COMPEL D19 maturity rubric uses the existence and continuity of measurement across these categories as the gating criterion between Levels 1, 2, and 3. The foundational practitioner builds the inventory, the measurement plan, the first-cut estimate, and the gap analysis as the four artifacts that bootstrap the program. The next article in this module, M1.9 Measuring AI Energy Use: Methodologies, Tools, and Reporting Standards, develops the energy-measurement layer that the carbon-footprint accounting depends on.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. “Green AI.” Communications of the ACM, December 2020. https://cacm.acm.org/research/green-ai/ — accessed 2026-04-26.

  2. McKinsey & Company, “The state of AI.” McKinsey Global Survey. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — accessed 2026-04-26.

  3. International Energy Agency, “Electricity 2024.” IEA, January 2024. https://www.iea.org/reports/electricity-2024 — accessed 2026-04-26.

  4. Green Software Foundation, “Software Carbon Intensity Specification.” https://greensoftware.foundation/ — accessed 2026-04-26.

  5. COMPEL Domain D19 (AI Environmental Sustainability) maturity rubric, Levels 1 through 5. See shared/data/compelDomains.ts.

  6. Stanford Center for Research on Foundation Models (CRFM), “Foundation Model Transparency Index.” Stanford HAI. https://crfm.stanford.edu/fmti/ — accessed 2026-04-26.

  7. Greenhouse Gas Protocol, “Corporate Standard” and “Scope 3 Standard.” World Resources Institute and World Business Council for Sustainable Development. https://ghgprotocol.org/ — accessed 2026-04-26.

  8. Organisation for Economic Co-operation and Development, “OECD AI Principles.” https://oecd.ai/en/ai-principles — accessed 2026-04-26.