This article surveys the accelerator categories, their energy-efficiency trade-offs, and the maturity-program implications of the hardware-efficiency layer.
The accelerator categories
General-purpose GPUs are the dominant training hardware and the most flexible deployment platform. Modern AI-class GPUs (NVIDIA H100, H200, B200; AMD MI300; comparable parts from other vendors) include specialized matrix-multiply units alongside general-purpose compute, support a wide range of model architectures and precisions, and have mature software ecosystems. The flexibility comes at an efficiency cost — a general-purpose GPU running a specific workload typically consumes 2-5x the energy per useful operation of a purpose-built accelerator running the same workload, because the unused hardware components still consume idle power and the dataflow paths are not optimized for the specific workload pattern.
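To make the idle-power mechanism concrete, the sketch below computes effective energy per unit of useful work when part of the board's power is drawn by units the workload never exercises. Every power and throughput figure is an illustrative assumption, not a measurement of any named part.

```python
# Sketch: idle draw from unused units inflates energy per useful
# operation. All figures are illustrative assumptions.

def joules_per_teraop(active_power_w: float, idle_overhead_w: float,
                      useful_teraops_per_sec: float) -> float:
    """Total board power divided by useful throughput."""
    return (active_power_w + idle_overhead_w) / useful_teraops_per_sec

# Purpose-built accelerator: little silicon the workload does not use.
asic = joules_per_teraop(200.0, 20.0, 400.0)
# General-purpose GPU on the same workload: assumed extra idle draw and
# a dataflow that extracts less useful throughput per watt.
gpu = joules_per_teraop(450.0, 250.0, 500.0)

print(f"ASIC {asic:.2f} J/Top vs GPU {gpu:.2f} J/Top ({gpu / asic:.1f}x)")
```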
Tensor Processing Units (Google’s TPUs, and the broader category of dataflow-optimized AI processors from other vendors) are designed specifically for the matrix-multiply-and-accumulate patterns that dominate transformer-based AI workloads. They sacrifice general-purpose compute capability for higher operations-per-watt on the workloads they are optimized for. For Google’s first-party AI services, TPUs are typically the production-serving hardware, and Google reports meaningfully better energy efficiency for them than for equivalent GPU deployments.
Neural Processing Units are the AI accelerators integrated into mobile, edge, and embedded devices. They are typically optimized for low-power inference (milliwatts to watts, versus the hundreds of watts of data-center accelerators) and deliver dramatic efficiency gains for edge AI workloads that can run on the device rather than being sent to the data center. Edge AI deployment is therefore a sustainability lever in itself: moving inference from data-center GPUs to on-device NPUs can produce 10-100x reductions in per-query inference energy, although it shifts some energy cost to the device’s battery.
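A back-of-the-envelope comparison shows where a figure in the 10-100x range can come from. All numbers below are hypothetical assumptions; real figures depend on the model, the batch size, and the hardware, and the comparison omits network transport and data-center overhead (PUE), both of which would widen the gap further.

```python
# Hypothetical per-query inference energy: batched data-center GPU
# serving vs. single-stream on-device NPU. Illustrative numbers only.

GPU_POWER_W = 350.0          # assumed accelerator draw while serving
GPU_QUERIES_PER_SEC = 50.0   # assumed batched throughput
NPU_POWER_W = 0.5            # assumed on-device NPU draw
NPU_QUERIES_PER_SEC = 2.5    # assumed single-stream throughput

gpu_j = GPU_POWER_W / GPU_QUERIES_PER_SEC   # 7.0 J per query
npu_j = NPU_POWER_W / NPU_QUERIES_PER_SEC   # 0.2 J per query

print(f"GPU {gpu_j:.1f} J/query vs NPU {npu_j:.1f} J/query "
      f"({gpu_j / npu_j:.0f}x reduction)")
```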
Custom ASICs are designed for narrower workload classes — specific model architectures, specific token-throughput targets, or specific quantization schemes. Examples include the inference-serving ASICs from several startup vendors (Groq, Cerebras, SambaNova, and others) and the in-house silicon from the largest cloud providers (AWS Trainium and Inferentia, Google TPU, Microsoft Maia). For the workloads they are optimized for, custom ASICs can deliver order-of-magnitude efficiency gains over general-purpose GPUs. The trade-off is that they support a narrower range of models and require workload-specific software porting.
The hardware-efficiency-per-generation trend
Hardware efficiency improves with each generation, typically delivering a 2-4x improvement in operations-per-watt every two to three years across the major accelerator families. The aggregate effect is dramatic: a 2024-vintage AI cluster delivers 5-10x the operations-per-watt of a 2018-vintage cluster. The hardware refresh cycle is therefore itself a sustainability lever: a cluster kept in service long after current-generation hardware could do the same work for a fraction of the energy is consuming more energy per useful operation than necessary.
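The two figures are consistent with each other: compounding the per-generation gain across the two to three generations that separate 2018 and 2024 hardware lands inside the 5-10x range. The generation counts in the sketch below are assumptions.

```python
# Compounding per-generation operations-per-watt gains.

def compounded(per_gen_gain: float, generations: int) -> float:
    return per_gen_gain ** generations

print(compounded(2.0, 3))   # 8.0x  -- conservative 2x gain, 3 generations
print(compounded(2.5, 2))   # 6.25x -- mid-range gain, 2 generations
```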
The hardware-efficiency-per-generation trend has been documented across multiple accelerator families and is expected to continue for at least the next several years as process-node improvements and architectural innovations compound. The IEA Electricity 2024 report incorporates this trend into its data-center electricity-growth projections, distinguishing between the gross growth in AI compute demand and the net growth after hardware-efficiency improvements.1
The embodied-carbon trade-off
A faster hardware refresh cycle improves operational efficiency but increases embodied-carbon emissions per unit of useful work. The embodied carbon of manufacturing a high-end accelerator is in the hundreds of kilograms of CO2 equivalent; amortized over a five-year operating life, the per-year embodied figure is much smaller than the operational figure for a high-utilization accelerator but becomes comparable for a low-utilization accelerator. The optimal refresh cycle depends on the workload utilization, the operational-efficiency improvement of the new generation, and the embodied-carbon delta. Article 10 of this module develops the lifecycle-assessment framing in detail.
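A minimal sketch of that amortization, assuming a 300 kg CO2e manufacturing footprint, a 700 W board, a five-year life, and a 0.35 kg CO2e/kWh grid factor; all four inputs are illustrative and should be replaced with vendor lifecycle data and measured utilization.

```python
# Annual embodied vs. operational carbon for one accelerator.
# All inputs are illustrative assumptions.

EMBODIED_KG_CO2E = 300.0      # assumed manufacturing footprint
LIFETIME_YEARS = 5.0
POWER_W = 700.0               # assumed board power at full load
GRID_KG_CO2E_PER_KWH = 0.35   # assumed grid emission factor

def annual_footprint_kg(utilization: float) -> tuple[float, float]:
    """(embodied, operational) kg CO2e per year at a given utilization."""
    embodied = EMBODIED_KG_CO2E / LIFETIME_YEARS
    kwh = POWER_W / 1000.0 * 8760.0 * utilization
    return embodied, kwh * GRID_KG_CO2E_PER_KWH

for u in (0.80, 0.05):        # high- vs. low-utilization accelerator
    e, o = annual_footprint_kg(u)
    print(f"utilization {u:.0%}: embodied {e:.0f} kg/yr, operational {o:.0f} kg/yr")
```

At high utilization the embodied share is a rounding error; at low utilization it approaches the operational figure, which is why utilization belongs in the refresh criteria discussed under Practical Application below.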
Procurement and disclosure
The Stanford Foundation Model Transparency Index (FMTI) compute-layer scores have begun to recognize disclosure of the hardware used for training and serving as a transparency indicator, which is making the hardware-efficiency layer increasingly procurement-relevant.2 An enterprise AI program that wants to defend its sustainability claims should be able to disclose the hardware mix — share of compute on each accelerator family, hardware-vintage profile, and operations-per-watt achieved — alongside its energy and emissions figures.
The McKinsey State of AI surveys have documented that the most sustainability-mature enterprise AI programs are increasingly engaging with their cloud providers on the hardware-efficiency dimension, requesting access to the most efficient accelerator types for their workloads and incorporating hardware-vintage-mix disclosure into their cloud-procurement contracts.3
Maturity Indicators
The COMPEL D19 maturity rubric specifies that at Level 3 (Defined), “model efficiency metrics (e.g., performance per watt) are included in model cards” — the metric is hardware-dependent and requires the practitioner to know the hardware on which the figure was measured.4 At Level 4 (Advanced), “model efficiency optimization (distillation, pruning, quantization) is standard practice” — and the optimization is hardware-aware (a model quantized to INT8 produces savings only on hardware that supports INT8 natively). The hardware-efficiency layer is therefore an implicit prerequisite at Level 3 and an explicit one at Level 4.
The Green Software Foundation has documented that the per-operation energy delta between accelerator generations is among the largest sustainability-improvement levers available to practitioners who do not control the model architecture itself.5
Practical Application
A foundational practitioner who is engaging with the hardware-efficiency question should produce four artifacts.
Artifact 1: the hardware-mix inventory. An inventory that, for each AI workload, records the accelerator family, the accelerator generation, the operations-per-watt figure, and the workload’s measured energy-per-useful-operation. The inventory is the input to procurement and refresh-decision conversations.
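One possible record shape for the inventory is sketched below; the field names and example values are illustrative, not a prescribed COMPEL schema.

```python
# A sketch of one hardware-mix inventory record (Artifact 1).
# Field names and the example values are illustrative.

from dataclasses import dataclass

@dataclass
class WorkloadHardwareRecord:
    workload: str                # e.g. "support-chat inference"
    accelerator_family: str      # GPU / TPU / NPU / ASIC
    accelerator_generation: str  # e.g. "A100 (2020)"
    ops_per_watt: float          # rated or measured useful ops per watt
    joules_per_op: float         # measured energy per useful operation

inventory = [
    WorkloadHardwareRecord("support-chat inference", "GPU", "A100 (2020)",
                           ops_per_watt=1.2e9, joules_per_op=8.3e-10),
]
```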
Artifact 2: the per-workload hardware-efficiency dashboard. A dashboard that displays the operations-per-watt of each workload alongside the operations-per-watt of the most efficient available alternative for that workload. Workloads with large efficiency gaps become candidates for hardware migration or workload re-engineering.
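A sketch of the gap check behind the dashboard follows; the 2x flag threshold is an illustrative policy choice, not a COMPEL requirement.

```python
# Flag workloads whose best available hardware alternative is at least
# `threshold` times more efficient than their current deployment.
# Inputs and the threshold are illustrative assumptions.

def migration_candidates(current_ops_per_watt: dict[str, float],
                         best_ops_per_watt: dict[str, float],
                         threshold: float = 2.0) -> list[str]:
    return [w for w, opw in current_ops_per_watt.items()
            if best_ops_per_watt[w] / opw >= threshold]

current = {"support-chat": 1.2e9, "doc-summarization": 3.0e9}
best = {"support-chat": 4.8e9, "doc-summarization": 3.6e9}
print(migration_candidates(current, best))   # ['support-chat']
```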
Artifact 3: the hardware-refresh decision criteria. The criteria the organization applies when deciding whether to refresh hardware — incorporating the operational-efficiency improvement, the embodied-carbon delta, the workload utilization, and the financial cost. The criteria explicitly recognize that the optimal refresh cadence is a sustainability decision, not just a financial one.
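One way to encode the carbon side of those criteria: refresh when the operational emissions avoided over the planning horizon exceed the embodied carbon of the replacement hardware. All inputs below are illustrative, and a real decision would weigh financial cost alongside.

```python
# Carbon payback test for a hardware refresh. Illustrative inputs only;
# a full decision also weighs financial cost and workload utilization.

def refresh_pays_back(old_kwh_per_year: float,
                      efficiency_gain: float,
                      new_embodied_kg_co2e: float,
                      grid_kg_co2e_per_kwh: float,
                      horizon_years: float) -> bool:
    new_kwh_per_year = old_kwh_per_year / efficiency_gain
    avoided_kg = ((old_kwh_per_year - new_kwh_per_year)
                  * grid_kg_co2e_per_kwh * horizon_years)
    return avoided_kg > new_embodied_kg_co2e

# ~0.7 kW at 60% utilization is ~3680 kWh/yr; assume a 2.5x gain,
# 300 kg CO2e embodied in the new part, and a 5-year horizon.
print(refresh_pays_back(3680.0, 2.5, 300.0, 0.35, 5.0))   # True
```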
Artifact 4: the hardware-disclosure narrative. A written disclosure that explains the hardware mix, the operations-per-watt achieved, the trajectory toward higher-efficiency hardware, and the engagement with hardware vendors and cloud providers on efficiency improvements.
The Hugging Face AI Energy Score leaderboard publishes per-model energy figures alongside hardware metadata, providing benchmarks that the practitioner can compare against.6 The European Union Corporate Sustainability Reporting Directive (CSRD) ESRS E1 disclosure includes year-over-year energy intensity, which makes hardware-efficiency improvements directly visible in corporate reporting.7 The Organisation for Economic Co-operation and Development (OECD) AI Principles’ lifecycle framing supports the practitioner’s expectation that hardware-efficiency decisions are integrated into the AI program rather than delegated entirely to infrastructure teams.8
Summary
Hardware efficiency is the per-useful-operation energy consumption of the accelerator hardware. The principal categories — general-purpose GPUs, purpose-built TPUs and dataflow processors, low-power NPUs for edge deployment, and custom ASICs — trade flexibility against efficiency. Each accelerator generation delivers 2-4x operations-per-watt improvement, making the hardware refresh cycle a sustainability lever — but the embodied-carbon trade-off requires a deliberate optimal-refresh decision rather than a default to the newest hardware. The COMPEL D19 maturity rubric requires hardware-aware efficiency metrics in model cards at Level 3 and hardware-aware optimization as standard practice at Level 4. The next article, M1.9, Carbon-Aware Scheduling: Time-of-Day and Region-Based Workload Placement, develops the scheduling lever that aligns workload execution with the lowest-emission hours and regions.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. International Energy Agency, “Electricity 2024.” https://www.iea.org/reports/electricity-2024 — accessed 2026-04-26.
2. Stanford CRFM, “Foundation Model Transparency Index.” https://crfm.stanford.edu/fmti/ — accessed 2026-04-26.
3. McKinsey & Company, “The state of AI.” https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — accessed 2026-04-26.
4. COMPEL Domain D19 maturity rubric, Levels 3 and 4. See shared/data/compelDomains.ts.
5. Green Software Foundation. https://greensoftware.foundation/ — accessed 2026-04-26.
6. Hugging Face, “AI Energy Score Leaderboard.” https://huggingface.co/spaces/AIEnergyScore/Leaderboard — accessed 2026-04-26.
7. Directive (EU) 2022/2464 on Corporate Sustainability Reporting. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32022L2464 — accessed 2026-04-26.
8. Organisation for Economic Co-operation and Development, “OECD AI Principles.” https://oecd.ai/en/ai-principles — accessed 2026-04-26.