AITF M1.9-Art04 v1.0 Reviewed 2026-04-06 Open Access
AITF · Foundations

Inference Optimization for Sustainability: Quantization, Distillation, Pruning


8 min read Article 4 of 15

This article surveys the three techniques, the trade-offs each introduces, and the operational practices that make optimization a standard part of the deployment pipeline rather than a one-off project.

Quantization

Quantization is the practice of representing model weights and activations in lower-precision numerical formats. The baseline format for most modern foundation models is 16-bit floating-point (FP16 or BF16); the quantized formats are typically 8-bit integer (INT8), 4-bit integer (INT4), or in some cases mixed-precision schemes that quantize different layers to different precisions.

The energy savings come from two sources. First, lower-precision arithmetic requires less energy per operation — an INT8 multiply consumes roughly one-quarter the energy of an FP16 multiply on hardware that supports both natively. Second, lower-precision weights require less memory bandwidth — the dominant energy cost in modern inference is moving weights from high-bandwidth memory into the compute units, and reducing weight precision linearly reduces this cost.
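As a rough illustration of how the two sources combine, the sketch below computes a back-of-envelope energy ratio for a weight-bound forward pass using the illustrative ratios above (INT8 arithmetic at roughly one-quarter the per-operation energy of FP16, memory-movement energy scaling linearly with weight precision). The constants are placeholders for the shape of the calculation, not measured figures.

```python
# Back-of-envelope estimate of relative per-token inference energy for a
# weight-bound decoder. All constants are illustrative placeholders.

PARAMS = 7e9                 # model size (parameters)
FP16_BYTES, INT8_BYTES = 2, 1

E_OP_FP16 = 1.0              # relative energy per multiply-accumulate at FP16
E_OP_INT8 = 0.25             # ~1/4 of FP16 on hardware with native INT8 support
E_BYTE = 0.5                 # relative energy to move one weight byte from memory

def relative_energy_per_token(bytes_per_weight: int, e_op: float) -> float:
    # One forward pass touches every weight once: one MAC plus one weight fetch each.
    compute = PARAMS * e_op
    movement = PARAMS * bytes_per_weight * E_BYTE
    return compute + movement

fp16 = relative_energy_per_token(FP16_BYTES, E_OP_FP16)
int8 = relative_energy_per_token(INT8_BYTES, E_OP_INT8)
print(f"INT8 / FP16 energy ratio: {int8 / fp16:.2f}")   # ~0.38 with these placeholders
```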

The accuracy cost of quantization is typically small for well-quantized models — often under 1 percentage point on standard benchmarks for INT8, and 1-3 percentage points for INT4. The Hugging Face AI Energy Score leaderboard publishes per-model energy figures at multiple precision levels, allowing direct comparison of the accuracy-energy trade-off for specific models.1

The two practical approaches to quantization are post-training quantization (applying quantization after the model is trained, with optional calibration on a small representative dataset) and quantization-aware training (training the model with simulated low-precision arithmetic so that it learns to be robust to quantization). For most enterprise use cases, post-training quantization with calibration is sufficient and is supported by all major inference frameworks.
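A minimal sketch of post-training quantization using PyTorch's built-in dynamic-quantization utility on a toy model. Production LLM serving stacks typically use their own calibrated INT8/INT4 tooling, so this illustrates the workflow rather than a serving recipe.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model; in practice this would be the production model.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
)
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# stored as INT8 and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Rough check that the quantized model is a drop-in replacement.
def param_bytes(m: nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(f"FP32 parameter bytes: {param_bytes(model):,}")
with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print("Quantized output shape:", tuple(out.shape))
```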

Distillation

Distillation is the practice of training a smaller “student” model to reproduce the outputs of a larger “teacher” model. The student is trained on a dataset of teacher-generated outputs (or on a combination of teacher outputs and original ground-truth labels), and the result is a model that is typically 5-50x smaller than the teacher with comparable accuracy on the use case the student was distilled for.
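A minimal sketch of one distillation training step, assuming the common soft-label formulation: a KL-divergence term between temperature-softened teacher and student distributions, mixed with a ground-truth cross-entropy term. The temperature and mixing weight are typical hyperparameters, not values prescribed by this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, ALPHA = 2.0, 0.5   # softening temperature and loss-mixing weight (illustrative)

def distillation_step(student: nn.Module, teacher: nn.Module,
                      optimizer: torch.optim.Optimizer,
                      inputs: torch.Tensor, labels: torch.Tensor) -> float:
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # KL divergence between temperature-softened distributions, scaled by T^2
    # to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    loss = ALPHA * kd + (1 - ALPHA) * ce

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```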

The energy savings from distillation are dramatic because they apply to every dimension of inference cost — fewer parameters means less memory bandwidth, less compute, less latency, and less idle-capacity provisioning. A 7-billion-parameter student distilled from a 70-billion-parameter teacher typically consumes one-tenth the inference energy at comparable task-specific accuracy.

The accuracy cost depends on the alignment between the distillation training data and the production query distribution. Distillation is most effective when the teacher’s strengths on the production tasks can be captured by a representative training-data sample; it is least effective when the teacher’s value comes from broad general-purpose capability that the student cannot internalize at smaller scale.

The McKinsey State of AI surveys have documented that distillation is increasingly the production architecture for enterprise generative-AI deployments — a small, distilled model serves the high-volume production traffic, with escalation to a larger model for the small fraction of queries that the distilled model cannot handle confidently.2
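The escalation pattern can be sketched as a simple confidence-gated router. The two callables and the confidence heuristic below are hypothetical placeholders, not a specific vendor API; the threshold would be tuned on a held-out sample of production traffic.

```python
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.85   # tuned on held-out production traffic (placeholder)

def route(query: str,
          student: Callable[[str], Tuple[str, float]],
          teacher: Callable[[str], str]) -> str:
    # Serve from the distilled student by default; escalate to the teacher
    # only when the student's confidence falls below the threshold.
    answer, confidence = student(query)   # e.g. mean token log-prob mapped to [0, 1]
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer                     # high-volume, low-energy path
    return teacher(query)                 # small fraction of queries escalate
```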

Pruning

Pruning is the practice of removing weights, neurons, or entire structural components (attention heads, layers) from a trained model on the basis that they contribute little to the model’s predictions. The two main variants are unstructured pruning (removing individual weights, which produces a sparse weight matrix that requires sparse-aware hardware to realize the energy savings) and structured pruning (removing entire structural components, which produces a smaller dense model that runs efficiently on standard hardware).
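Both variants can be prototyped with PyTorch's pruning utilities; the sketch below applies them to a single stand-in layer. Note that structured pruning here only zeroes whole output rows; physically shrinking the layer to realize the dense-model speed-up requires rebuilding the module afterward.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)   # stand-in for one layer of a trained model

# Unstructured pruning: zero out the 30% of individual weights with the smallest
# L1 magnitude. Produces a sparse matrix; savings require sparse-aware hardware.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: zero the 50% of output rows (neurons) with the smallest
# L2 norm. These rows map to a smaller dense layer on standard hardware.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Make the pruning permanent (folds the accumulated mask into the weight tensor).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")
```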

Structured pruning is the more practically deployable technique for most enterprise workloads. A 30-50% pruned model typically retains 95%+ of the unpruned model’s accuracy at 30-50% lower inference energy. Combined with quantization, structured pruning compounds the savings.

The accuracy cost of pruning is highly dependent on the model architecture and the use case. Modern foundation models are typically over-parameterized for any specific use case, which is why pruning is usually viable; but the practitioner should always evaluate the pruned model against the use case’s evaluation set before deploying.

Combining the techniques

The three techniques compose. A typical production-grade optimization pipeline is: distill the foundation model into a use-case-specific student; structurally prune the student to remove the components that the use case does not exercise; quantize the pruned student to INT8 or INT4 for serving. The composed pipeline can produce a model that consumes one-twentieth to one-fiftieth of the original foundation model’s inference energy at comparable use-case accuracy.
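A hedged sketch of how the composed pipeline might be wired together. The distill, prune, quantize, and evaluate callables are placeholders for the organization's own tooling, not a real library API; the point is the ordering and the accuracy gate at the end.

```python
from typing import Any, Callable

def optimize_for_deployment(
    teacher: Any,
    distill: Callable[[Any], Any],     # foundation model -> use-case-specific student
    prune: Callable[[Any], Any],       # structured pruning of the student
    quantize: Callable[[Any], Any],    # INT8/INT4 conversion for serving
    evaluate: Callable[[Any], float],  # score on the use case's own evaluation set
    accuracy_floor: float,
) -> Any:
    student = distill(teacher)
    pruned = prune(student)
    served = quantize(pruned)

    # Gate deployment on the use case's evaluation set, not a generic benchmark.
    score = evaluate(served)
    if score < accuracy_floor:
        raise ValueError(f"Optimized variant below accuracy floor: {score:.3f}")
    return served
```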

The Green Software Foundation has documented case studies in which composed optimization pipelines have reduced data-center inference energy for a generative-AI service by 90%+ while maintaining the service-level objectives that the business required.3

Maturity Indicators

The COMPEL D19 maturity rubric specifies that at Level 4 (Advanced), “model efficiency optimization (distillation, pruning, quantization) is standard practice.”4 The Level 4 indicator is satisfied when the optimization pipeline is integrated into the standard deployment pipeline — every model that goes to production has been evaluated for, and where appropriate optimized via, the three techniques. An organization that is doing optimization as a one-off engineering project for a flagship deployment but not as a standard practice is at Level 3 on this dimension; the transition to Level 4 is the institutionalization of the practice.

The Stanford Foundation Model Transparency Index (FMTI) compute-layer scores have begun to reward providers for publishing inference-energy figures at multiple precision levels and for distilled variants, which is creating an industry-wide expectation that the optimization layer is documented as part of the model card.5

Practical Application

A foundational practitioner who is institutionalizing optimization should produce three artifacts.

Artifact 1: the optimization-decision tree. A document that, given a model and a use case, walks the practitioner through the decision of which techniques to apply and in what order. The tree should be calibrated to the organization’s hardware (some hardware does not realize savings from unstructured, sparse pruning) and to the organization’s accuracy tolerances.
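One way to make the decision tree executable rather than purely documentary is to encode it as a small function over a use-case profile. The fields, thresholds, and technique choices below are hypothetical placeholders to be calibrated to the organization's own fleet and tolerances.

```python
from dataclasses import dataclass

@dataclass
class UseCaseProfile:
    monthly_queries: int
    accuracy_tolerance_pp: float        # acceptable accuracy drop, percentage points
    hardware_supports_sparsity: bool
    has_representative_training_data: bool

def recommend_techniques(p: UseCaseProfile) -> list[str]:
    plan: list[str] = []
    # Distillation is the biggest lever, but only when a representative
    # training sample exists and the volume justifies the training cost.
    if p.has_representative_training_data and p.monthly_queries > 1_000_000:
        plan.append("distillation")
    # Prefer structured pruning unless the hardware can exploit sparsity.
    plan.append("unstructured_pruning" if p.hardware_supports_sparsity
                else "structured_pruning")
    # Drop to INT4 only when the accuracy tolerance allows it.
    plan.append("quantize_int4" if p.accuracy_tolerance_pp >= 2.0 else "quantize_int8")
    return plan
```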

Artifact 2: the deployment-pipeline integration. The MLOps deployment pipeline should include an optimization stage that — by default — distills, prunes, and quantizes the candidate model and produces a comparison report of the optimized variant against the unoptimized baseline. The default should be to deploy the optimized variant unless an explicit accuracy-justification is recorded.
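The default-to-optimized policy at the end of that stage can be captured as a small gate over the comparison report; the field names and threshold below are illustrative placeholders rather than a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComparisonReport:
    baseline_accuracy: float
    optimized_accuracy: float
    baseline_energy_per_1k_queries_kwh: float
    optimized_energy_per_1k_queries_kwh: float

def choose_variant(report: ComparisonReport,
                   max_accuracy_drop: float,
                   justification: Optional[str] = None) -> str:
    # Deploy the optimized variant by default; falling back to the baseline
    # requires an explicit, recorded accuracy justification.
    drop = report.baseline_accuracy - report.optimized_accuracy
    if drop <= max_accuracy_drop:
        return "optimized"
    if justification:
        return "baseline"
    raise ValueError("Accuracy drop exceeds tolerance and no justification recorded")
```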

Artifact 3: the back-catalog optimization audit. The same annual review that this module’s earlier articles introduced for over-specified models should also evaluate already-deployed models for optimization opportunities. Models deployed before the optimization pipeline was institutionalized are typically the largest savings opportunities.

The Greenhouse Gas Protocol’s Scope 2 and Scope 3 categories provide the accounting frame within which the optimization-driven energy reductions are recognized as Scope 2 emission reductions for owned data centers and Scope 3 reductions for cloud-procured compute.6 The European Union Corporate Sustainability Reporting Directive (CSRD) ESRS E1 climate-change disclosure is the corporate-reporting frame within which year-over-year energy-intensity improvements from optimization are recognized.7

Summary

Inference optimization — quantization, distillation, and pruning — is the largest sustainability lever available to a practitioner after model selection. Quantization reduces per-operation energy and memory bandwidth at low accuracy cost; distillation produces a smaller use-case-specific student at one-tenth or less the inference energy of the teacher; structured pruning removes the components that the use case does not exercise. The three techniques compose, with combined savings of one to two orders of magnitude. The COMPEL D19 maturity rubric requires optimization as standard practice at Level 4, which is satisfied by integrating the optimization pipeline into the standard MLOps deployment flow. The next article, Green Data Center Strategies for AI Workloads, develops the facility-level practices that determine the per-kilowatt-hour emission factor that the optimized inference is multiplied by.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Hugging Face, “AI Energy Score Leaderboard.” https://huggingface.co/spaces/AIEnergyScore/Leaderboard — accessed 2026-04-26.

  2. McKinsey & Company, “The state of AI.” https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai — accessed 2026-04-26.

  3. Green Software Foundation, “Software Carbon Intensity Specification” and case studies. https://greensoftware.foundation/ — accessed 2026-04-26.

  4. COMPEL Domain D19 maturity rubric, Level 4. See shared/data/compelDomains.ts.

  5. Stanford CRFM, “Foundation Model Transparency Index.” https://crfm.stanford.edu/fmti/ — accessed 2026-04-26.

  6. Greenhouse Gas Protocol, “Corporate Standard.” https://ghgprotocol.org/ — accessed 2026-04-26.

  7. Directive (EU) 2022/2464 on Corporate Sustainability Reporting. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32022L2464 — accessed 2026-04-26.