
What is Model Distillation?

Training a smaller "student" model to replicate the behavior of a larger "teacher" model at lower cost.

By Council Research Team · Updated: Jan 27, 2026

Definition

Model distillation (knowledge distillation) is a technique where a smaller, more efficient "student" model is trained to mimic the outputs and internal representations of a larger, more capable "teacher" model. Rather than training the student from scratch on raw data, it learns from the teacher's soft probability distributions over outputs, which contain richer information than hard labels alone. This transfers the teacher's learned knowledge into a more compact form. Distilled models can achieve 90-99% of the teacher's quality at a fraction of the size and inference cost. Many production AI systems use distilled models for latency-sensitive applications.
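The "soft probability distributions" mentioned above are usually produced by temperature-scaled softmax, and the student is trained to match them with a KL-divergence loss. A minimal numpy sketch of that core loss (function names and the example temperature are illustrative, not from any particular library):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T gives softer distributions."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student soft distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures,
    following the standard knowledge-distillation formulation.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    return (temperature ** 2) * kl
```

The soft targets are what make distillation work: a teacher that assigns 70% to "cat" and 25% to "dog" tells the student that those classes are similar, information a hard label discards entirely.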

Examples

1. DistilBERT retaining ~97% of BERT's performance at 60% of the size while running about 60% faster
2. Smaller GPT models trained on outputs from GPT-4 to create efficient API models
3. Gemini Nano, distilled from larger Gemini models for on-device mobile deployment
4. Multi-task distillation, where one student learns from multiple specialized teachers
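For the multi-teacher case in the last example, one common approach is to blend the teachers' soft distributions into a single training target for the student. A hedged numpy sketch (the function names, uniform-weight default, and temperature value are assumptions for illustration):

```python
import numpy as np

def softmax(logits, temperature=2.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def multi_teacher_target(teacher_logits_list, weights=None, temperature=2.0):
    """Blend several teachers' soft distributions into one student target.

    weights: optional per-teacher importance; defaults to a uniform average.
    """
    probs = np.stack([softmax(l, temperature) for l in teacher_logits_list])
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    target = np.average(probs, axis=0, weights=weights)
    return target / target.sum()  # renormalize against float drift
```

The student is then trained against `target` with the same KL-divergence loss used for a single teacher; non-uniform weights let specialized teachers dominate on the tasks they know best.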

Why It Matters

Distillation is why smaller AI models can be surprisingly capable — they learned from larger ones. It explains the quality gap between model tiers and why "fast" models can still produce good results.

Related Terms

Pruning

Removing unnecessary parameters from a neural network to make it smaller and faster without significant quality loss.

AI Inference Optimization

Techniques that make AI models generate responses faster and cheaper without reducing output quality.

Model Merging

Combining weights from multiple fine-tuned models into a single model that inherits capabilities from each.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that trains small adapter matrices instead of modifying the full model.

Common Questions

What does Model Distillation mean in simple terms?

Training a smaller "student" model to replicate the behavior of a larger "teacher" model at lower cost.

Why is Model Distillation important for AI users?

Distillation is why smaller AI models can be surprisingly capable — they learned from larger ones. It explains the quality gap between model tiers and why "fast" models can still produce good results.

How does Model Distillation relate to AI chatbots like ChatGPT?

Model Distillation is a fundamental concept behind AI assistants like ChatGPT, Claude, and Gemini. For example, DistilBERT retains ~97% of BERT's performance at 60% of the size. Understanding this helps you use AI tools more effectively.

Related Use Cases

Best AI for Coding

Best AI for Writing

AI Models Using This Concept

Claude · ChatGPT · Gemini

See Model Distillation in Action

Council lets you compare responses from multiple AI models side-by-side. Experience different approaches to the same prompt instantly.

Browse AI Glossary

Large Language Model (LLM) · Prompt Engineering · AI Hallucination · Context Window · Token (AI) · RAG (Retrieval-Augmented Generation) · Fine-Tuning · Temperature (AI) · Multimodal AI · AI Agent