LLM Quantization 2026: GGUF vs AWQ vs GPTQ vs FP8
4-bit quantization cuts LLM inference cost 60 to 80% and fits a 70B model on one GPU. Here is which format to pick for CPU, GPU serving, and edge in 2026.

Running a capable open model used to mean renting expensive GPUs you could not afford to keep busy. Quantization changed that math. By compressing model weights from 16-bit down to 8-bit, 4-bit, or even fewer bits, you shrink the memory footprint dramatically while keeping almost all the quality. A 70B model that needed 140GB of VRAM suddenly fits on a single high-end GPU. In 2026 quantization is the foundation that makes self-hosted LLMs affordable. The catch is choosing the right format, because GGUF, AWQ, GPTQ, and FP8 each suit a different deployment.
Quick answer
Pick the quantization format by your hardware, not by which sounds best. Use GGUF for CPU, laptop, and edge inference; AWQ for production GPU serving, where it gives the best 4-bit quality for instruction-tuned models in 2026; keep existing GPTQ checkpoints but do not start new projects on them; and use FP8 or INT8 when you want near-perfect quality and can spare the VRAM. 4-bit quantization typically costs 2 to 3% of model quality while cutting inference cost 60 to 80% and dropping a 70B model from about 140GB to 35-40GB of VRAM.
Key takeaways
- 4-bit quantization cuts inference cost 60 to 80% and drops a 70B model from about 140GB to 35 to 40GB of VRAM, small enough for a single 80GB A100 (a 24GB RTX 4090 still needs partial CPU offload).
- GGUF is the format for CPU and mixed CPU+GPU inference, the default for edge and local setups.
- AWQ is the default for production GPU serving in 2026, with the best quality at 4-bit for instruction-tuned models.
- GPTQ has been largely superseded by AWQ for new releases but remains fine for existing checkpoints.
- Quality retention at 4-bit is high, roughly 97 to 98% across the leading formats.
What quantization actually does
A model's weights are numbers. By default they are stored at 16-bit precision, which is accurate but memory-hungry. Quantization rounds those numbers to fewer bits (8-bit, 4-bit, sometimes lower) so each weight takes less space. Less memory means the model fits on cheaper hardware. Because token generation is usually limited by how fast weights can be read from memory, smaller weights also tend to raise throughput.
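To make the rounding step concrete, here is a minimal sketch of symmetric round-to-nearest quantization for one small group of weights. It is an illustration only: the function names and sample weights are made up for this example, and AWQ, GPTQ, and GGUF each use more careful schemes than plain rounding.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step, clamped to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from the integer codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.12, -0.53, 0.07, 0.91, -0.28, 0.44, -0.76, 0.02]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", round(scale, 4))
print("largest rounding error:", round(max_error, 4))
```

Each weight is stored as a small integer plus one shared scale per group, so the rounding error for any weight is at most half a step.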
The surprise is how little quality you lose. At 4-bit, the leading formats retain roughly 97 to 98% of the original model's quality, a trade most production teams take happily for a 60 to 80% cost reduction. This is the practical lever behind running open-weight LLMs economically, and it pairs naturally with choosing the right inference engine to serve them.
Note
The headline trade: 4-bit quantization typically costs you 2 to 3% of model quality and saves you 60 to 80% of your inference bill. For most production workloads that is an easy yes.
The format guide
The formats are not competitors so much as specialists for different hardware.
GGUF comes from the llama.cpp project and is built for CPU and mixed CPU+GPU inference. If you are running locally, on a laptop, or on edge hardware without a big GPU, GGUF is the format. Its Q4_K_M variant is the popular sweet spot, at around 1,934 tokens/second with GPU offload and about 97.8% quality retention.
AWQ is the 2026 default for production GPU serving. It achieves the best quality at 4-bit, particularly for instruction-tuned models, and posts strong throughput, around 2,847 tokens/second at 4-bit with 98.1% quality retention. When a new model drops, it tends to land in AWQ first.
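GPTQ was the previous GPU standard and still works fine, at around 2,612 tokens/second and 98.4% quality retention. That retention figure sits a few tenths of a point from AWQ's 98.1%, so treat the two as effectively tied on quality until you benchmark your own task. GPTQ has been largely superseded by AWQ for new releases, but it remains a reasonable choice for existing GPTQ checkpoints and has the largest legacy library.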
FP8 and INT8 are the lighter-touch options, 8-bit precision that loses almost nothing in quality while still roughly halving memory. Use these when you want maximum quality and can spare more VRAM than 4-bit needs.
Here is the same picture in one table, with the indicative throughput and quality-retention figures the 2026 benchmarks report:
| Format | Best for | Bits | Throughput (indicative) | Quality retained |
|---|---|---|---|---|
| GGUF (Q4_K_M) | CPU, laptop, edge | 4 | ~1,934 tok/s with GPU offload | ~97.8% |
| AWQ | Production GPU serving | 4 | ~2,847 tok/s | ~98.1% |
| GPTQ | Existing GPU checkpoints | 4 | ~2,612 tok/s | ~98.4% |
| FP8 / INT8 | Quality-first, VRAM to spare | 8 | High, near-FP16 | ~99%+ |
Treat the throughput numbers as relative, not absolute: they depend heavily on the GPU, batch size, and inference engine. The ranking, however, is stable: AWQ and GPTQ lead on GPU, GGUF owns CPU and mixed setups, and 8-bit formats sacrifice memory savings for the highest fidelity.
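One way to read the table as relative numbers is to normalize each figure against the fastest entry. The short sketch below does that with the indicative throughput values from the table; the variable names are illustrative, and the ratios only carry over to your setup if the hardware and engine are comparable.

```python
# Indicative 4-bit throughput figures from the table above (tokens/second).
throughput_tok_s = {"GGUF (Q4_K_M)": 1934, "AWQ": 2847, "GPTQ": 2612}

fastest = max(throughput_tok_s.values())
relative = {name: tok_s / fastest for name, tok_s in throughput_tok_s.items()}

# Print formats from fastest to slowest as a fraction of the fastest.
for name, ratio in sorted(relative.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {ratio:.2f}x of fastest")
```

On these figures GPTQ lands at roughly 0.92 of AWQ's throughput and GGUF with GPU offload at roughly 0.68.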

The 70B-on-one-GPU example
The clearest illustration: a 70B model at full 16-bit precision needs about 140GB of VRAM, which means multiple high-end GPUs. Quantize it to 4-bit with AWQ or GPTQ and it drops to 35 to 40GB, which fits on a single 80GB A100 with room left for the KV cache. A 24GB RTX 4090 cannot hold that much on its own, so on consumer cards a 70B model still needs partial CPU offload (the GGUF route) or a second GPU. That single change is the difference between an unaffordable multi-GPU deployment and a single-card one, and it is why quantization, not bigger budgets, is how most teams run large models in 2026. For smaller targets, the same approach makes on-device and small-model agents practical on consumer hardware.
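The arithmetic behind those figures is simply parameter count times bits per weight. A minimal sketch, where `weight_memory_gb` is a helper defined here for illustration and the 4.5 bits-per-weight value is an assumed stand-in for the overhead of scales and unquantized layers, not a measured number:

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")

# Assumption: real 4-bit checkpoints carry some extra metadata per group,
# modeled here as an effective 4.5 bits per weight.
print(f"70B at ~4.5 effective bits: {weight_memory_gb(70, 4.5):.1f} GB")
```

Pure 4-bit storage gives 35GB for 70B parameters; the assumed overhead is one way to see why quoted figures land in the 35 to 40GB range.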
The pattern holds across model sizes. Roughly what each needs in VRAM, before counting the extra headroom for context and the KV cache:
| Model size | FP16 (full) | 4-bit quantized | Fits on |
|---|---|---|---|
| 7B | ~14 GB | ~4-5 GB | Most modern laptops / 8GB GPU |
| 13B | ~26 GB | ~8-9 GB | A 12GB consumer GPU |
| 34B | ~68 GB | ~18-20 GB | A single 24GB card |
| 70B | ~140 GB | ~35-40 GB | One 80GB A100 (a 24GB RTX 4090 needs CPU offload) |
Budget a few extra gigabytes on top of these figures for the context window, since long prompts and large batches grow the KV cache independently of the weights.
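For a rough sense of how much headroom that is, the KV cache can be estimated from the model shape and the context length. The sketch below uses the standard count of one key and one value vector per layer per token. The shape values (80 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions picked to resemble a 70B-class model with grouped-query attention, so check your model's config before relying on them.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Approximate KV cache size in decimal gigabytes.

    Factor 2 covers the separate key and value tensors stored per layer.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bytes_per_value / 1e9


# Assumed shape for illustration only; not taken from any specific checkpoint.
shape = dict(layers=80, kv_heads=8, head_dim=128)

for seq_len in (8192, 32768):
    print(f"{seq_len} tokens: {kv_cache_gb(seq_len=seq_len, **shape):.1f} GB")
```

Under these assumptions an 8K context costs under 3GB per sequence, and the cost grows linearly with both context length and batch size.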
Choosing a format
- CPU, laptop, or edge with no big GPU: use GGUF (Q4_K_M is the popular sweet spot).
- Production GPU serving: use AWQ, the 2026 default with best 4-bit quality for instruction-tuned models.
- Existing GPTQ checkpoints: keep them. GPTQ still works; just do not start new projects on it.
- Maximum quality with VRAM to spare: use FP8 or INT8 instead of going to 4-bit.
- Always benchmark quality on your own task: the 97 to 98% retention is an average, not a guarantee for your workload.
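The checklist above can be written down as a small decision function. This is a sketch of the article's rules of thumb, not a library API: the function name, its parameters, and the order in which the rules are checked are all assumptions made for illustration.

```python
def pick_format(has_gpu, existing_gptq_checkpoint=False,
                quality_first=False, vram_to_spare=False):
    """Map the hardware and constraints from the checklist to a format."""
    if not has_gpu:
        # CPU, laptop, or edge without a big GPU.
        return "GGUF (Q4_K_M)"
    if quality_first and vram_to_spare:
        # 8-bit keeps more fidelity at the cost of more VRAM.
        return "FP8 or INT8"
    if existing_gptq_checkpoint:
        # Keep what already works; no need to re-quantize.
        return "GPTQ (existing checkpoint)"
    # Default for new production GPU serving.
    return "AWQ"


print(pick_format(has_gpu=False))
print(pick_format(has_gpu=True))
print(pick_format(has_gpu=True, existing_gptq_checkpoint=True))
print(pick_format(has_gpu=True, quality_first=True, vram_to_spare=True))
```

Whatever the function returns, the last bullet still applies: confirm the choice with a benchmark on your own task.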
What to do right now
If you are about to deploy a quantized model:
- Identify your hardware first: CPU/edge points to GGUF, a GPU points to AWQ.
- Check the VRAM table above to confirm the size you want fits your card at 4-bit.
- Download the AWQ build for production GPU serving, or Q4_K_M GGUF for local use.
- Reserve headroom for the context window on top of the weight size.
- Benchmark on your own prompts, not the published averages, before committing.
- Step up to FP8 or INT8 only if your task is quality-sensitive and you have the VRAM.
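For the benchmarking step, a minimal harness can compare the quantized model against the full-precision baseline on your own prompts. In the sketch below, exact-match scoring and the stand-in "models" (plain dictionary lookups) are assumptions for illustration; in practice you would pass in functions that call your actual inference engine and use a metric suited to your task.

```python
def exact_match_score(generate, examples):
    """Fraction of prompts whose generated answer equals the expected one."""
    hits = sum(1 for prompt, expected in examples
               if generate(prompt).strip() == expected)
    return hits / len(examples)


def quality_retention(baseline_generate, quantized_generate, examples):
    """Quantized score as a fraction of the baseline score on the same prompts."""
    baseline = exact_match_score(baseline_generate, examples)
    if baseline == 0:
        raise ValueError("baseline scored zero; retention is undefined")
    return exact_match_score(quantized_generate, examples) / baseline


# Hypothetical evaluation set and stand-in models, for illustration only.
examples = [("2+2=", "4"), ("Capital of France?", "Paris"),
            ("3*3=", "9"), ("Opposite of hot?", "cold")]
baseline_answers = dict(examples)
quantized_answers = {**baseline_answers, "3*3=": "6"}  # one simulated regression

retention = quality_retention(baseline_answers.get, quantized_answers.get, examples)
print(f"retention: {retention:.0%}")
```

A few hundred prompts drawn from real traffic will tell you more about your workload than any published average.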
Frequently asked questions
How much quality do I lose at 4-bit?
Roughly 2 to 3%, leaving about 97 to 98% retention across GGUF, AWQ, and GPTQ. For most production workloads that is a trivial cost against a 60 to 80% reduction in inference spend. Sensitive tasks should still benchmark on their own data to confirm.
Should I use GGUF or AWQ?
Use GGUF for CPU, laptop, and edge inference, since it is built for hardware without a big GPU. Use AWQ for production GPU serving, where it is the 2026 default with the best 4-bit quality for instruction-tuned models. Match the format to your hardware, not the other way around.
Is GPTQ obsolete?
Not obsolete, just superseded for new work. AWQ generally matches or beats it on quality and is where new models land first. Keep your existing GPTQ checkpoints, which are fine, but start new GPU serving projects on AWQ.
Can quantization really fit a 70B model on one GPU?
Yes, on a large enough card. At full precision a 70B model needs about 140GB of VRAM; 4-bit quantization drops it to 35 to 40GB, which fits on a single 80GB A100 but not on a 24GB RTX 4090 without CPU offload. That is the central reason quantization, rather than more hardware, is how teams run large models affordably.
The takeaway
Quantization is what makes self-hosted LLMs affordable in 2026, cutting cost 60 to 80% while keeping 97 to 98% of quality. Pick by hardware: GGUF for CPU and edge, AWQ for production GPU serving, FP8 or INT8 when quality matters most and VRAM is plentiful. Get the format right and a model that once demanded a GPU cluster runs on a single card.
Sources & further reading
- sesamedisk.com/quantization-techniques-ai-inference-2026/
- fungies.io/llm-quantization-gguf-awq-gptq-guide-2026/
- branch8.com/posts/quantization-llm-inference-cost-optimization-apac-guide
- digitalapplied.com/blog/gguf-vs-awq-vs-gptq-vs-mlx-llm-quantization-formats-2026
- vrlatech.com/llm-quantization-explained-int4-int8-fp8-awq-and-gptq-in-2026/


