The VRAM Challenge
The latest AI models demand enormous amounts of GPU memory. A 70B parameter language model requires ~140 GB in FP16, far exceeding any consumer GPU. Even "small" models like SDXL (6.6B params) need ~13 GB, barely fitting on an RTX 4070 Ti. This guide covers practical techniques to reduce VRAM usage by 50–80% without significant quality loss.
Quantization: The Biggest Win
Quantization reduces the precision of model weights from 16-bit floats to 8-bit or 4-bit integers. This cuts memory usage proportionally while maintaining surprisingly good quality.
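The core mechanic can be shown in a few lines. This is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy — real schemes like GPTQ and AWQ work per-group and compensate for rounding error, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8: map floats onto the integer range [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats; rounding error is at most scale / 2 per weight."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()

# INT8 stores 1 byte per weight vs. 2 bytes for FP16 -> half the memory
print(q.nbytes, w.astype(np.float16).nbytes, error)
```

Note the trade: storage halves (or quarters, at 4-bit), and the only cost is a bounded rounding error per weight.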
| Format | Bits per Weight | VRAM (7B model) | Quality Impact | 8GB GPU? |
|---|---|---|---|---|
| FP16 | 16 | 14 GB | Baseline | ✗ No |
| INT8 (GPTQ) | 8 | 7 GB | < 1% degradation | ✓ Yes |
| INT4 (GPTQ) | 4 | 3.5 GB | 1–3% degradation | ✓ Yes |
| AWQ 4-bit | 4 | 3.8 GB | < 1.5% degradation | ✓ Yes |
| GGUF Q4_K_M | 4.5 avg | 4.1 GB | ~1% degradation | ✓ Yes |
| INT2 (QuIP#) | 2 | 1.8 GB | 5–10% degradation | ✓ Yes |
For most use cases, AWQ 4-bit quantization offers the best quality-to-VRAM ratio. It's activation-aware, meaning it preserves precision for the most important weights while aggressively quantizing less critical ones.
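The VRAM column in the table above follows directly from `params × bits / 8`. A quick calculator (weights only — real loaders add some overhead for the KV cache and runtime buffers, which is why the measured AWQ and GGUF figures run slightly higher):

```python
def vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: billions of params * bits / 8 bytes."""
    return params_b * bits_per_weight / 8

# Reproduce the 7B column of the table (weights only):
for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("GGUF Q4_K_M", 4.5)]:
    print(f"{fmt}: {vram_gb(7, bits):.1f} GB")
```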
CPU Offloading
When the model exceeds GPU VRAM even after quantization, you can offload some layers to system RAM. This trades speed for capacity:
```bash
# llama.cpp GPU offloading
./main -m model.gguf -ngl 28   # 28 layers on GPU, rest on CPU

# Adjust -ngl based on your VRAM:
#  8 GB GPU: -ngl 20-28
# 12 GB GPU: -ngl 32-40
# 24 GB GPU: -ngl 99 (all on GPU)
```
Each layer offloaded to CPU adds ~10ms latency per token. For a 32-layer model with half on CPU, expect 2–3× slower generation. Still usable for most purposes.
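A toy latency model makes the trade-off concrete. The ~10 ms/offloaded-layer figure comes from the paragraph above; the 3 ms/GPU-layer baseline is an assumption chosen for illustration — tune both for your hardware:

```python
def tokens_per_sec(total_layers: int, gpu_layers: int,
                   gpu_ms_per_layer: float = 3.0,     # assumed baseline, varies by GPU
                   cpu_penalty_ms: float = 10.0):      # extra cost per offloaded layer
    """Rough per-token throughput under partial CPU offloading."""
    cpu_layers = total_layers - gpu_layers
    latency_ms = total_layers * gpu_ms_per_layer + cpu_layers * cpu_penalty_ms
    return 1000.0 / latency_ms

full_gpu = tokens_per_sec(32, 32)   # all 32 layers on GPU
half_gpu = tokens_per_sec(32, 16)   # half offloaded to CPU
print(f"all-GPU: {full_gpu:.1f} tok/s, half offloaded: {half_gpu:.1f} tok/s "
      f"({full_gpu / half_gpu:.1f}x slower)")
```

Under these assumptions, offloading half the layers lands in the 2–3× slowdown range quoted above.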
Memory-Efficient Attention
Standard attention computes the full N×N attention matrix, which for long contexts can consume gigabytes. Flash Attention 2 computes attention in blocks, reducing memory from O(N²) to O(N):
- Flash Attention 2: Works on NVIDIA GPUs with compute capability ≥ 8.0 (Ampere+). Automatically used by most frameworks.
- xFormers: Cross-platform alternative. Slightly slower than Flash Attention but supports older GPUs.
- Sliding Window Attention: Limits attention to a local window. Reduces both memory and compute. Used by Mistral models.
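To see why this matters, here is the memory for a fully materialized attention matrix versus a sliding window, per layer. The 32k-context, 32-head configuration is illustrative (roughly Mistral-7B-shaped), not a measurement:

```python
def attn_matrix_gb(n_tokens: int, n_heads: int, bytes_per_el: int = 2) -> float:
    """Full N x N attention scores per layer in FP16, in GB (1 GB = 1e9 bytes)."""
    return n_heads * n_tokens**2 * bytes_per_el / 1e9

def sliding_window_gb(n_tokens: int, n_heads: int, window: int,
                      bytes_per_el: int = 2) -> float:
    """Sliding-window attention only materializes N x W scores per layer."""
    return n_heads * n_tokens * min(window, n_tokens) * bytes_per_el / 1e9

# 32k context, 32 heads:
print(f"full matrix:  {attn_matrix_gb(32768, 32):.1f} GB/layer")
print(f"window=4096:  {sliding_window_gb(32768, 32, 4096):.1f} GB/layer")
```

Flash Attention avoids both: it never materializes the score matrix at all, streaming over blocks so memory stays linear in N.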
Practical GPU Buying Guide
| GPU | VRAM | Price (USD) | Max Model Size (4-bit) | Best For |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | ~$250 | ~20B | Budget LLM, SDXL |
| RTX 4070 Ti Super | 16 GB | ~$800 | ~30B | Most local AI tasks |
| RTX 4090 | 24 GB | ~$1,600 | ~45B | Power users, video gen |
| RTX 5090 | 32 GB | ~$2,000 | ~60B | 70B models, heavy workloads |
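The "Max Model Size" column follows a simple rule of thumb: 4-bit weights take ~0.5 GB per billion parameters, with a few GB reserved for activations, KV cache, and the CUDA context. The 2 GB overhead here is an assumed round number that approximately reproduces the table:

```python
def max_params_4bit(vram_gb: float, overhead_gb: float = 2.0) -> float:
    """Largest 4-bit model (in billions of params) that fits in a given VRAM budget.
    Assumes ~0.5 GB per billion params plus a flat overhead (assumption)."""
    return (vram_gb - overhead_gb) / 0.5

for vram in (12, 16, 24, 32):
    print(f"{vram} GB -> ~{max_params_4bit(vram):.0f}B params")
```

Long contexts grow the KV cache well past 2 GB, so treat these as ceilings, not guarantees.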