The VRAM Challenge

The latest AI models demand enormous amounts of GPU memory. A 70B-parameter language model requires ~140 GB in FP16, far exceeding any consumer GPU. Even "small" models like SDXL (~6.6B params across its pipeline) need ~13 GB in FP16, more than the 12 GB of an RTX 4070 Ti can hold. This guide covers practical techniques to reduce VRAM usage by 50–80% without significant quality loss.

Quantization: The Biggest Win

Quantization reduces the precision of model weights from 16-bit floats to 8-bit or 4-bit integers. This cuts memory usage proportionally while maintaining surprisingly good quality.

| Format | Bits per Weight | VRAM (7B model) | Quality Impact | 8 GB GPU? |
|---|---|---|---|---|
| FP16 | 16 | 14 GB | Baseline | āŒ No |
| INT8 (GPTQ) | 8 | 7 GB | < 1% degradation | āœ… Yes |
| INT4 (GPTQ) | 4 | 3.5 GB | 1–3% degradation | āœ… Yes |
| AWQ 4-bit | 4 | 3.8 GB | < 1.5% degradation | āœ… Yes |
| GGUF Q4_K_M | 4.5 avg | 4.1 GB | ~1% degradation | āœ… Yes |
| INT2 (QuIP#) | 2 | 1.8 GB | 5–10% degradation | āœ… Yes |
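The VRAM column above is simple arithmetic: weight memory ≈ parameters Ɨ bits Ć· 8, before any overhead for activations or the KV cache. A minimal sketch (the function name and the optional 20% overhead factor are illustrative assumptions, not measured constants):

```python
def quantized_vram_gb(params_billions: float, bits_per_weight: float,
                      overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weights at the given precision, plus a flat
    overhead factor for activations and KV cache (0 = weights only)."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params @ 8 bits = 1 GB
    return round(weight_gb * (1 + overhead), 1)

print(quantized_vram_gb(7, 16, overhead=0))  # 14.0 -> FP16 row of the table
print(quantized_vram_gb(7, 4, overhead=0))   # 3.5  -> INT4 (GPTQ) row
```

In practice, budget some extra VRAM on top of these weight-only numbers; the KV cache grows with context length.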
šŸ’” Pro Tip

For most use cases, AWQ 4-bit quantization offers the best quality-to-VRAM ratio. It's activation-aware, meaning it preserves precision for the most important weights while aggressively quantizing less critical ones.
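The activation-aware idea can be sketched in a few lines: rank weight channels by how strongly activations flow through them, keep the most salient fraction at higher precision, and quantize the rest aggressively. This toy version (pure Python, uniform symmetric quantization; all names are my own) illustrates the principle only, not the real AWQ algorithm, which rescales channels rather than mixing precisions:

```python
def quantize(w, bits):
    """Uniform symmetric quantization of one weight channel to `bits`."""
    levels = 2 ** (bits - 1) - 1                   # e.g. 7 per side for 4-bit
    scale = max(abs(x) for x in w) / levels or 1.0  # avoid div-by-zero channel
    return [round(x / scale) * scale for x in w]

def awq_like(weights, act_magnitudes, keep_frac=0.1):
    """Toy activation-aware scheme: channels seeing the largest activations
    stay at 8-bit; everything else drops to 4-bit."""
    n_keep = max(1, int(len(weights) * keep_frac))
    salient = set(sorted(range(len(weights)),
                         key=lambda i: act_magnitudes[i],
                         reverse=True)[:n_keep])
    return [quantize(ch, 8 if i in salient else 4)
            for i, ch in enumerate(weights)]
```

The key observation is the same as AWQ's: quantization error on the channels that carry the largest activations dominates output error, so protecting a small fraction of them buys most of the quality back.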

CPU Offloading

When the model exceeds GPU VRAM even after quantization, you can offload some layers to system RAM. This trades speed for capacity:

# llama.cpp GPU offloading
./main -m model.gguf -ngl 28  # 28 layers on GPU, rest on CPU
# Adjust -ngl based on your VRAM:
# 8GB GPU:  -ngl 20-28
# 12GB GPU: -ngl 32-40
# 24GB GPU: -ngl 99 (all on GPU)

Each layer offloaded to CPU adds ~10ms latency per token. For a 32-layer model with half on CPU, expect 2–3Ɨ slower generation. Still usable for most purposes.
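You can ballpark the slowdown before downloading anything. Using the ~10 ms per CPU layer figure above, and assuming ~3 ms per GPU layer purely for illustration:

```python
def offload_slowdown(total_layers: int, gpu_layers: int,
                     gpu_layer_ms: float = 3.0,
                     cpu_layer_ms: float = 10.0) -> float:
    """Per-token latency with a GPU/CPU layer split, relative to all-GPU.
    The 3 ms/10 ms per-layer costs are illustrative assumptions."""
    cpu_layers = total_layers - gpu_layers
    mixed_ms = gpu_layers * gpu_layer_ms + cpu_layers * cpu_layer_ms
    return round(mixed_ms / (total_layers * gpu_layer_ms), 1)

print(offload_slowdown(32, 16))  # 2.2 -> roughly the 2-3x quoted above
```

Real numbers depend heavily on RAM bandwidth and PCIe transfer speed, so treat this as a first-order estimate and benchmark your own setup.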

Memory-Efficient Attention

Standard attention computes the full NƗN attention matrix, which for long contexts can consume gigabytes. Flash Attention 2 computes attention in blocks, reducing memory from O(N²) to O(N):

  • Flash Attention 2: Works on NVIDIA GPUs with compute capability ≄ 8.0 (Ampere+). Automatically used by most frameworks.
  • xFormers: Cross-platform alternative. Slightly slower than Flash Attention but supports older GPUs.
  • Sliding Window Attention: Limits attention to a local window. Reduces both memory and compute. Used by Mistral models.
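The core trick behind Flash Attention's O(N) memory — computing softmax incrementally so the full NƗN score matrix is never materialized — can be shown for a single query in pure Python. This is a streaming ("online") softmax over key blocks, a sketch of the idea rather than the real fused kernel:

```python
import math

def attention_row(q, keys, values, block=2):
    """Attention output for one query, processing keys/values in blocks.
    Keeps O(block) working memory instead of storing all N scores."""
    m = float("-inf")   # running max of scores (numerical stability)
    denom = 0.0         # running softmax denominator
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        for k, v in zip(keys[start:start + block], values[start:start + block]):
            s = sum(qi * ki for qi, ki in zip(q, k))   # dot-product score
            m_new = max(m, s)
            # rescale previous partial sums when the running max changes
            corr = math.exp(m - m_new) if m != float("-inf") else 0.0
            w = math.exp(s - m_new)
            denom = denom * corr + w
            out = [o * corr + w * vi for o, vi in zip(out, v)]
            m = m_new
    return [o / denom for o in out]
```

Because each block only updates a running max, denominator, and partial output, memory stays constant in sequence length — the same invariant Flash Attention maintains per tile on the GPU.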

Practical GPU Buying Guide

| GPU | VRAM | Price (USD) | Max Model Size (4-bit) | Best For |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | ~$250 | ~20B | Budget LLM, SDXL |
| RTX 4070 Ti Super | 16 GB | ~$800 | ~30B | Most local AI tasks |
| RTX 4090 | 24 GB | ~$1,600 | ~45B | Power users, video gen |
| RTX 5090 | 32 GB | ~$2,000 | ~60B | 70B models, heavy workloads |
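The "Max Model Size (4-bit)" column follows from the same arithmetic as the quantization table: 4-bit weights cost ~0.5 bytes per parameter, and some VRAM must stay free for the KV cache. A sketch that roughly reproduces the column (the 90% usable-VRAM fraction is an illustrative assumption fitted to these rows, not a hardware constant):

```python
def max_params_billions(vram_gb: float, bits: float = 4.0,
                        usable_frac: float = 0.9) -> int:
    """Largest model (billions of params) whose quantized weights fit
    in `usable_frac` of VRAM, leaving the rest for KV cache/overhead."""
    return int(vram_gb * usable_frac / (bits / 8))

print(max_params_billions(24))  # 43 -> close to the table's ~45B for a 4090
```

For long-context work, budget a smaller usable fraction — the KV cache can easily claim several gigabytes on its own.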