LLM VRAM calculator
Running models locally? Pick a model size and quantization to see how much GPU VRAM you need — and which cards can actually run it.
Or type a custom size below.
Most popular for local use.
Estimated VRAM required
—
weights + ~20% for KV cache & overhead
Which GPUs can run it?
| GPU | VRAM | Tier | Fits? |
|---|
Estimates for inference. Long contexts grow the KV cache and need more VRAM; training needs far more. A model that "just fits" may be slow — leave headroom.
How much VRAM do you need to run an LLM?
The rule of thumb: VRAM ≈ parameters × bytes-per-parameter × ~1.2. A model's weights are the bulk of it, and the bytes-per-parameter depends on the quantization you run:
- FP16/BF16 uses 2 bytes per parameter — best quality, biggest footprint. A 7B model needs ~16 GB.
- 8-bit (Q8) halves that to ~1 byte/param with almost no quality loss.
- 4-bit (Q4_K_M) — the most popular local format — uses ~0.5 bytes/param, so a 7B model fits in ~5 GB and a 70B model in ~40 GB.
The extra ~20% covers the KV cache (which grows with your context length) and runtime overhead. For long contexts, add more headroom. Want to skip the hardware entirely? Compare cheap hosted APIs instead — often cheaper than buying a GPU until you are at serious volume.