Free tool

LLM VRAM calculator

Running models locally? Pick a model size and quantization to see how much GPU VRAM you need — and which cards can actually run it.

Model size (billion parameters)

Or type a custom size below.

Quantization

Which GPUs can run it?

GPU	VRAM	Tier	Fits?

Estimates for inference. Long contexts grow the KV cache and need more VRAM; training needs far more. A model that "just fits" may be slow — leave headroom.

How much VRAM do you need to run an LLM?

The rule of thumb: VRAM ≈ parameters × bytes-per-parameter × ~1.2. A model's weights are the bulk of it, and the bytes-per-parameter depends on the quantization you run:

FP16/BF16 uses 2 bytes per parameter — best quality, biggest footprint. A 7B model needs ~16 GB.
8-bit (Q8) halves that to ~1 byte/param with almost no quality loss.
4-bit (Q4_K_M) — the most popular local format — uses ~0.5 bytes/param, so a 7B model fits in ~5 GB and a 70B model in ~40 GB.

The extra ~20% covers the KV cache (which grows with your context length) and runtime overhead. For long contexts, add more headroom. Want to skip the hardware entirely? Compare cheap hosted APIs instead — often cheaper than buying a GPU until you are at serious volume.