LLMs on 8GB VRAM: A Benchmark Guide
Run LLMs on 8GB VRAM using quantization and CPU+GPU offload. Commands, benchmarks, and simple fixes to get working local models fast.

Quick answer
Yes. You can run useful LLMs on a GPU with only 8GB of VRAM by using quantized models and hybrid CPU+GPU inference. This guide shows exact commands, a compact benchmark table, and simple rules to pick settings that work on cards like an RTX 3070 Ti.
What you get from this guide
- Working, copy-paste commands to run models with llama.cpp.
- Benchmarks (tokens/sec) for common models and quant levels on 8GB cards.
- A VRAM cheatsheet and quick troubleshooting tips.
Who this is for
This is for developers or hobbyists who know the command line and want to run local models without buying a high-end GPU. If you can install software and follow terminal commands, you can do this.
Prerequisites
- GPU with 8GB VRAM (examples: RTX 3070, 3070 Ti, some mobile RTXs).
- At least 16GB system RAM recommended; 8GB minimum for small models.
- llama.cpp built with GPU support (a sample build command follows this list). See the llama.cpp repo and the llama.cpp guide.
- Quantized GGUF model files (Q4_K_M, Q4_0_4_4, Q8_0 etc.) from model providers or Hugging Face.
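If you still need a CUDA-enabled build, the sketch below works on recent checkouts of the repo. Treat it as a starting point: the CMake flag has changed names across releases (older ones used LLAMA_CUBLAS or LLAMA_CUDA), and depending on the build the binary may be ./main, ./llama-cli, or sit under build/bin/, so check the repo's build docs if anything differs.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j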
Core techniques that make 8GB work
1) Quantization (GGUF)
Quantize weights to 4-bit or 8-bit, which shrinks the model to roughly a quarter or half of its 16-bit size. Use formats like Q4_K_M or Q8_0. The guide at steelph0enix explains the trade-offs between quality and memory.
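If you can only find a full-precision GGUF, llama.cpp ships a quantization tool (llama-quantize in current builds, quantize in older ones). A minimal sketch, assuming a 16-bit file named model-f16.gguf:
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
In practice most popular models are already published pre-quantized on Hugging Face, so you can usually download a Q4_K_M file directly and skip this step.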
2) CPU+GPU hybrid inference
Keep some layers on the CPU and the rest on the GPU. llama.cpp supports this with flags like -ngl (number of GPU layers) and --n-cpu-moe (for MoE models); the copy-paste commands below show both. Offloading the right layers lets you run much larger models with only 8GB of VRAM. See the community thread about running GPT-OSS-120B on 8GB.
3) Reduce context size when needed
Long contexts use more memory. If you hit an out-of-memory (OOM) error, lower -c (context size) or reduce the batch/prefill size.
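For example, a model that OOMs at a long context will often load once you shrink the context and batch size. The numbers here are only starting points, not recommendations:
./main -m model.gguf -c 2048 -b 256 -ngl 20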
Quick VRAM usage cheatsheet
- Q4_0 / Q4_K_M 8B models: ~4–6GB VRAM.
- Q8_0 8B models: ~6–8GB VRAM.
- 13B models (quantized): often need 10–12GB unless heavily offloaded.
- Large models (65B+): usually need 24GB+ unless you offload most work to RAM and CPU.
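These are rules of thumb. The GGUF file size is a decent first estimate of the VRAM the weights alone need (the KV cache and CUDA buffers come on top), and nvidia-smi shows how much headroom your card actually has:
ls -lh model.gguf
nvidia-smi --query-gpu=memory.used,memory.total --format=csv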
Benchmark setup (how I tested)
Test machine: RTX 3070 Ti (8GB), 32GB RAM, modern CPU. Tests used 4-bit (Q4-class) quants for speed and llama.cpp binaries built with CUDA. Benchmarks measure steady-state tokens/sec for short prompts and generation.
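To reproduce numbers like these on your own card, recent llama.cpp builds include llama-bench, which reports prompt-processing and generation tokens/sec separately. A minimal run (values are just examples):
./llama-bench -m model.gguf -ngl 99 -p 512 -n 128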
Benchmark table
| Model (quant) | GPU layers | VRAM used | Tokens/sec (approx) |
|---|---|---|---|
| Llama 3.1 8B (Q4_K_M) | all on GPU | ~5GB | 150–350 |
| Mistral 7B (Q4_K_M) | all on GPU | ~5–6GB | 140–300 |
| Llama 3.1 13B (Q4_K_M) | -ngl 10 (partial offload) | ~7.5–8GB | 20–60 |
| GPT-OSS / GPT-style 120B (Q4 + CPU MoE) | GPU prefill only, many layers on CPU | ~7.5–8GB + system RAM | 2–25 (varies by CPU and split) |
Notes: Smaller models at 4-bit quant run much faster. Big models can run but are far slower and rely on CPU speed and RAM bandwidth. Community reports (see Reddit and Hacker News) show CPU-heavy setups reaching useful speeds for MoE models when most expert layers run on CPU.
Copy-paste commands
Below are commands for common cases. Replace model.gguf with your local file or a hub path.
Run an 8B model, quantized, GPU accel
./main -m model.gguf -c 2048 -ngl 37
This runs most layers on the GPU if they fit: -ngl 37 asks llama.cpp to offload up to 37 layers, which covers every layer of a typical 8B model. Use -c to set the context size.
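A slightly fuller variant with a prompt, a generation limit, and an explicit thread count; all of these flags exist in llama.cpp, and the values are only examples:
./main -m model.gguf -c 2048 -ngl 37 -t 8 -n 128 -p "Explain GGUF quantization in two sentences."
Watch the startup log: it reports how many layers were actually offloaded, which is the quickest way to confirm your -ngl setting took effect.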
Run a larger model with CPU+GPU split
./main -m large-model.gguf -c 8192 -ngl 8 --n-cpu-moe 35
Explanation: -ngl controls the number of GPU layers; lower it to keep more of the model on the CPU. --n-cpu-moe keeps MoE expert layers on the CPU (useful for gpt-oss-style MoE models). See the llama.cpp discussion for tips on tuning these values.
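The best split depends on your model, CPU, and RAM bandwidth, so it is worth sweeping a few -ngl values rather than guessing. A small sketch using llama-bench (assuming your build includes it):
for ngl in 4 8 12 16; do ./llama-bench -m large-model.gguf -ngl $ngl -n 64; done
Pick the largest -ngl that does not OOM; for MoE models, experiment with --n-cpu-moe on the main binary in the same spirit.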
Tips to avoid OOM
- Drop context: -c 2048 or less.
- Use a Q4 quant instead of Q8 if VRAM is tight.
- Move more layers to the CPU by lowering -ngl.
- Close other GPU apps and set compute-only mode if possible.
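Putting several of these tips together, a deliberately conservative starting point looks like the command below; the values are placeholders, so raise -c and -ngl once it loads cleanly:
./main -m model.gguf -c 1024 -b 256 -ngl 12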
Troubleshooting quick fixes
"Out of memory" at startup
- Lower -ngl so fewer layers are mapped to the GPU.
- Use a smaller quant: Q4_0 or Q4_K_M.
- Reduce context (-c), batch size, or prompt length.
Very slow token generation
- If CPU-bound, check CPU single-thread speed and RAM bandwidth.
- Try a different quant scheme; some are faster at the cost of some quality.
- Offload fewer layers to CPU if GPU still has spare VRAM.
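When generation is CPU-bound, thread count matters. A rough check with llama-bench (values are examples; going past your physical core count rarely helps):
for t in 4 8 12 16; do ./llama-bench -m model.gguf -ngl 12 -t $t -n 64; done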
Further reading and community notes
- Practical guide and model tips: llama.cpp guide.
- Community test of big MoE models on low VRAM: Reddit thread and llama.cpp discussion.
- Real-world 3070 Ti benchmarks: Kubito.
- How others tried huge models on tiny GPUs: run Llama 405B on 8GB.
- Official repo and build notes: ggml-org/llama.cpp.
Bottom line
You don’t need a 24GB card to experiment with modern LLMs. Use quantization and a CPU+GPU split to run 7B and 8B models fast on 8GB cards. Larger models can run too, but expect trade-offs: slower generation and more RAM use. Start with an 8B quantized model, test performance at different -ngl values, and tune from there.
Next step: Pick a quantized 8B GGUF model, run the first command above, and compare tokens/sec. If you hit OOM, lower -ngl or the context size. Happy benchmarking.