LLMs on 8GB VRAM: A Benchmark Guide

Run LLMs on 8GB VRAM using quantization and CPU+GPU offload. Commands, benchmarks, and simple fixes to get working local models fast.

Quick answer

Yes. You can run useful LLMs on a GPU with only 8GB of VRAM by using quantized models and hybrid CPU+GPU inference. This guide shows exact commands, a compact benchmark table, and simple rules to pick settings that work on cards like an RTX 3070 Ti.

What you get from this guide

  • Working, copy-paste commands to run models with llama.cpp.
  • Benchmarks (tokens/sec) for common models and quant levels on 8GB cards.
  • A VRAM cheatsheet and quick troubleshooting tips.

Who this is for

This is for developers or hobbyists who know the command line and want to run local models without buying a high-end GPU. If you can install software and follow terminal commands, you can do this.

Prerequisites

  • GPU with 8GB VRAM (examples: RTX 3070, 3070 Ti, some mobile RTXs).
  • At least 16GB system RAM recommended; 8GB minimum for small models.
  • llama.cpp built with GPU support (a sample CUDA build command follows this list). See the llama.cpp repo and the llama.cpp guide.
  • Quantized GGUF model files (Q4_K_M, Q4_0_4_4, Q8_0 etc.) from model providers or Hugging Face.
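
A typical CUDA-enabled build looks like the sketch below. Treat the exact flag names as an assumption and check the llama.cpp README, since build options occasionally change between releases:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release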

Core techniques that make 8GB work

1) Quantization (GGUF)

Quantize weights to 4-bit or 8-bit; a 4-bit model file is roughly a quarter the size of its fp16 original, and an 8-bit file roughly half. Use formats like Q4_K_M or Q8_0. The guide at steelph0enix explains the trade-offs between quality and memory.
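
Most people download ready-quantized GGUFs, but you can also produce one yourself with llama.cpp's bundled llama-quantize tool. A minimal sketch, assuming you already have an fp16 GGUF (the file names here are placeholders):

./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M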

2) CPU+GPU hybrid inference

Keep some layers on the CPU and others on the GPU. llama.cpp supports this with flags like -ngl (number of layers to offload to the GPU) and --n-cpu-moe (keep MoE expert weights on the CPU). Keeping most of the weights in system RAM and putting only the layers that fit on the GPU lets you run much larger models with only 8GB of VRAM. See the community thread about running GPT-OSS-120B on 8GB.

3) Reduce context size when needed

Long contexts use more memory. If you hit OOM, lower -c (context size) or reduce the batch/prefill size (-b).
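
To see why context matters: the KV cache grows linearly with context length. A rough estimate, assuming an fp16 KV cache and Llama-3-8B-style dimensions (32 layers, 8 KV heads, head size 128):

KV cache ≈ 2 (K and V) × layers × context × kv_heads × head_size × 2 bytes
         ≈ 2 × 32 × 8192 × 8 × 128 × 2 bytes ≈ 1GB at 8K context, or ~0.25GB at -c 2048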

Quick VRAM usage cheatsheet

  • Q4_0 / Q4_K_M 8B models: ~4–6GB VRAM.
  • Q8_0 8B models: ~6–8GB VRAM.
  • 13B models (quantized): often need 10–12GB unless heavily offloaded.
  • Large models (65B+): usually need 24GB+ unless you offload most work to RAM and CPU.
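
To see what a model is actually taking on your card, watch VRAM while it loads. On NVIDIA GPUs:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1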

Benchmark setup (how I tested)

Test machine: RTX 3070 Ti (8GB), 32GB RAM, modern CPU. Tests used Q4_0 quant for speed and llama.cpp binaries built with CUDA. Benchmarks measure steady-state tokens/sec for short prompts and generation.
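
To reproduce numbers like these on your own hardware, llama.cpp ships a llama-bench tool. A typical invocation (not necessarily the exact one used here) that measures prompt processing and generation speed:

./llama-bench -m model.gguf -ngl 99 -p 512 -n 128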

Benchmark table

| Model (quant) | GPU layers | VRAM used | Tokens/sec (approx.) |
| --- | --- | --- | --- |
| Llama 3.1 8B (Q4_K_M) | all on GPU | ~5GB | 150–350 |
| Mistral 7B (Q4_K_M) | all on GPU | ~5–6GB | 140–300 |
| Llama 3.1 13B (Q4_K_M) | -ngl 10 (partial offload) | ~7.5–8GB | 20–60 |
| GPT-OSS / GPT-style 120B (Q4 + CPU MoE) | GPU prefill only, many layers on CPU | ~7.5–8GB + system RAM | 2–25 (varies by CPU and split) |

Notes: Smaller models at 4-bit quant run much faster. Big models can run but are far slower and rely on CPU speed and RAM bandwidth. Community reports (see Reddit and Hacker News) show CPU-heavy setups reaching useful speeds for MoE models when most expert layers run on CPU.

Copy-paste commands

Below are commands for common cases. Replace model.gguf with your local file or a hub path.
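
If you don't have a GGUF yet, one common way to fetch one is huggingface-cli; the repository and file names below are placeholders, so substitute a real quantized model:

huggingface-cli download <org>/<model>-GGUF <model>.Q4_K_M.gguf --local-dir models/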

Run an 8B model, quantized, GPU accel

./main -m model.gguf -c 2048 -ngl 37

This offloads up to 37 layers to the GPU, which for most quantized 8B models means the whole model runs on the GPU if it fits. Use -c to set the context size. (Recent llama.cpp builds name this binary llama-cli rather than main.)
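
The same flags work with llama.cpp's llama-server binary if you would rather expose an OpenAI-compatible HTTP endpoint than an interactive prompt; a minimal sketch:

./llama-server -m model.gguf -c 2048 -ngl 37 --port 8080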

Run a larger model with CPU+GPU split

./main -m large-model.gguf -c 8192 -ngl 8 --n-cpu-moe 35

Explanation: -ngl controls the number of layers offloaded to the GPU; lower it to keep more of the model on the CPU. --n-cpu-moe keeps MoE expert weights on the CPU (useful for gpt-oss-style MoE models, where the experts make up most of the weights). See the llama.cpp discussion for tips on tuning these values.
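
There is no universal right value for -ngl; it depends on the model's layer count and layer sizes. One simple approach is to sweep a few values with llama-bench and keep the fastest setting that doesn't run out of memory, for example:

for ngl in 4 8 12 16; do ./llama-bench -m large-model.gguf -ngl $ngl -n 64; done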

Tips to avoid OOM

  • Drop context: -c 2048 or less.
  • Use Q4 quant instead of Q8 if VRAM is tight.
  • Lower -ngl to keep more layers on the CPU.
  • Close other GPU apps and set compute-only mode if possible.

Troubleshooting quick fixes

"Out of memory" at startup

  1. Lower -ngl so fewer layers are mapped to GPU.
  2. Use smaller quant: Q4_0 or Q4_K_M.
  3. Reduce context (-c), batch, or prompt length.

Very slow token generation

  • If CPU-bound, check CPU single-thread speed and RAM bandwidth (see the thread-count example after this list).
  • Try a different quant scheme—some are faster at cost of quality.
  • If the GPU still has spare VRAM, raise -ngl to move more layers back onto it.
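
When much of the model runs on the CPU, the -t flag (CPU thread count) matters too. A reasonable starting point, assuming a typical desktop CPU, is one thread per physical core, e.g. for an 8-core chip:

./main -m large-model.gguf -c 4096 -ngl 8 --n-cpu-moe 35 -t 8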

Bottom line

You don’t need a 24GB card to experiment with modern LLMs. Use quantization and a CPU+GPU split to run 7B and 8B models fast on 8GB cards. Larger models can run too, but expect trade-offs: slower generation and more system RAM use. Start with a quantized 8B model, measure tokens/sec, and tune -ngl from there.

Next step: Pick a quantized 8B GGUF model, run the first command above, and compare tokens/sec. If you hit OOM, lower -ngl or context size. Happy benchmarking.
