
Local LLM Performance: The 2025 Benchmark

Benchmark local LLMs in 2025 with a simple method. Get hardware tips and a free kit to run your own tests.


What you’ll get: a simple way to measure local LLMs, pick the right model for your hardware, and ship faster with fewer surprises.

Download the Benchmark Kit (scripts, prompt sets, and a results template). Run everything in under 60 minutes.

TL;DR: Best picks by hardware tier (quick table)

Tier | Example hardware | Model picks (2025) | Quantization | Speed target | Primary use
CPU-only | Modern 8 – 16 core CPU | Mistral 7B, Qwen 7B, Gemma 2 7B | Q4_K_M / Q5_K_M | 15 – 35 tok/s | Summaries, light RAG
Apple Silicon | M3/M4 Pro or Max | Llama 3.1 8B, Phi-3.5 mini, Qwen 7B | Q4_K_M | 25 – 60 tok/s | Chat, docs, dev notes
Mid GPU | RTX 3060 – 4070 | Llama 3.1 8B/70B (split), Qwen2.5 14B | 8B FP16, or Q3_K_M for larger | 50 – 120 tok/s | App UX, code assist
High GPU | RTX 4080/4090 | Llama 3.1 70B, Qwen2.5 32B, Mixtral 8x7B | FP16/FP8 (8x7B), or Q4/Q5 for 70B | 80 – 200+ tok/s | Prod RAG, coding, long docs
Workstation | 2 – 4x high-VRAM GPUs | 70B – 120B class (sharded) | BF16/FP16 | 100 – 300+ tok/s | Heavy batch, agentic tasks

Numbers are ranges from community reports, tools like LocalScore, and token speed simulators such as this benchmark tool. Your results will vary by driver, memory bandwidth, and model build.

Why benchmark local LLMs in 2025?

What metrics should you measure?

Short answer: measure both speed and quality, plus memory and stability. That’s how you avoid surprises in production.

Speed and latency

  • Tokens per second (steady-state). Higher is better for UX.
  • Time to first token (TTFT). Lower is better for perceived speed.
  • Throughput under concurrency (N parallel chats).
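
A quick way to get both numbers is Ollama's local HTTP API, which returns token counts and durations with every response. A minimal sketch, assuming Ollama is running on its default port (11434) and jq is installed; the model name is just an example:

# Steady-state speed from Ollama's response metadata (eval_count tokens over eval_duration ns).
# approx_ttft_ms is a rough proxy (model load + prompt eval); true TTFT needs a streaming client.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Write three bullet points about SSDs.", "stream": false}' \
  | jq '{tok_per_s: (.eval_count / (.eval_duration / 1000000000)),
         approx_ttft_ms: ((.load_duration + .prompt_eval_duration) / 1000000)}'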

Memory and compute

  • VRAM/RAM in use at idle and under load.
  • CPU/GPU utilization and thermals (watch throttling).
  • Context length and effective length (how it handles long inputs).
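
To catch throttling and memory pressure while a test runs, a simple monitor in a second terminal is enough. A sketch for NVIDIA GPUs (requires nvidia-smi); on Apple Silicon, Activity Monitor or sudo powermetrics covers similar ground:

# Log VRAM use, GPU utilization, and temperature once per second during a benchmark run
nvidia-smi --query-gpu=memory.used,utilization.gpu,temperature.gpu --format=csv -l 1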

Quality and safety

  • Task accuracy on your prompts: summarization, code, or RAG.
  • Hallucination/faithfulness via spot checks or LLM-as-judge.
  • Domain checks where needed. Studies like this central bank paper (and its PDF version) show local LLMs can handle finance text; see also this medical example.

Stability and ops

  • Crash rate and out-of-memory events.
  • Warm vs cold start times.
  • Energy draw if running 24/7.

For a broad eval overview, skim CSET’s explainer and this 2025 arXiv comparison.

What tools should you use?

How do you run the benchmark?

Use our kit as a template. It standardizes prompts, measures tokens/sec and TTFT, and logs memory.

  1. Install Ollama and pull models.
  2. Run speed tests (steady-state) and latency tests (TTFT).
  3. Run task tests on your target workload (e.g., summarization, coding, or RAG).
  4. Log VRAM/RAM and errors. Save results to CSV.
# 1) Install Ollama
# macOS: brew install ollama | Windows/Linux: see docs
# 2) Pull a few models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:14b

# 3) Speed + latency quick check
# (a) Steady-state speed
/time-ollama --model llama3.1:8b --prompt "Write three bullet points about SSDs." --repeat 5
# (b) Time-to-first-token (TTFT)
/latency-ollama --model llama3.1:8b --prompt "Explain JSON in one sentence." --repeat 10

# 4) Task eval (summarization)
ollama run mistral:7b "Summarize the following transcript in 8 bullets: ..."

# 5) RAG eval (LLM-as-judge optional)
rag-bench --embed e5-small --rerank bge-base --generate llama3.1:8b --eval judge=llama3.1:8b

# 6) Export results
bench-export --out results.csv
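
If you prefer to log by hand instead of using bench-export, a simple CSV header along these lines works; the column names are only a suggestion, not the kit's schema:

# Hand-rolled results log (columns are illustrative)
echo "date,model,quant,prompt_tokens,tok_per_s,ttft_ms,vram_gb,errors" > results.csv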

Tip: keep tests short and repeat 5 – 10 times to smooth noise. If you test GPU models, monitor clock speeds and temps; throttling can hide real capacity.

How do you pick the right model for your hardware?

  1. Start with your budget and device: If you’re CPU-only, stick to 7B class with Q4 – Q5 quantization. If you own a 4090, try 70B split or Mixture-of-Experts (MoE) models.
  2. Pick speed first: Hit your target tok/s at your typical prompt size. Sub-20 tok/s feels slow for chat.
  3. Then check quality: Run 10 – 20 prompts from your real workload. Track errors and hallucinations.
  4. Lock the config: Record model, quantization, params (top_p, temperature), and seed. You need this for reproducibility.
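
With Ollama, one way to lock a config is a Modelfile that pins the base model and sampling parameters, then a named build you benchmark against. A minimal sketch; the values are examples, not recommendations:

# Pin model, sampling params, seed, and context in a Modelfile for reproducible runs
cat > Modelfile.bench <<'EOF'
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER seed 42
PARAMETER num_ctx 4096
EOF
ollama create llama31-8b-bench -f Modelfile.bench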

For a sanity check across models, you can browse a public leaderboard like Vellum’s (helpful, but still test locally, since your prompts are unique).

What do typical results look like in 2025?

These are example ranges you can expect on common rigs. Always test your exact setup.

Rig | Model/quant | Tokens/sec | TTFT | VRAM/RAM
M3 Pro (18GB) | Mistral 7B Q4_K_M | 25 – 45 | 150 – 400 ms | 8 – 10 GB RAM
RTX 3060 (12GB) | Llama 3.1 8B FP16 | 45 – 80 | 120 – 300 ms | 10 – 12 GB VRAM
RTX 4090 (24GB) | Llama 3.1 70B split Q4 | 80 – 140 | 200 – 450 ms | 22 – 30 GB VRAM
CPU 12-core | Qwen 7B Q5_K_M | 18 – 30 | 250 – 600 ms | 7 – 9 GB RAM

Why it matters: your UX depends on both TTFT and steady-state tok/s. See this CPU vs GPU benchmark explainer for context, and these leaderboard trends for quality signals.

How do you improve speed without breaking quality?

  • Quantization: prefer GGUF Q4_K_M for 7B – 14B. Use Q5 when you need sharper answers. Learn the trade-offs in our Quantization Explained.
  • Right-size context: don’t over-inflate tokens. Summarize or chunk first.
  • Cache smartly: reuse embeddings and pre-prompts. Warm your model at boot.
  • GPU settings: pin the model to GPU, avoid swapping. Watch VRAM headroom (10 – 20% slack helps stability).
  • CPU threads: match threads to physical cores; avoid hyperthread overload.
  • Model choice: some code models fly on CPU. See the practical notes in this 2025 benchmark and ops tips in this deployment guide.
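
If you serve through Ollama, several of these knobs (context size, GPU offload, CPU threads) can be set per request via the options field. A hedged sketch; the values depend entirely on your hardware:

# num_ctx caps the context window, num_gpu is the layer count to offload (a high value offloads all),
# and num_thread should match your physical core count
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize: ...",
  "stream": false,
  "options": { "num_ctx": 4096, "num_gpu": 99, "num_thread": 8 }
}' | jq -r '.response'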

How do you measure accuracy for your task?

Use a small, labeled set and score it the same way each time.

  1. Create a gold set (20 – 50 prompts) that reflects your real task.
  2. Score with rubrics: correctness, relevance, and faithfulness (1 – 5 scale).
  3. Automate checks for format and factual claims where possible.
  4. Review drift monthly. Re-run the same set to catch regressions.
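
A minimal LLM-as-judge loop over that gold set, with ollama acting as both generator and judge. The prompts file and rubric wording are assumptions for illustration, not part of the kit:

# Score each gold-set prompt 1-5; gold_prompts.txt holds one prompt per line (hypothetical file)
while IFS= read -r prompt; do
  answer=$(ollama run llama3.1:8b "$prompt")
  score=$(ollama run qwen2.5:14b "Rate this answer from 1-5 for correctness, relevance, and faithfulness. Reply with a single number.
Question: $prompt
Answer: $answer")
  printf '%s\t%s\n' "$score" "$prompt" >> scores.tsv
done < gold_prompts.txt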

Need ideas? This eval guide and metric catalog show options beyond BLEU/ROUGE. In niche domains, see financial-text evaluation in this study and the public summary from the regional Fed.

What’s a simple, repeatable test plan?

  1. Speed suite: measure TTFT and tokens/sec on 3 prompt sizes (short, medium, long) for each model/quantization.
  2. Quality suite: run your labeled set for summarization, code, or RAG. Score and store.
  3. Stability suite: 30-minute soak with parallel chats (N=4 – 8). Track crashes and OOMs.
  4. Regression: re-run after driver/model updates. Compare to last good build.
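
For the stability suite, a rough soak-test sketch: N parallel clients hitting the local API for a fixed window while you watch memory, temps, and error codes. Client count, duration, and prompt are placeholders:

# 30-minute soak with 4 parallel clients against the local Ollama API; log HTTP status per request
end=$(( $(date +%s) + 1800 ))
for i in 1 2 3 4; do
  (
    while [ "$(date +%s)" -lt "$end" ]; do
      curl -s -o /dev/null -w "client $i: %{http_code}\n" \
        http://localhost:11434/api/generate \
        -d '{"model": "llama3.1:8b", "prompt": "Write one sentence about SSDs.", "stream": false}'
    done
  ) &
done
wait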

Want a visual? Local hardware scores in LocalScore plus speed estimates from the token simulator can forecast whether a bigger model fits your device.

Does local beat cloud?

Often, for privacy and predictable cost. For raw accuracy on complex tasks, top cloud models can still win. A fair comparison from 2025: local vs cloud models and live leaderboards like this one. Your call depends on compliance needs, budget, and latency targets.

Common configs that work well

  • CPU-only dev box: Qwen 7B Q4/Q5 for notes and summaries. Keep context under 4k.
  • MacBook Pro (M3/M4): Llama 3.1 8B Q4 for chat + RAG. Great travel setup.
  • RTX 4070/4080: 8B FP16 or 14B Q4 for snappy UX; try Mixtral 8x7B for broader skills.
  • RTX 4090: 70B split Q4 for high-quality answers; mind VRAM and temps.

FAQ

Can I run a 70B model without a big GPU?

Yes, with quantization and CPU/GPU splits, but it’s slower. For real-time chat, a strong GPU helps.
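
If you run llama.cpp directly, the CPU/GPU split comes down to how many layers you offload. A sketch; the layer count and GGUF path are placeholders for your own build:

# Offload ~40 of a 70B model's layers to a 24GB GPU and keep the rest on CPU
./llama-cli -m ./models/llama-3.1-70b-instruct.Q4_K_M.gguf \
  --n-gpu-layers 40 --ctx-size 4096 \
  -p "Explain JSON in one sentence."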

What’s the fastest way to boost speed?

Use a smaller model with better prompts, drop quant to Q4, and trim context. Those three changes usually deliver a 2 – 3x speedup.

Is BLEU/ROUGE enough to judge quality?

No. Add task-specific checks and human review. See metric advice in this evaluation guide.

Do local LLMs work for regulated data?

They can. That’s the point: data stays local. Ensure disk encryption, audit logs, and role controls.

How do I choose quantization?

Start with Q4_K_M for speed, move to Q5 if answers feel thin. Learn more in Quantization Explained.

Where do I learn setup basics?

Follow our Ollama Setup and hardware notes in Best GPUs for AI. Also see community tools in this 2025 roundup.


Next steps

  1. Pick two models per tier (e.g., 7B and 14B).
  2. Run the speed + quality suites from the kit.
  3. Lock the fastest config that meets your quality bar.
  4. Re-test monthly or after driver/model updates.

Bottom line: speed gets you UX, quality gets you trust. Measure both. Then standardize your stack and ship.

Local LLM · Ollama · Benchmarking · Quantization
