
Local LLM Performance: The 2025 Benchmark

Benchmark local LLMs in 2025 with a simple method. Get hardware tips and a free kit to run your own tests.


What you’ll get: a simple way to measure local LLMs, pick the right model for your hardware, and ship faster with fewer surprises.

Download the Benchmark Kit (scripts, prompt sets, and a results template). Run everything in under 60 minutes.

TL;DR: Best picks by hardware tier (quick table)

Tier | Example hardware | Model picks (2025) | Quantization | Speed target | Primary use
CPU-only | Modern 8 – 16 core CPU | Mistral 7B, Qwen 7B, Gemma 2 7B | Q4_K_M / Q5_K_M | 15 – 35 tok/s | Summaries, light RAG
Apple Silicon | M3/M4 Pro or Max | Llama 3.1 8B, Phi-3.5 mini, Qwen 7B | Q4_K_M | 25 – 60 tok/s | Chat, docs, dev notes
Mid GPU | RTX 3060 – 4070 | Llama 3.1 8B/70B (split), Qwen2.5 14B | 8B FP16, or Q3_K_M for larger | 50 – 120 tok/s | App UX, code assist
High GPU | RTX 4080/4090 | Llama 3.1 70B, Qwen2.5 32B, Mixtral 8x7B | FP16/FP8 (8x7B), or Q4/Q5 for 70B | 80 – 200+ tok/s | Prod RAG, coding, long docs
Workstation | 2 – 4x high-VRAM GPUs | 70B – 120B class (sharded) | BF16/FP16 | 100 – 300+ tok/s | Heavy batch, agentic tasks

Numbers are ranges from community reports, tools like LocalScore, and token speed simulators such as this benchmark tool. Your results will vary by driver, memory bandwidth, and model build.

Why benchmark local LLMs in 2025?

What metrics should you measure?

Short answer: measure both speed and quality, plus memory and stability. That’s how you avoid surprises in production.

Speed and latency

  • Tokens per second (steady-state). Higher is better for UX.
  • Time to first token (TTFT). Lower is better for perceived speed.
  • Throughput under concurrency (N parallel chats).
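
A quick way to get both numbers is Ollama's local HTTP API, which returns token counts and durations with every response. A minimal sketch, assuming Ollama is running on its default port (11434) and jq is installed; the model name is just an example:

# Steady-state speed from Ollama's response metadata (eval_count tokens over eval_duration ns).
# approx_ttft_ms is a rough proxy (model load + prompt eval); true TTFT needs a streaming client.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Write three bullet points about SSDs.", "stream": false}' \
  | jq '{tok_per_s: (.eval_count / (.eval_duration / 1000000000)),
         approx_ttft_ms: ((.load_duration + .prompt_eval_duration) / 1000000)}'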

Memory and compute

  • VRAM/RAM in use at idle and under load.
  • CPU/GPU utilization and thermals (watch throttling).
  • Context length and effective length (how it handles long inputs).
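
To catch throttling and memory pressure while a test runs, a simple monitor in a second terminal is enough. A sketch for NVIDIA GPUs (requires nvidia-smi); on Apple Silicon, Activity Monitor or sudo powermetrics covers similar ground:

# Log VRAM use, GPU utilization, and temperature once per second during a benchmark run
nvidia-smi --query-gpu=memory.used,utilization.gpu,temperature.gpu --format=csv -l 1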

Quality and safety

  • Task accuracy on your prompts: summarization, code, or RAG.
  • Hallucination/faithfulness via spot checks or LLM-as-judge.
  • Domain checks where needed. Studies like this central bank paper (and its PDF version) show local LLMs can handle finance text; see also this medical example.

Stability and ops

  • Crash rate and out-of-memory events.
  • Warm vs cold start times.
  • Energy draw if running 24/7.

For a broad eval overview, skim CSET’s explainer and this 2025 arXiv comparison.

What tools should you use?

How do you run the benchmark?

Use our kit as a template. It standardizes prompts, measures tokens/sec and TTFT, and logs memory.

  1. Install Ollama and pull models.
  2. Run speed tests (steady-state) and latency tests (TTFT).
  3. Run task tests on your target workload (e.g., summarization, coding, or RAG).
  4. Log VRAM/RAM and errors. Save results to CSV.
# 1) Install Ollama
# macOS: brew install ollama | Windows/Linux: see docs
# 2) Pull a few models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:14b

# 3) Speed + latency quick check
# (a) Steady-state speed
/time-ollama --model llama3.1:8b --prompt "Write three bullet points about SSDs." --repeat 5
# (b) Time-to-first-token (TTFT)
/latency-ollama --model llama3.1:8b --prompt "Explain JSON in one sentence." --repeat 10

# 4) Task eval (summarization)
ollama run mistral:7b "Summarize the following transcript in 8 bullets: ..."

# 5) RAG eval (LLM-as-judge optional)
rag-bench --embed e5-small --rerank bge-base --generate llama3.1:8b --eval judge=llama3.1:8b

# 6) Export results
bench-export --out results.csv
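
If you prefer to log by hand instead of using bench-export, a simple CSV header along these lines works; the column names are only a suggestion, not the kit's schema:

# Hand-rolled results log (columns are illustrative)
echo "date,model,quant,prompt_tokens,tok_per_s,ttft_ms,vram_gb,errors" > results.csv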

Tip: keep tests short and repeat 5 – 10 times to smooth noise. If you test GPU models, monitor clock speeds and temps; throttling can hide real capacity.

How do you pick the right model for your hardware?

  1. Start with your budget and device: If you’re CPU-only, stick to 7B class with Q4 – Q5 quantization. If you own a 4090, try 70B split or Mixture-of-Experts (MoE) models.
  2. Pick speed first: Hit your target tok/s at your typical prompt size. Sub-20 tok/s feels slow for chat.
  3. Then check quality: Run 10 – 20 prompts from your real workload. Track errors and hallucinations.
  4. Lock the config: Record model, quantization, params (top_p, temperature), and seed. You need this for reproducibility.
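
With Ollama, one way to lock a config is a Modelfile that pins the base model and sampling parameters, then a named build you benchmark against. A minimal sketch; the values are examples, not recommendations:

# Pin model, sampling params, seed, and context in a Modelfile for reproducible runs
cat > Modelfile.bench <<'EOF'
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER seed 42
PARAMETER num_ctx 4096
EOF
ollama create llama31-8b-bench -f Modelfile.bench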

For a sanity check across models, you can browse a public leaderboard like Vellum’s (helpful, but still test locally, since your prompts are unique).

What do typical results look like in 2025?

These are example ranges you can expect on common rigs. Always test your exact setup.

Rig | Model/quant | Tokens/sec | TTFT | VRAM/RAM
M3 Pro (18GB) | Mistral 7B Q4_K_M | 25 – 45 | 150 – 400 ms | 8 – 10 GB RAM
RTX 3060 (12GB) | Llama 3.1 8B FP16 | 45 – 80 | 120 – 300 ms | 10 – 12 GB VRAM
RTX 4090 (24GB) | Llama 3.1 70B split Q4 | 80 – 140 | 200 – 450 ms | 22 – 30 GB VRAM
CPU 12-core | Qwen 7B Q5_K_M | 18 – 30 | 250 – 600 ms | 7 – 9 GB RAM

Why it matters: your UX depends on both TTFT and steady-state tok/s. See this CPU vs GPU benchmark explainer for context, and these leaderboard trends for quality signals.

How do you improve speed without breaking quality?

  • Quantization: prefer GGUF Q4_K_M for 7B – 14B. Use Q5 when you need sharper answers. Learn the trade-offs in our Quantization Explained.
  • Right-size context: don’t over-inflate tokens. Summarize or chunk first.
  • Cache smartly: reuse embeddings and pre-prompts. Warm your model at boot.
  • GPU settings: pin the model to GPU, avoid swapping. Watch VRAM headroom (10 – 20% slack helps stability).
  • CPU threads: match threads to physical cores; avoid hyperthread overload.
  • Model choice: some code models fly on CPU. See the practical notes in this 2025 benchmark and ops tips in this deployment guide.
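
If you serve through Ollama, several of these knobs (context size, GPU offload, CPU threads) can be set per request via the options field. A hedged sketch; the values depend entirely on your hardware:

# num_ctx caps the context window, num_gpu is the layer count to offload (a high value offloads all),
# and num_thread should match your physical core count
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize: ...",
  "stream": false,
  "options": { "num_ctx": 4096, "num_gpu": 99, "num_thread": 8 }
}' | jq -r '.response'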

How do you measure accuracy for your task?

Use a small, labeled set and score it the same way each time.

  1. Create a gold set (20 – 50 prompts) that reflects your real task.
  2. Score with rubrics: correctness, relevance, and faithfulness (1 – 5 scale).
  3. Automate checks for format and factual claims where possible.
  4. Review drift monthly. Re-run the same set to catch regressions.
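
A minimal LLM-as-judge loop over that gold set, with ollama acting as both generator and judge. The prompts file and rubric wording are assumptions for illustration, not part of the kit:

# Score each gold-set prompt 1-5; gold_prompts.txt holds one prompt per line (hypothetical file)
while IFS= read -r prompt; do
  answer=$(ollama run llama3.1:8b "$prompt")
  score=$(ollama run qwen2.5:14b "Rate this answer from 1-5 for correctness, relevance, and faithfulness. Reply with a single number.
Question: $prompt
Answer: $answer")
  printf '%s\t%s\n' "$score" "$prompt" >> scores.tsv
done < gold_prompts.txt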

Need ideas? This eval guide and metric catalog show options beyond BLEU/ROUGE. In niche domains, see financial-text evaluation in this study and the public summary from the regional Fed.

What’s a simple, repeatable test plan?

  1. Speed suite: measure TTFT and tokens/sec on 3 prompt sizes (short, medium, long) for each model/quantization.
  2. Quality suite: run your labeled set for summarization, code, or RAG. Score and store.
  3. Stability suite: 30-minute soak with parallel chats (N=4 – 8). Track crashes and OOMs.
  4. Regression: re-run after driver/model updates. Compare to last good build.
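
For the stability suite, a rough soak-test sketch: N parallel clients hitting the local API for a fixed window while you watch memory, temps, and error codes. Client count, duration, and prompt are placeholders:

# 30-minute soak with 4 parallel clients against the local Ollama API; log HTTP status per request
end=$(( $(date +%s) + 1800 ))
for i in 1 2 3 4; do
  (
    while [ "$(date +%s)" -lt "$end" ]; do
      curl -s -o /dev/null -w "client $i: %{http_code}\n" \
        http://localhost:11434/api/generate \
        -d '{"model": "llama3.1:8b", "prompt": "Write one sentence about SSDs.", "stream": false}'
    done
  ) &
done
wait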

Want a visual? Local hardware scores in LocalScore plus speed estimates from the token simulator can forecast whether a bigger model fits your device.

Does local beat cloud?

Often, for privacy and predictable cost. For raw accuracy on complex tasks, top cloud models can still win. A fair comparison from 2025: local vs cloud models and live leaderboards like this one. Your call depends on compliance needs, budget, and latency targets.

Common configs that work well

  • CPU-only dev box: Qwen 7B Q4/Q5 for notes and summaries. Keep context under 4k.
  • MacBook Pro (M3/M4): Llama 3.1 8B Q4 for chat + RAG. Great travel setup.
  • RTX 4070/4080: 8B FP16 or 14B Q4 for snappy UX; try Mixtral 8x7B for broader skills.
  • RTX 4090: 70B split Q4 for high-quality answers; mind VRAM and temps.

FAQ

Can I run a 70B model without a big GPU?

Yes, with quantization and CPU/GPU splits, but it’s slower. For real-time chat, a strong GPU helps.
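
If you run llama.cpp directly, the CPU/GPU split comes down to how many layers you offload. A sketch; the layer count and GGUF path are placeholders for your own build:

# Offload ~40 of a 70B model's layers to a 24GB GPU and keep the rest on CPU
./llama-cli -m ./models/llama-3.1-70b-instruct.Q4_K_M.gguf \
  --n-gpu-layers 40 --ctx-size 4096 \
  -p "Explain JSON in one sentence."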

What’s the fastest way to boost speed?

Use a smaller model with better prompts, drop quant to Q4, and trim context. Those three changes usually deliver a 2 – 3x speedup.

Is BLEU/ROUGE enough to judge quality?

No. Add task-specific checks and human review. See metric advice in this evaluation guide.

Do local LLMs work for regulated data?

They can. That’s the point: data stays local. Ensure disk encryption, audit logs, and role controls.

How do I choose quantization?

Start with Q4_K_M for speed, move to Q5 if answers feel thin. Learn more in Quantization Explained.

Where do I learn setup basics?

Follow our Ollama Setup and hardware notes in Best GPUs for AI. Also see community tools in this 2025 roundup.


Next steps

  1. Pick two models per tier (e.g., 7B and 14B).
  2. Run the speed + quality suites from the kit.
  3. Lock the fastest config that meets your quality bar.
  4. Re-test monthly or after driver/model updates.

Bottom line: speed gets you UX, quality gets you trust. Measure both. Then standardize your stack and ship.

Local LLM · Ollama · Benchmarking · Quantization
