NVIDIA Blackwell vs Hopper: A vLLM Benchmark

We benchmark vLLM on NVIDIA Blackwell vs Hopper. See scripts, results, and cost tips to pick the right GPU for throughput and latency.

Quick answer

Want the fastest vLLM serving today? Pick NVIDIA Blackwell (B200). In public runs it delivers much higher throughput than Hopper (H100/H200) at similar latency.

In new InferenceMAX results and the vLLM team write-up, Blackwell shows up to 4.3× more throughput on gpt-oss 120B and up to 3.7× on Llama 3.3 70B, at similar SLAs.

If you already own Hopper, you still get great value—just tune concurrency. But for new buys focused on scale, Blackwell wins.

Why this guide

This article gives you a practical, copy-paste path to run a vLLM benchmark on both Blackwell and Hopper. You get clear steps, example commands, and a simple cost-per-million-tokens calculator. We link to public data so you can compare your numbers to industry results.

What we measured

  • Throughput (tokens/second, total and output TPS)
  • Latency: time to first token (TTFT) and time per output token (TPOT)
  • Pareto trade-off: the best throughput you can get at a given latency target

We use vLLM’s own tooling (vllm bench serve) and follow public setups from vLLM recipes and InferenceMAX. NVIDIA and vLLM have shipped kernel and runtime upgrades for Blackwell that improve overlap scheduling, graph fusions, and more—see NVIDIA’s InferenceMAX v1 summary and this Blackwell performance brief.

Key results at a glance

| Scenario | Blackwell (B200) | Hopper (H100/H200) | Notes |
| --- | --- | --- | --- |
| gpt-oss 120B, wide interactivity | Up to ~4.3× more throughput | Baseline | Reported by vLLM x InferenceMAX |
| Llama 3.3 70B, long output sequence length (OSL) | Up to ~3.7× more throughput | Baseline | Also from vLLM x InferenceMAX |
| Busy production loads | Higher TPS at similar SLA | Strong, but lower TPS | See InferenceMAX |
| Single-stream latency | Competitive; framework-dependent | Competitive | Framework choices can change winners (comparison) |

Important: results vary by model, precision, and framework. While this guide focuses on vLLM, some public reports show TensorRT-LLM leading on B200 for certain setups. That’s expected; different engines optimize different paths. Here we give you a clean, reproducible vLLM path so you can measure your own workload.

Benchmark setup (reproducible)

Hardware and software

  • GPU: B200 (Blackwell) vs H100/H200 (Hopper)
  • Container base: runpod/pytorch:2.8.0-py3.11-cuda12.8.1 (used in public tests)
  • vLLM: For Blackwell, build from source if pip lags support; for Hopper, pip is fine (notes)
  • Driver/CUDA: Match container CUDA and host drivers (quick check below)
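
Before benchmarking, it is worth confirming that the host driver and the container's CUDA stack line up. A minimal check inside the container (output will vary with your host, and nvcc may be absent on slimmer images):

# Driver version and the CUDA version it supports
nvidia-smi
# CUDA toolkit in the container, if the image ships nvcc
nvcc --version
# CUDA build that PyTorch (and therefore vLLM) was compiled against
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_name(0))"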

Models (examples)

  • nvidia/Llama-3.3-70B-Instruct-FP4 (used in the Blackwell commands below)
  • meta-llama/Llama-3.1-8B-Instruct (used in the Hopper commands below)
  • gpt-oss 120B and Llama 3.3 70B (the models behind the public InferenceMAX numbers cited above)

Blackwell-ready container (build from source)

# Start a clean CUDA 12.8 image
docker run --gpus all -it --rm runpod/pytorch:2.8.0-py3.11-cuda12.8.1 bash
# Install vLLM from source (Blackwell often needs this)
pip install --upgrade pip
apt-get update && apt-get install -y git
git clone https://github.com/vllm-project/vllm.git && cd vllm
pip install -e .
# Optional: flash-attn for performance
pip install flash-attn --no-build-isolation
# Start the vLLM server for the model in another terminal, e.g.:
#   vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 --trust-remote-code
# Then run the serving benchmark client against it (example: Llama 3.3 70B FP4)
vllm bench serve \
 --host 0.0.0.0 \
 --port 8000 \
 --model nvidia/Llama-3.3-70B-Instruct-FP4 \
 --trust-remote-code \
 --dataset-name random \
 --random-input-len 1024 \
 --random-output-len 1024 \
 --ignore-eos \
 --max-concurrency 512 \
 --num-prompts 2560 \
 --dtype bfloat16 \
 --save-result --result-filename vllm_b200.json

Why build from source on Blackwell? Some users hit “no kernel image is available.” The fix is compiling vLLM against the new SM target, as noted in this multi-GPU benchmark write-up.

Hopper-ready container (pip install)

# Same clean CUDA 12.8 image for an apples-to-apples comparison
docker run --gpus all -it --rm runpod/pytorch:2.8.0-py3.11-cuda12.8.1 bash
pip install --upgrade pip
# On Hopper, the prebuilt PyPI wheel is sufficient
pip install vllm flash-attn --no-build-isolation
# Start the server in another terminal (vllm serve meta-llama/Llama-3.1-8B-Instruct),
# then run the serving benchmark client against it
vllm bench serve \
 --host 0.0.0.0 \
 --port 8000 \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --dataset-name random \
 --random-input-len 1024 \
 --random-output-len 1024 \
 --ignore-eos \
 --max-concurrency 512 \
 --num-prompts 2560 \
 --save-result --result-filename vllm_hopper.json

You can swap in larger models if your VRAM allows, or use tensor/pipeline parallel as shown in this Ori benchmark.
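
As a sketch of what a multi-GPU launch looks like, the command below shards a 70B-class model across four GPUs with tensor parallelism; the model name and the tensor-parallel degree are placeholders for whatever fits your node:

# Shard the model across 4 GPUs with tensor parallelism
vllm serve meta-llama/Llama-3.3-70B-Instruct \
 --tensor-parallel-size 4 \
 --host 0.0.0.0 --port 8000

The benchmark client commands above work unchanged against a server launched this way.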

Benchmark flags that matter

  • --dataset-name random: generates random prompts to stress throughput (see vLLM docs)
  • --random-input-len and --random-output-len: set input/output token lengths (e.g., 1k/1k or 1k/8k)
  • --ignore-eos: keeps decoding to the target length
  • --max-concurrency and --num-prompts: push batch sizes to map the Pareto curve (see the sweep sketch after this list)
  • --dtype bfloat16 on Blackwell: recommended to tap hardware acceleration (note)
  • --save-result and --result-filename: persist metrics JSON
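
To map the Pareto curve rather than a single operating point, sweep --max-concurrency and save one result file per level. A minimal sketch against the Hopper server from the section above (the concurrency levels and file names are illustrative):

# Sweep concurrency and keep one JSON result per level
for c in 32 64 128 256 512; do
 vllm bench serve \
  --host 0.0.0.0 --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --ignore-eos \
  --max-concurrency "$c" \
  --num-prompts $((c * 5)) \
  --save-result --result-filename "vllm_hopper_c${c}.json"
done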

How to run the test

  1. Start the server with your chosen model and flags.
  2. Let the benchmark client drive the load and record metrics.
  3. Collect throughput (req/s, tok/s), TTFT, TPOT, and percentiles in the saved JSON.

The output follows the format shown in vLLM’s sample with totals, TTFT, TPOT, and P99 values.
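
To pull the headline numbers out of a saved result file, a quick filter over the JSON is enough; the grep pattern simply matches the throughput, TTFT, and TPOT keys the benchmark writes, whatever their exact names in your vLLM version:

# Show throughput and latency fields from a saved run
python3 -m json.tool vllm_b200.json | grep -Ei 'throughput|ttft|tpot|duration|completed'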

Interpreting results (simple rules)

  • Throughput (tokens/s): higher is better when your app has many users.
  • TTFT (time to first token): matters for chat feel. Lower is better.
  • TPOT (time per output token): steady decoding speed. Lower is better.
  • Pareto curve: the best TPS you can get at a given latency. Move along the curve as your SLA changes.

Public data helps you sanity-check. The vLLM x NVIDIA post shows up to 4.3× (gpt-oss 120B) and 3.7× (Llama 3.3 70B) throughput gains on Blackwell vs Hopper at similar interactivity. A community run on a Blackwell card with a 120B model at 128K context shows scale effects too (Reddit sample). Your exact numbers will depend on model, precision (FP4/FP8/bfloat16), and load shape.

Cost per million tokens (quick calculator)

Use this simple formula to compare GPUs on your workload:

  • Cost per 1M tokens = (GPU $/hour ÷ tokens/second) × 1,000,000 ÷ 3600

Example: If your B200 run achieves 12,000 tokens/s and the GPU costs $40/hour:

  • Cost ≈ (40 ÷ 12000) × 1,000,000 ÷ 3600 ≈ $0.93 per 1M tokens

If an H100 run gives 4,000 tokens/s at $25/hour:

  • Cost ≈ (25 ÷ 4000) × 1,000,000 ÷ 3600 ≈ $1.74 per 1M tokens

These are sample numbers. Plug in your own TPS and cloud prices. The point: Blackwell’s higher TPS often offsets a higher hourly rate. For broader industry context, MLCommons’ MLPerf Inference shows how standard latency targets shape fair comparisons.
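
If you prefer a copy-paste version of that arithmetic, an awk one-liner does it; the price and tokens/s values below are the same illustrative numbers as above, so swap in your own:

# cost per 1M tokens = (price per hour / tokens per second) * 1e6 / 3600
awk -v price=40 -v tps=12000 'BEGIN { printf "$%.2f per 1M tokens\n", price / tps * 1e6 / 3600 }'
awk -v price=25 -v tps=4000 'BEGIN { printf "$%.2f per 1M tokens\n", price / tps * 1e6 / 3600 }'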

Tuning tips that move the needle

  • Increase concurrency until TPS flattens or latency breaks SLA. vLLM’s continuous batching helps pack work.
  • Prefer bfloat16 on Blackwell (recommendation), and test FP8/FP4 when models support it (see the FP8 sketch after this list). Watch quality.
  • Use vLLM features like PagedAttention and cache reuse. That’s core to vLLM’s high throughput.
  • Profile TTFT vs TPOT. If TTFT is high, check host overhead, graph fusions, and scheduling; NVIDIA notes recent async-scheduling overlap and runtime upgrades in vLLM.
  • Right-size context length. Very long contexts cut TPS. Consider smart truncation or routing.
  • Try parallelism for big models (tensor/pipeline). See the overview in Ori’s guide.
  • Framework trade-offs: Some MoE/precision combos may favor SGLang or TensorRT-LLM (see vLLM issue and framework comparison). Test your exact model.
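
As a sketch of the precision tip above: on supported models, vLLM can quantize weights to FP8 at load time. The model below is a placeholder, and you should spot-check output quality after any precision change:

# Serve with on-the-fly FP8 weight quantization (supported models only)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
 --quantization fp8 \
 --host 0.0.0.0 --port 8000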

Common pitfalls and fixes

  • “no kernel image is available” on Blackwell: Build vLLM from source inside the container to target sm_100 (walkthrough).
  • Driver/CUDA mismatch: Align host driver with container CUDA. Use the same base image across runs.
  • OOM at high concurrency: Lower batch, shorten context, or use quantized weights (FP8/FP4); see the example flags after this list. vLLM’s paging helps, but VRAM still matters. For large models, multi-GPU may beat older single-GPU setups (example).
  • Under-utilized GPU: Raise --max-concurrency, ensure CPU isn’t a bottleneck, pin threads, and monitor I/O. Watch TTFT.
  • Apples-to-apples: Keep the same precision, context length, and SLA targets across GPUs (MLPerf rules are a good template).
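
For the OOM case, the usual first knobs on the server side look like this; the values are illustrative, so tune them to your VRAM and traffic:

# Cap context length and reduce the fraction of VRAM vLLM reserves
vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 \
 --max-model-len 8192 \
 --gpu-memory-utilization 0.85 \
 --host 0.0.0.0 --port 8000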

Blackwell vs Hopper: when to choose which

  • Pick Blackwell if you need max throughput per node, lower $/M tokens at scale, or long-context + high concurrency. Public data shows 3–4× TPS gains at similar latency in many cases (source).
  • Stick with Hopper if you already own it and your SLA is met. You can still improve with better batching, quantization, and parallelism.
  • Edge cases: For single-stream, ultra-low latency, compare vLLM with SGLang and TensorRT-LLM. Some runs show TensorRT-LLM winning on B200.

Optional: energy and sustainability checks

If energy use matters, test under load and record energy per token. A recent study shows how to design realistic, vLLM-based efficiency tests (paper).

How to: run your own Blackwell vs Hopper vLLM benchmark

  1. Pick a model and precision: Start with Llama 3.3 70B FP4 on Blackwell (recipe here).
  2. Match environments: Same base image, drivers, dataset shape, and flags.
  3. Run on Hopper: Use pip-installed vLLM. Save results (JSON).
  4. Run on Blackwell: Build vLLM from source if needed. Use --dtype bfloat16. Save results.
  5. Compare: Plot TPS vs TTFT/TPOT. Read off the best point for your SLA.
  6. Compute cost: Use the formula above with your cloud rates.

FAQ

What is vLLM?

vLLM is a high-throughput LLM serving engine with features like PagedAttention and continuous batching. It’s used in public benchmarks like InferenceMAX.

What do TTFT and TPOT mean?

TTFT is how fast the first token shows up. TPOT is the average time per token after that. Low TTFT feels snappy; low TPOT means fast generation.

Do I need Blackwell to use vLLM?

No. vLLM runs on Hopper and earlier generations. Blackwell just gives a big speed boost for many workloads, as recent benchmarks show.

I see “no kernel image is available.” What now?

On new Blackwell cards, compile vLLM from source inside the container. This targets the new architecture (guide).

Is TensorRT-LLM faster than vLLM on B200?

Sometimes, depending on model and precision. See this comparison. Always test your own setup.

Bottom line

If you’re serving many users, Blackwell + vLLM is a strong bet. It boosts tokens/second, keeps latency in check, and can drop cost per million tokens. Use the steps here, plug in your prices, and choose the GPU that hits your SLA at the best price.

Tags: vLLM, NVIDIA Blackwell, Hopper, LLM inference
