
Deploy Magistral Small: A vLLM Tutorial

Run Magistral Small locally with vLLM: a step-by-step tutorial, copy-paste Python script, RTX 4090 benchmark, and troubleshooting tips.


TL;DR: Get Magistral Small running with vLLM in about 15 minutes. This guide gives a checklist, a copy-paste Python script, a small benchmark for an RTX 4090 and a 32GB MacBook, plus common fixes.

What this guide delivers

  • Prerequisites for hardware and software.
  • Step-by-step commands and a runnable Python script (copy-paste).
  • Practical sampling settings and a tiny reasoning demo.
  • Measured performance and troubleshooting tips.

Who this is for

This is for developers and researchers who want to run Magistral Small locally using vLLM. You should be comfortable with the command line and basic Python. If you prefer a higher-level overview, see the Mistral announcement and the Magistral paper.

What you need (prerequisites)

  • Hardware: an RTX 4090 (recommended) or a machine with 32GB RAM for quantized runs.
  • OS: Linux is simplest; macOS with 32GB can run a quantized version but is slower.
  • Python 3.10+ and pip.
  • Disk: ~20 GB free for model files.
  • Accounts: optional Hugging Face access if you pull from the Hub (Hugging Face model page).

Recommended sampling params

Use these to match the model defaults:

  • top_p: 0.95
  • temperature: 0.7
  • max_tokens: 40960 (or lower for short tests)
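These map one-to-one onto vLLM's SamplingParams; a minimal sketch (drop max_tokens to something small for quick smoke tests):

from vllm import SamplingParams

# model-card defaults; lower max_tokens for short tests
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=40960)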

Step-by-step: install, download, run

1) Install system packages and Python libs

Run:

pip install -U vllm --pre --extra-index-url https://pypi.org/simple
pip install huggingface_hub openai

Installing vllm will usually pull mistral_common automatically. See the model card for details.
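A quick sanity check that the install worked (a minimal check; both packages should import without errors):

# verify the install: vllm and the tokenizer package it pulls in
import vllm
import mistral_common

print("vLLM version:", vllm.__version__)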

2) Download the model from Hugging Face

We use huggingface_hub to snapshot the repo. Example repo: mistralai/Magistral-Small-2506.
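If you prefer to fetch the weights outside Python, the huggingface_hub CLI can do the same snapshot (the local path here is just the one the script below expects):

huggingface-cli download mistralai/Magistral-Small-2506 --local-dir models/magistral-small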

3) Run the simple Python script (copy-paste)

Save this as run_magistral_vllm.py and run it. It downloads the model (if missing) and runs a short reasoning prompt using vLLM. The code is minimal so you can expand it into a server or a CLI.

#!/usr/bin/env python3
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

# 1) Repo and local path
repo_id = "mistralai/Magistral-Small-2506"
local_dir = "models/magistral-small"

# 2) Download model files if they are not present
snapshot_download(repo_id, local_dir=local_dir)

# 3) Load model with vLLM
# Note: if loading fails, check the model card -- some Magistral repos ship
# Mistral-format weights that need extra LLM() arguments (e.g. tokenizer_mode).
llm = LLM(model=local_dir)

# 4) Prompt: small reasoning example
prompt = (
    "You are a helpful assistant. Show steps and answer:\n"
    "If a train leaves at 9:00 and arrives at 12:00, how many hours traveled?"
)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# 5) Generate and print the completion
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)

# Note: this is a basic demo. For production, run vLLM as a server or use an OpenAI-compatible API layer.

4) Optional: OpenAI-compatible queries

You can run an OpenAI-compatible front end around vLLM. The example below shows how to point an OpenAI client at a local vLLM server (adjust the base URL and model name to match your server; the launch sketch after this step shows one way to start it):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="models/magistral-small",  # must match the model name the server advertises
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

This pattern is shown in some community examples and in model cards such as unsloth/Magistral-Small-2509-GGUF where people adapt the OpenAI API to talk to local servers.
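To have a server for that client to hit, vLLM ships an OpenAI-compatible server entry point; a minimal launch sketch (the port is arbitrary, and the model card may recommend extra Mistral-specific flags):

vllm serve models/magistral-small --port 8000

Once it is up, the client snippet above should work against http://localhost:8000/v1.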

Small benchmark (approximate)

These are representative checks you can reproduce on your own hardware after setup. Numbers vary by quantization and by how much of the work falls back to the CPU.

Machine                    Quant               VRAM / RAM    Tokens/sec (estimate)
RTX 4090                   8-bit GGUF          ~24 GB VRAM   ~700 t/s
32GB MacBook (quantized)   8-bit, CPU offload  32 GB RAM     ~50-100 t/s

Why these matter: the 4090 gives interactive speeds for dev work. A 32GB MacBook can run the model for small batches or experimentation when quantized.
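A rough way to get your own tokens/sec number once the model loads (a minimal sketch; the prompt and max_tokens are arbitrary, and a warm-up run gives steadier numbers):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="models/magistral-small")
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Explain, step by step, why the sky is blue."], params)
elapsed = time.perf_counter() - start

# count generated tokens across all requests and divide by wall-clock time
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")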

Troubleshooting: common issues and fixes

  • Model fails to load / OOM: Try a quantized build and use your runtime's 8-bit loading option where it is supported. On a 24GB card such as the 4090, the unquantized weights will not fit, so use an 8-bit build or cap vLLM's memory use (see the sketch after this list).
  • Slow tokens/sec: Check CPU/GPU contention. If you run on CPU, expect much lower throughput.
  • Permission / Hugging Face pull errors: Confirm you can access the repo and have a valid token if the repo is gated.
  • Different model names: The official repo names vary; check Mistral AI model weights and the model card.
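For the OOM case, vLLM also exposes a couple of knobs on the LLM constructor that often help; a minimal sketch (the values are illustrative, tune them for your card):

from vllm import LLM

llm = LLM(
    model="models/magistral-small",
    max_model_len=8192,           # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)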

Validation: a tiny reasoning test

Try this prompt to check reasoning quality:

"If Alice had 3 apples and gave 1 to Bob, then bought 2 more, how many does Alice have? Show steps."

A correct output shows step-by-step logic and a final number. Magistral Small is tuned for transparent reasoning, so look for short steps and a clear answer.
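A quick way to script that check, reusing the same load-and-generate pattern as the script above (a minimal sketch; the assert just looks for the expected final count in the text):

from vllm import LLM, SamplingParams

llm = LLM(model="models/magistral-small")
prompt = (
    "If Alice had 3 apples and gave 1 to Bob, then bought 2 more, "
    "how many does Alice have? Show steps."
)
out = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256))
text = out[0].outputs[0].text
print(text)
assert "4" in text, "expected the answer 4 somewhere in the output"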

Where to learn more and next steps

Quick checklist before you run

  1. GPU with sufficient VRAM or a 32GB machine for quantized runs.
  2. Python 3.10+, pip packages installed (vllm, huggingface_hub).
  3. Enough disk space for the model.
  4. Use the recommended sampling params: top_p=0.95, temp=0.7.

Parting tips

If you're experimenting, start with small prompts and short max_tokens. Measure tokens/sec and VRAM with simple runs before you build a production pipeline. For deeper reading, see the Mistral announcement and the model weights docs at Mistral docs.

Magistral Small is open-source under Apache 2.0, which lets you run and modify it for research and apps. See the Hugging Face model page for licensing details.

Happy experimenting. Scale your prompts up gradually, measure, and tweak quantization until you hit a balance of speed and accuracy.

Tags: Magistral, vLLM, Mistral AI
