Magistral Small 1.2: A Hands-On Guide
Run Magistral Small 1.2 locally: download GGUF, run with vllm or llama.cpp, and try a text+image prompt in ~15 minutes.

Quick answer
What changed: Magistral Small 1.2 is a 24B-parameter, Apache 2.0 reasoning model with a vision encoder. Result: you can run text+image prompts locally on an RTX 4090 or a 32GB MacBook. Follow the steps below.
What this guide does
Short and useful: you'll download GGUF weights, start a local server, and run a text+image prompt. No fluff. Target time: about 15 minutes if you meet the prerequisites.
What's new in 1.2
- Vision encoder: handles images and text together. Read the overview on Simon Willison's blog.
- Better reasoning: ~+15% on math and coding benchmarks vs 1.1; see reporting at Dev.ua and VentureBeat.
- Open license: Apache 2.0 on Hugging Face.
Before you start (prerequisites)
- Hardware: RTX 4090 GPU or a MacBook with 32GB RAM (quantized weights).
- Disk: ~100 GB free for model files and cache (see the quick check after this list).
- Software: Python 3.10+, pip, vllm or llama.cpp, and mistral-common if using GGUF servers.
- Accounts: Hugging Face account recommended for downloads.
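If you want to confirm the disk prerequisite before downloading, here is a minimal Python check. The ~100 GB threshold comes from the list above; the path "." is an assumption, so point it at whatever drive will hold the weights:

import shutil

# Free space on the drive where the weights and cache will live
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk: {free_gb:.0f} GB")
if free_gb < 100:
    print("Warning: less than ~100 GB free; the model files and cache may not fit.")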
1 Download the GGUF weights
Grab the GGUF quantized file from Hugging Face. Example using the CLI:
pip install "huggingface_hub[cli]"
huggingface-cli download "mistralai/Magistral-Small-2509-GGUF" --include "Magistral-Small-2509-Q4_K_M.gguf" --local-dir "Magistral-Small-2509-GGUF/"
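If you'd rather script the download than use the CLI, huggingface_hub's hf_hub_download does the same thing from Python. The repo and filename below simply mirror the CLI example above; quant filenames can change between releases, so double-check the model page:

from huggingface_hub import hf_hub_download

# Fetch the Q4_K_M quant into a local folder (same repo/file as the CLI example)
path = hf_hub_download(
    repo_id="mistralai/Magistral-Small-2509-GGUF",
    filename="Magistral-Small-2509-Q4_K_M.gguf",
    local_dir="Magistral-Small-2509-GGUF",
)
print("Saved to:", path)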
2 Run with llama.cpp (fast local option)
Good for quantized runs on a single GPU or CPU. Start a local server like this:
llama-server -m Magistral-Small-2509-Q4_K_M.gguf -c 0
This uses the GGUF file you downloaded. See the GGUF notes on Hugging Face for details and compat tips.
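llama-server exposes an OpenAI-compatible API (on port 8080 by default), so you can smoke-test it from Python with the openai client. A minimal sketch, assuming you've pip-installed openai and kept the default port:

from openai import OpenAI

# No real API key is needed for a local server
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Magistral-Small-2509-Q4_K_M.gguf",  # typically ignored when only one model is loaded
    messages=[{"role": "user", "content": "In one sentence, what is a GGUF file?"}],
)
print(resp.choices[0].message.content)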
3 Run with vllm for multimodal use
If you need vision + reasoning or the Mistral reasoning parser, use vllm. Example start command (from Mistral docs):
vllm serve mistralai/Magistral-Small-2509 \
  --reasoning-parser mistral \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":10}' \
  --tensor-parallel-size 2
This enables built-in tool use and lets the model accept image attachments in prompts. See Mistral docs for flags and limits.
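To check the server from Python, point the OpenAI client at port 8000. The reasoning_content field below is how vLLM's reasoning parsers usually surface the thinking trace when --reasoning-parser is enabled; treat that as an assumption and fall back to content if it isn't present:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{"role": "user", "content": "What is 17 * 24? Show your reasoning."}],
)
msg = resp.choices[0].message
# With a reasoning parser enabled, the trace is usually returned separately from the final answer
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)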
Quick example: text + image prompt
Basic curl example that posts a text prompt plus an image URL to a running vllm server, via its OpenAI-compatible chat completions endpoint:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Magistral-Small-2509","messages":[{"role":"user","content":[{"type":"text","text":"Describe the image and list 3 facts."},{"type":"image_url","image_url":{"url":"https://example.com/photo.jpg"}}]}]}'
Tip: use an image URL the server can reach. If you run into errors, check server logs and the model page for format guidance.
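The same text+image request from Python, using the OpenAI client against the vllm server (the image URL is a placeholder; swap in one the server can actually fetch):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image and list 3 facts."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)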
Example: ask the model to use a tool
Start vllm with --enable-auto-tool-choice (and --tool-call-parser mistral, as in the command above). Then send a prompt that asks for a web lookup or code run; the model will choose the tool when appropriate. This improves workflows like code debugging or research lookups.
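Here is a sketch of declaring a tool through the standard OpenAI tools parameter. The web_search function is hypothetical: the server only returns the model's decision to call it, and you are responsible for executing the tool and sending the result back:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool; you implement and run it yourself
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{"role": "user", "content": "Find the latest Magistral Small release notes."}],
    tools=tools,
)
# If the model chose the tool, the call (name + JSON arguments) appears here instead of plain text
print(resp.choices[0].message.tool_calls)

If tool_calls comes back populated, run the tool, append a role "tool" message with the result, and call the API again to get the final answer.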
Troubleshooting & tips
- Vision missing: some GGUF exports omit vision support. If images fail, try the official Hugging Face repo or the vllm load format; see the notes in the GGUF repo on Hugging Face.
- Low VRAM: Use Q4 quantized weights or increase tensor parallel size.
- Slow responses: Lower context length, reduce batch size, or use fewer threads.
- Chain-of-thought: Magistral is designed to produce transparent reasoning traces for verifiable answers; read background at VentureBeat.
FAQ
- Can I run vision on a MacBook? Yes, with quantized weights and enough RAM, but expect trade-offs in speed.
- Are the weights open? Yes. Magistral Small 1.2 is released under Apache 2.0 on Hugging Face.
- Which model supports long context (128k)? Newer Mistral releases list 128k context support; check Mistral docs for exact versions and limits.
Where to read more
- Apidog announcement
- Simon Willison's write-up
- Hugging Face model page
- Mistral docs
- VentureBeat coverage
That's it. Follow the steps above and you'll be running a text+image prompt locally in about 15 minutes. If you hit a blocker, check the model repo and open issues for the latest fixes.