Magistral Small 1.2: A Hands-On Guide
Run Magistral Small 1.2 locally: download GGUF, run with vllm or llama.cpp, and try a text+image prompt in ~15 minutes.

Quick answer
What changed: Magistral Small 1.2 is a 24B-parameter, Apache 2.0 reasoning model with a vision encoder. Result: you can run text+image prompts locally on an RTX 4090 or a 32GB MacBook. Follow the steps below.
What this guide does
Short and useful: you'll download GGUF weights, start a local server, and run a text+image prompt. No fluff. Target time: about 15 minutes if you meet the prerequisites.
What's new in 1.2
- Vision encoder: handles images and text together. Read the overview on Simon Willison's blog.
- Better reasoning: ~+15% on math and coding benchmarks vs 1.1; see reporting at Dev.ua and VentureBeat.
- Open license: Apache 2.0 on Hugging Face.
Before you start (prerequisites)
- Hardware: RTX 4090 GPU or a MacBook with 32GB RAM (quantized weights).
- Disk: ~100 GB free for model files and cache (see the quick check after this list).
- Software: Python 3.10+, pip, vllm or llama.cpp, and mistral-common if using GGUF servers.
- Accounts: Hugging Face account recommended for downloads.
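If you want to confirm the disk prerequisite before downloading, here is a minimal Python check. The ~100 GB threshold comes from the list above; the path "." is an assumption, so point it at whatever drive will hold the weights:

import shutil

# Free space on the drive where the weights and cache will live
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk: {free_gb:.0f} GB")
if free_gb < 100:
    print("Warning: less than ~100 GB free; the model files and cache may not fit.")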
1 Download the GGUF weights
Grab the GGUF quantized file from Hugging Face. Example using the CLI:
pip install "huggingface_hub[cli]"
huggingface-cli download "mistralai/Magistral-Small-2509-GGUF" --include "Magistral-Small-2509-Q4_K_M.gguf" --local-dir "Magistral-Small-2509-GGUF/"
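If you'd rather script the download than use the CLI, huggingface_hub's hf_hub_download does the same thing from Python. The repo and filename below simply mirror the CLI example above; quant filenames can change between releases, so double-check the model page:

from huggingface_hub import hf_hub_download

# Fetch the Q4_K_M quant into a local folder (same repo/file as the CLI example)
path = hf_hub_download(
    repo_id="mistralai/Magistral-Small-2509-GGUF",
    filename="Magistral-Small-2509-Q4_K_M.gguf",
    local_dir="Magistral-Small-2509-GGUF",
)
print("Saved to:", path)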
2 Run with llama.cpp (fast local option)
Good for quantized runs on a single GPU or CPU. Start a local server like this:
llama-server -m Magistral-Small-2509-Q4_K_M.gguf -c 0
This uses the GGUF file you downloaded. See the GGUF notes on Hugging Face for details and compat tips.
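llama-server exposes an OpenAI-compatible API (on port 8080 by default), so you can smoke-test it from Python with the openai client. A minimal sketch, assuming you've pip-installed openai and kept the default port:

from openai import OpenAI

# No real API key is needed for a local server
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Magistral-Small-2509-Q4_K_M.gguf",  # typically ignored when only one model is loaded
    messages=[{"role": "user", "content": "In one sentence, what is a GGUF file?"}],
)
print(resp.choices[0].message.content)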
3 Run with vllm for multimodal use
If you need vision + reasoning or the Mistral reasoning parser, use vllm. Example start command (from Mistral docs):
vllm serve mistralai/Magistral-Small-2509 \
  --reasoning-parser mistral \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":10}' \
  --tensor-parallel-size 2
This enables built-in tool use and lets the model accept image attachments in prompts. See Mistral docs for flags and limits.
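To check the server from Python, point the OpenAI client at port 8000. The reasoning_content field below is how vLLM's reasoning parsers usually surface the thinking trace when --reasoning-parser is enabled; treat that as an assumption and fall back to content if it isn't present:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{"role": "user", "content": "What is 17 * 24? Show your reasoning."}],
)
msg = resp.choices[0].message
# With a reasoning parser enabled, the trace is usually returned separately from the final answer
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)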
Quick example: text + image prompt
Basic curl example that posts a text prompt plus an image URL to a running vllm server, via its OpenAI-compatible chat completions endpoint:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Magistral-Small-2509","messages":[{"role":"user","content":[{"type":"text","text":"Describe the image and list 3 facts."},{"type":"image_url","image_url":{"url":"https://example.com/photo.jpg"}}]}]}'
Tip: use an image URL the server can reach. If you run into errors, check server logs and the model page for format guidance.
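The same text+image request from Python, using the OpenAI client against the vllm server (the image URL is a placeholder; swap in one the server can actually fetch):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image and list 3 facts."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)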
Example: ask the model to use a tool
Start vllm with --enable-auto-tool-choice (and --tool-call-parser mistral, as in the command above). Then send a prompt that asks for a web lookup or code run; the model will choose the tool when appropriate. This improves workflows like code debugging or research lookups.
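Here is a sketch of declaring a tool through the standard OpenAI tools parameter. The web_search function is hypothetical: the server only returns the model's decision to call it, and you are responsible for executing the tool and sending the result back:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool; you implement and run it yourself
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]
resp = client.chat.completions.create(
    model="mistralai/Magistral-Small-2509",
    messages=[{"role": "user", "content": "Find the latest Magistral Small release notes."}],
    tools=tools,
)
# If the model chose the tool, the call (name + JSON arguments) appears here instead of plain text
print(resp.choices[0].message.tool_calls)

If tool_calls comes back populated, run the tool, append a role "tool" message with the result, and call the API again to get the final answer.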
Troubleshooting & tips
- Vision missing: some GGUF exports omit vision support. If images fail, try the official Hugging Face repo or the vllm load format; see the notes in the GGUF repo on Hugging Face.
- Low VRAM: Use Q4 quantized weights or increase tensor parallel size.
- Slow responses: Lower context length, reduce batch size, or use fewer threads.
- Chain-of-thought: Magistral is designed to produce transparent reasoning traces for verifiable answers; read background at VentureBeat.
FAQ
- Can I run vision on a MacBook? Yes, with quantized weights and enough RAM, but expect trade-offs in speed.
- Are the weights open? Yes. Magistral Small 1.2 is released under Apache 2.0 on Hugging Face.
- Which model supports long context (128k)? Newer Mistral releases list 128k context support; check Mistral docs for exact versions and limits.
Where to read more
- Apidog announcement
- Simon Willison's write-up
- Hugging Face model page
- Mistral docs
- VentureBeat coverage
That's it. Follow the steps above and you'll be running a text+image prompt locally in about 15 minutes. If you hit a blocker, check the model repo and open issues for the latest fixes.