
Magistral Small 1.2: A Hands-On Guide

Run Magistral Small 1.2 locally: download GGUF, run with vllm or llama.cpp, and try a text+image prompt in ~15 minutes.

Quick answer

What changed: Magistral Small 1.2 is a 24B-parameter, Apache 2.0 reasoning model with a vision encoder. Result: you can run text+image prompts locally on an RTX 4090 or a 32GB MacBook. Follow the steps below.

What this guide does

Short and useful: you'll download the GGUF weights, start a local server, and run a text+image prompt. No fluff. Target time: about 15 minutes if you meet the prerequisites.

What's new in 1.2

The headline change is a vision encoder on top of the 24B reasoning model, so you can send text+image prompts. The weights remain Apache 2.0, and the model still produces the transparent reasoning traces Magistral is designed for.

Before you start (prerequisites)

  • Hardware: RTX 4090 GPU or a MacBook with 32GB RAM (quantized weights).
  • Disk: ~100 GB free for model files and cache.
  • Software: Python 3.10+, pip, vllm or llama.cpp, and mistral-common if you use GGUF servers (install sketch after this list).
  • Accounts: Hugging Face account recommended for downloads.
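
If you're setting up from scratch, here's a minimal install sketch; it assumes a fresh Python 3.10+ environment, and the llama.cpp step varies by platform:

pip install -U vllm mistral-common
brew install llama.cpp   # macOS; on Linux, build llama.cpp from source or grab a release binary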

1 Download the GGUF weights

Grab the quantized GGUF file from Hugging Face. Example using the CLI:

pip install "huggingface_hub[cli]"
huggingface-cli download "mistralai/Magistral-Small-2509-GGUF" --include "Magistral-Small-2509-Q4_K_M.gguf" --local-dir "Magistral-Small-2509-GGUF/"

2 Run with llama.cpp (fast local option)

Good for quantized runs on a single GPU or CPU. Start a local server like this:

llama-server -m Magistral-Small-2509-Q4_K_M.gguf -c 0

This uses the GGUF file you downloaded. See the GGUF notes on Hugging Face for details and compatibility tips.
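
Once the server is up, a quick smoke test against its OpenAI-compatible chat endpoint (assuming the default host and port, localhost:8080):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"In two sentences, what is GGUF quantization?"}]}'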

3 Run with vllm for multimodal use

If you need vision + reasoning or the Mistral reasoning parser, use vllm. Example start command (from Mistral docs):

vllm serve mistralai/Magistral-Small-2509 --reasoning-parser mistral --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --limit-mm-per-prompt '{"image":10}' --tensor-parallel-size 2

This enables built-in tool use and lets the model accept image attachments in prompts. See Mistral docs for flags and limits.
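
Before sending real prompts, a quick check that the server is up and the model is registered (assuming the default port 8000):

curl http://localhost:8000/v1/models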

Quick example: text + image prompt

Basic curl example that posts a prompt and an image URL to a running vllm server via its OpenAI-compatible chat endpoint:

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"mistralai/Magistral-Small-2509","messages":[{"role":"user","content":[{"type":"text","text":"Describe the image and list 3 facts."},{"type":"image_url","image_url":{"url":"https://example.com/photo.jpg"}}]}]}'

Tip: use an image URL the server can reach. If you run into errors, check server logs and the model page for format guidance.
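
If the image lives on your machine instead of at a URL, one option is to embed it as a base64 data URL; a sketch assuming a local photo.jpg and GNU base64 (on macOS, use base64 -i photo.jpg instead):

IMG=$(base64 -w0 photo.jpg)
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"mistralai/Magistral-Small-2509\",\"messages\":[{\"role\":\"user\",\"content\":[{\"type\":\"text\",\"text\":\"Describe the image and list 3 facts.\"},{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,$IMG\"}}]}]}"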

Example: ask the model to use a tool

Start vllm with --enable-auto-tool-choice and --tool-call-parser mistral (as in the command above), declare the tools you want available in the request, and send a prompt that needs one of them, such as a web lookup or a code run. The model will call a tool when appropriate, which helps workflows like code debugging or research lookups.
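
A sketch of such a request against the vllm OpenAI-compatible endpoint; the get_weather tool here is a made-up example, so swap in your own function schema:

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"mistralai/Magistral-Small-2509","messages":[{"role":"user","content":"What is the weather in Paris right now?"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Look up current weather for a city","parameters":{"type":"object","properties":{"city":{"type":"string","description":"City name"}},"required":["city"]}}}],"tool_choice":"auto"}'

If the model decides a tool is needed, the response carries a tool_calls entry with the function name and arguments; your code runs the tool and sends the result back as a follow-up message.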

Troubleshooting & tips

  • Vision missing: Some GGUF exports omit vision support. If images fail, try the official Hugging Face repo or the vllm load format. See the notes in the GGUF repo on Hugging Face.
  • Low VRAM: Use Q4 quantized weights, or spread the model across more GPUs with a higher --tensor-parallel-size.
  • Slow responses: Lower context length, reduce batch size, or use fewer threads.
  • Chain-of-thought: Magistral is designed to produce transparent reasoning traces for verifiable answers; read background at VentureBeat.

FAQ

  • Can I run vision on a MacBook? Yes, with quantized weights and enough RAM, but expect trade-offs in speed.
  • Are the weights open? Yes. Magistral Small 1.2 is released under Apache 2.0 on Hugging Face.
  • Which model supports long context (128k)? Newer Mistral releases list 128k context support; check Mistral docs for exact versions and limits.

Where to read more

The Hugging Face model page, the GGUF repo notes, and the Mistral docs mentioned above cover flags, limits, and compatibility in more depth.

That's it. Follow the steps above and you'll be running a text+image prompt locally in about 15 minutes. If you hit a blocker, check the model repo and its open issues for the latest fixes.

Magistral · Mistral AI · Local LLM
