Run Qwen3-Next Locally with Llama.cpp
Run Qwen3-Next locally with llama.cpp: step-by-step install, VRAM guide, and copy-paste commands for NVIDIA, Apple Silicon, and CPU-only setups.

Quick answer: Yes — recent builds of llama.cpp and the Qwen docs offer paths to run Qwen3-Next locally using GGUF quant files. You need the right build, a compatible GGUF, and enough VRAM. Below are step-by-step commands, a VRAM guide, and troubleshooting tips so you can get started fast.
Why this is different
Qwen3-Next uses a hybrid design with gated DeltaNet linear attention and a high-sparsity Mixture of Experts (MoE) layer. That means runtimes needed to add new kernels and conversion tools. Community threads like the Hacker News discussion and the llama.cpp issue show work is in progress, but practical builds already let hobbyists run smaller Qwen3-Next variants.
What you’ll need
- Hardware: NVIDIA GPU with 12+ GB VRAM for small quantized variants, 24+ GB for larger models, or Apple Silicon with enough unified memory to hold the model (total system RAM is the limit there).
- OS & tools: Linux or macOS; git, CMake, a C++ toolchain, and (optional) Hugging Face / Qwen model links. Example install commands follow this list.
- Software: a recent llama.cpp build that includes Qwen3 support, or Ollama/LMStudio if you prefer a GUI path.
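A minimal prerequisite install, assuming Ubuntu/Debian (apt) or macOS (Homebrew); package names may differ on other distros, and NVIDIA users also need the CUDA toolkit for a GPU build:
# Ubuntu / Debian
sudo apt-get update && sudo apt-get install -y git cmake build-essential
# macOS (Homebrew)
brew install git cmake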
VRAM estimates by model & quant
These are rough estimates from community testing; exact needs vary with sequence length, quantization, and offload flags. A back-of-envelope formula follows the table.
| Model | Typical GGUF quant | Approx. VRAM (NVIDIA) |
|---|---|---|
| Qwen3-30B-A3B | Q8_0 / Q4_K_M | 12–16 GB |
| Qwen3-80B-A3B | Q8_0 / Q4_K_M (heavy) | 32+ GB (split/offload recommended) |
| Qwen3-Next small test builds | Q8_0 | 8–12 GB |
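As a rough sanity check (an approximation, not a figure from the Qwen docs): weight memory ≈ parameter count × bits per weight ÷ 8, plus headroom for KV cache and runtime overhead; with split/offload only part of that total has to sit in VRAM.
# 30B parameters at ~4.5 bits/weight (roughly Q4_K_M): weights alone
echo "30 * 4.5 / 8" | bc -l   # ≈ 16.9 GB
# 80B parameters at 8 bits/weight (Q8_0): weights alone
echo "80 * 8 / 8" | bc -l     # = 80 GB, hence the split/offload recommendation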
Step 1 — Build llama.cpp (fast)
Clone the repo and build it with CMake. Use a recent release (or the validated PR branch) that mentions Qwen3 support; the Qwen docs link shows example commands and flags.
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
Tip: if you see build errors, update your compiler or CMake, or rebuild with fewer parallel jobs (for example -j 4).
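To use a GPU, enable the matching backend at configure time; the flag below is the current llama.cpp CMake option for NVIDIA, while Metal is enabled by default on Apple Silicon:
# NVIDIA: configure with the CUDA backend (requires the CUDA toolkit), then rebuild
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# Apple Silicon: no extra flag needed; the default macOS build includes Metal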
Step 2 — Get a Qwen3-Next GGUF
Download an official or community-provided GGUF from Hugging Face or the Qwen docs. The Qwen docs explain using GGUF and include example llama.cpp commands: Qwen llama.cpp guide. If you must convert, use validated conversion scripts from the model repo.
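One way to fetch a quant is the Hugging Face CLI; the repo name and include pattern below are illustrative, so substitute the GGUF repo and quant you actually want:
pip install -U huggingface_hub
# grab a single quant from an official GGUF repo (repo/pattern are examples)
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF --include "*Q8_0*" --local-dir ./models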
Step 3 — Run llama-cli (copy-paste)
Start with a small model or quant to test. Replace MODEL_PATH with your GGUF file.
# basic run with a local GGUF file
./build/bin/llama-cli -m MODEL_PATH --threads 8 --ctx-size 2048 --temp 0.7
# or pull an official quant straight from Hugging Face (needs a build with -hf support)
./build/bin/llama-cli -hf Qwen/Qwen3-30B-A3B-GGUF:Q8_0 --ctx-size 2048 --temp 0.6 --top-p 0.95
For server mode, see the Qwen docs example for llama-server: llama-server example.
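A minimal llama-server sketch (default flags assumed); the server exposes an OpenAI-compatible endpoint you can exercise with curl:
# serve a local GGUF on port 8080
./build/bin/llama-server -m MODEL_PATH --host 127.0.0.1 --port 8080 -c 2048
# from another terminal: OpenAI-compatible chat completion request
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello from llama-server"}]}'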
Apple Silicon and CPU-only users
- Apple M1/M2: build on macOS and prefer smaller quant files. There is no separate VRAM; the model lives in unified memory, so total system RAM is the limit.
- CPU-only: expect slow generation. Use low-memory quants and short context lengths. Example runs for both cases follow this list.
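Sketch commands for both cases; -ngl sets how many layers are offloaded to the GPU (Metal on Apple Silicon), and -ngl 0 forces a pure CPU run:
# Apple Silicon: offload all layers to Metal (included in the default macOS build)
./build/bin/llama-cli -m MODEL_PATH -ngl 99 --ctx-size 2048 --temp 0.7
# CPU-only: no GPU offload, modest thread count, short context
./build/bin/llama-cli -m MODEL_PATH -ngl 0 --threads 8 --ctx-size 1024 --temp 0.7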
Using Ollama or LMStudio
If you prefer a higher-level tool, Ollama and LMStudio already provide workflows for Qwen3 models. The Qwen repo and community posts note that Ollama user flows work for some Qwen3 builds — try Qwen3 repo or the community guides.
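If you go the Ollama route, the flow is a short pull-and-run, assuming a Qwen3 tag is published in the Ollama library (check the library page for the exact tag name):
# pull and run a Qwen3 build from the Ollama library (tag is an assumption; verify it exists)
ollama pull qwen3:30b
ollama run qwen3:30b "Give me a one-paragraph summary of Qwen3-Next."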
Troubleshooting
- Model not loading: confirm GGUF file integrity and that your llama.cpp build includes the Qwen3 PR (quick checks after this list). See the community discussion at issue #15940.
- Crashes or GPU backend (CUDA/OpenCL) errors: these often indicate missing GPU kernels for the linear attention. Try CPU fallback or a smaller quant until kernels land in your build.
- Out of memory: reduce context length, use a lower-precision quant, or offload layers if your runtime supports it.
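Quick checks covering the first two items above; the reference checksum comes from the model's Hugging Face page, and the version output is handy when filing issues:
# verify the GGUF wasn't truncated or corrupted during download
sha256sum MODEL_PATH            # compare against the checksum on the model page
# confirm which llama.cpp build/commit you are running
./build/bin/llama-cli --version
git log -1 --oneline            # run from inside the llama.cpp checkout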
FAQ
Will llama.cpp support full Qwen3-Next features? The community is actively adding support; see the PR and discussions. For production, vLLM and frameworks mentioned in the Red Hat guide offer optimized paths: vLLM guide.
Where to follow updates? Watch the llama.cpp issue, the Qwen docs, and community threads like r/LocalLLaMA for progress and small-test validations.
Quick tip
We recommend starting with a small Qwen3-Next test model and a Q8 quant. Confirm the run, then scale up, and keep a copy of the exact command you used so you can reproduce the result or file a precise bug report (one-liner below).
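A low-tech way to capture a log alongside your run (tee mirrors output to a file; pair it with the exact command from your shell history):
# save the run output for bug reports; note the exact command from your history alongside it
./build/bin/llama-cli -m MODEL_PATH --ctx-size 2048 --temp 0.7 2>&1 | tee run-$(date +%F).log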
We’ve got your back — pop an issue in the llama.cpp repo if you hit an odd failure and include the model name, exact command, and a tiny log. Happy testing!