llama.cpp GPU Acceleration: The Complete Guide
Step-by-step guide to build and run llama.cpp with GPU backends (CUDA, HIP, Metal, OpenCL, Vulkan) plus troubleshooting and key flags.

Short answer
If you want faster local LLM inference with llama.cpp, you must build it with a GPU backend that matches your hardware (CUDA for Nvidia, HIP for AMD, Metal for Apple, OpenCL for Adreno, Vulkan/SYCL for others). Once built, use the --n-gpu-layers, --main-gpu, and --split-mode flags to offload work. Follow the repo build notes and the vendor docs linked below.
GPU compatibility matrix
| GPU | Recommended backend | Notes & links |
|---|---|---|
| Nvidia | CUDA (+ cuBLAS) | Best performance on Linux/Windows. See the llama.cpp repo. |
| AMD | HIP | Good support via HIP/ROCm. See llama.cpp and community reports. |
| Apple Silicon (M1/M2) | Metal | The native Metal backend gives excellent performance on macOS. |
| Qualcomm Adreno (Android) | OpenCL (new), Vulkan (partial) | Use the new OpenCL backend optimized for Adreno: Qualcomm blog. Vulkan can work on some devices; see community threads: issue #8705 and the discussion about Vulkan performance. |
| Other (Intel, integrated) | SYCL / Vulkan / BLAS | Options vary; check the llama.cpp docs. |
Quick checklist (what you need)
- A compatible GPU and up-to-date drivers.
- llama.cpp source (clone from ggml-org/llama.cpp).
- GGUF or converted model (quantized models work best).
- Build toolchain: CMake, compiler, vendor SDK (CUDA toolkit, ROCm/HIP, Vulkan SDK, Android NDK for mobile).
- Patience: mobile builds can need device-specific tweaks.
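Getting the source is the same for every backend; cloning the default branch of the repo looks like this:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp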
How to build (examples)
Always check the project build docs first: the CMake option names have changed across versions (older guides use LLAMA_* options; current builds use GGML_*). Below are example commands you can adapt. They're labelled as examples; your flags or SDK paths may differ.
Example: CUDA (Nvidia)
Install the CUDA toolkit and a matching driver. Then build:
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j$(nproc)
Run with (the binary is named llama-cli in current builds; older versions called it main):
./llama-cli -m /path/to/model.gguf --n-gpu-layers 999 --main-gpu 0
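If you have more than one Nvidia card, you can also bias how layers are spread across them. A hedged sketch, assuming two GPUs; the --tensor-split proportions are illustrative:
# Spread offloaded layers across two GPUs at roughly a 3:1 ratio
./llama-cli -m /path/to/model.gguf --n-gpu-layers 999 --split-mode layer --tensor-split 3,1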
Example: HIP (AMD)
Install ROCm/HIP per AMD docs. Then:
mkdir build && cd build
cmake .. -DGGML_HIP=ON
make -j$(nproc)
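For cards that ROCm doesn't officially support, the common community workaround is to override the detected GPU architecture at runtime. A hedged sketch, assuming an RDNA2 card (gfx1030); the right override value depends on your GPU:
# Make ROCm treat the GPU as gfx1030 (RDNA2); pick the value matching your card
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./llama-cli -m /path/to/model.gguf --n-gpu-layers 999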
Example: OpenCL (Adreno / mobile)
For Adreno, the new OpenCL backend is recommended. Build with OpenCL enabled:
mkdir build && cd build
cmake .. -DGGML_OPENCL=ON
make -j$(nproc)
On Android you may also need to set runtime env vars (example from a public run):
export GGML_OPENCL_PLATFORM=0
export GGML_OPENCL_DEVICE=0
export LD_LIBRARY_PATH=/vendor/lib64:$LD_LIBRARY_PATH
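For a full on-device run, the usual route is to cross-compile with the Android NDK and drive the device over adb. A sketch under stated assumptions: the NDK path, ABI level, platform version, and model name are all placeholders you'll need to adapt:
# Cross-compile for Android (the toolchain file ships with the NDK)
cmake .. -DGGML_OPENCL=ON \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28
make -j$(nproc)
# Push the binary and a small quantized model, then run on the device
adb push bin/llama-cli /data/local/tmp/
adb push model-q4_k_m.gguf /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama-cli -m model-q4_k_m.gguf --n-gpu-layers 99 -p 'Hello'"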
Example: Vulkan (Android/Linux)
Vulkan is supported but can be hit-or-miss depending on drivers. Enable Vulkan at build time:
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON
make -j$(nproc)
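To check that the driver actually exposes a usable Vulkan device, vulkaninfo (shipped with the Vulkan SDK and packaged by most Linux distros) lists what the loader sees:
# The GPU should appear with its driver version; if not, fix drivers first
vulkaninfo --summary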
On Android, some GPUs (the RDNA-based Xclipse in Samsung Exynos chips) show good performance; others (Adreno, Mali) can be slower without driver fixes; see the community issue and discussion about Vulkan speed.
Example: Metal (macOS)
On macOS, build with Metal (the Metal backend is enabled by default on Apple Silicon, so the explicit flag is optional):
mkdir build && cd build
cmake .. -DGGML_METAL=ON
make -j$(sysctl -n hw.ncpu)
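A quick smoke test on an M-series Mac; the model path is a placeholder:
./llama-cli -m ~/models/model-q4_k_m.gguf --n-gpu-layers 999 -p "Hello"
# Startup logs should mention the Metal backend if offload is active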
Key flags you’ll use
- --n-gpu-layers: Number of layers to offload to the GPU. Set it high (e.g. 999) to try to fit the whole model on the GPU; set it lower for a partial offload.
- --main-gpu: Which device ID is the main GPU (0 by default).
- --split-mode: How a model is split across multiple GPUs: none (single GPU), layer (the default), or row. Note that CPU/GPU splitting is controlled by --n-gpu-layers, not this flag. A combined example follows this list.
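Putting the flags together, a sketch of a partial offload on a single GPU; the layer count depends on your VRAM and model:
# Offload 35 layers to GPU 0; remaining layers run on the CPU
./llama-cli -m /path/to/model.gguf --n-gpu-layers 35 --main-gpu 0 --split-mode none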
How to verify the GPU is being used
- Check the llama.cpp startup logs; they name the GPU backend if it was built in (see the repo, and the log-grep sketch after this list).
- For CUDA: run nvidia-smi to see utilization.
- For OpenCL: use clinfo or platform tools to confirm devices.
- Watch system CPU usage drop when the GPU is active; latency should fall and throughput rise.
- Note: some Python wrappers and older builds report GPU usage incorrectly; see a common detection thread on Stack Overflow.
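One quick check is to grep the startup output for the layer-offload line. The exact log text varies by version, so treat the pattern as an assumption:
# Look for lines like "offloaded 33/33 layers to GPU" in the startup output
./llama-cli -m /path/to/model.gguf --n-gpu-layers 999 -p "test" 2>&1 | grep -i offload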
Troubleshooting: common problems and fixes
- Build fails: Install the vendor SDK (CUDA/ROCm/Vulkan SDK/Android NDK). Read the build logs and add include/library paths to CMake as needed.
- GPU not used at runtime: Rebuild with the backend enabled. Confirm binary was built after changing CMake flags. Check logs and environment variables.
- Very slow on Android Vulkan: Performance varies by GPU/driver. See the community notes: issue #8705 and Vulkan discussion. Try the OpenCL backend for Adreno (see Qualcomm blog).
- Out of VRAM: Use quantized GGUF models, or lower --n-gpu-layers so some layers stay on the CPU (a sketch follows this list). Guides on quantization and model conversion help; see the community guide: llama.cpp guide.
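A minimal sketch of the partial-offload fix, assuming a 4-bit quantized model and that roughly 20 layers fit in VRAM; tune the number for your card:
# Start low and raise --n-gpu-layers until you hit the VRAM limit
./llama-cli -m /path/to/model-q4_k_m.gguf --n-gpu-layers 20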
FAQ
Can I run Llama 3 8B on a phone GPU?
Maybe. Some Samsung Exynos phones (RDNA-based Xclipse GPUs) and recent flagship Qualcomm chips can run trimmed/quantized 8B models with the right backend. Expect compromises: quantization and partial offload are essential. See the Qualcomm OpenCL announcement: OpenCL backend.
Why is Vulkan slow on my device?
Drivers and shader compiler quality vary. Community threads show Adreno and Mali behave differently; follow the linked issues for device-specific tips.
Where do I get quantized models?
Use converters and community tools to produce GGUF quantized models. See model conversion notes in the llama.cpp repo and community guides like this guide.
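A hedged sketch of the usual conversion path, assuming a Hugging Face model directory and the tools shipped in the llama.cpp repo (script and binary names have changed across versions):
# Convert a Hugging Face model directory to GGUF (script lives in the repo root)
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
# Quantize to 4-bit; the binary was called quantize in older builds
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M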
Next steps and links
- Read the main repo and build docs: ggml-org/llama.cpp.
- Adreno & OpenCL announcement: Qualcomm blog.
- Community guide with flags and tips: llama.cpp guide.
- Example environment tweaks for OpenCL runs: Kaggle run notes.
What to try now: pick your hardware, follow the relevant subsection above, build with the matching backend, and run a small test model with --n-gpu-layers 8 to confirm behaviour. If you hit a specific error, open a GitHub issue with logs; the community moves fast.