Run Qwen on Android / Fast setup and fixes
Run Qwen3/Qwen3.5 on Android for offline chat or ADB phone control. Pick q4_0 and fix OOM, slow TPS, and wrong tap coordinates.

Short answer (what works today):
- Low-end Android (4GB RAM): Run Qwen3.5-0.8B (or smaller) on-device for offline Q&A and summaries.
- Flagship Android (12GB RAM class): Qwen3 9B can be usable in GGUF q4_0 if you keep context modest and watch thermals.
- If you need the phone to take actions: Host a Qwen3-VL/Qwen3.5-VL vision-language model on a PC/GPU and control Android via screenshots + adb (PhoneDriver-style).
This guide is reliability-first: pick the right setup, apply safe defaults (especially num_ctx), and use the troubleshooting steps to reduce trial-and-error.
1) Choose your path: on-device chat vs remote agent control
Status check: “Qwen on Android” can mean three different setups. Pick one before downloading anything.
- On-device private chat (offline): Use a GGUF-capable Android runner (llama.cpp-class) with a small Qwen3.5 model for privacy and airplane mode.
- Cloud chat client (fastest): Use an Android client app that talks to a hosted Qwen endpoint for quick testing.
- Android automation agent (actions, taps, swipes): Run a VLM on a host machine and loop: screenshot → interpret UI → execute via ADB → repeat.
Forward plan: For offline chat, continue with the on-device sections. For “do things on my phone,” use the agent section and treat Android as the target device.
2) Model selection matrix (RAM, size, quantization)
Status check: Most failures come from oversizing the model or the context window. Start conservative and scale up after stability is proven.
| Phone tier | Typical RAM | Recommended starting model | Quantization guidance | What to expect |
|---|---|---|---|---|
| Low-end | 4GB | Qwen3.5-0.8B | Use 4-bit if available; keep num_ctx small | Offline summaries, short chats, basic multilingual |
| Mid-range | 6–8GB | Qwen3.5-2B (or 0.8B for speed) | Prefer 4-bit GGUF (example: q4_0) | Better instruction following; still watch thermals |
| Upper mid / older flagship | 8–12GB | Qwen3.5-4B | 4-bit; avoid huge contexts | More stable reasoning; usable coding help |
| Flagship | 12GB+ | Qwen3 9B | Start with GGUF q4_0; increase context only after stable | Higher-quality answers; risk of OOM and low TPS |
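The matrix can be encoded as a small lookup so that scripts and teammates start from the same conservative choice. This is a minimal sketch: the function name is hypothetical, and where two tiers share a boundary (8GB) it assumes the smaller model, in line with the "start conservative" advice.

```python
def starting_model(ram_gb: float) -> str:
    """Return the matrix's suggested starting model for a given amount of RAM (GB)."""
    if ram_gb >= 12:
        return "Qwen3 9B"       # flagship tier
    if ram_gb > 8:
        return "Qwen3.5-4B"     # upper mid / older flagship
    if ram_gb >= 6:
        return "Qwen3.5-2B"     # mid-range (8GB resolves here, the smaller option)
    return "Qwen3.5-0.8B"       # low-end, and anything below the table's range


for ram in (4, 8, 12):
    print(ram, "GB ->", starting_model(ram))
```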
Notes that prevent wasted downloads:
- Best quantization for mobile: start with 4-bit (for example q4_0). If you hit OOM, reduce model size before chasing more aggressive quants.
- RAM for Qwen 9B: treat 12GB as a practical floor for 9B in 4-bit, and keep context tight.
- Naming mismatch risk: some tooling maps tags to specific builds. Verify which artifact (size, quant, variant) you actually downloaded.
Rollback plan: If you crash or stutter, drop one model tier (9B → 4B → 2B → 0.8B) before changing anything else.
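A rough size estimate helps explain the tiers above. The q4_0 format stores each block of 32 weights in 18 bytes (16 bytes of 4-bit values plus a 2-byte scale), which works out to about 4.5 bits per weight. The sketch below covers weights only and is an approximation: real GGUF files may keep some tensors at higher precision, and the KV cache, the runner, and Android itself all need memory on top.

```python
Q4_0_BITS_PER_WEIGHT = 4.5  # 18 bytes per block of 32 weights


def q4_0_weight_gb(params_billion: float) -> float:
    """Approximate q4_0 weight size in GB (10^9 bytes) for a model of the given size."""
    return params_billion * Q4_0_BITS_PER_WEIGHT / 8


for size in (0.8, 2, 4, 9):
    print(f"{size}B parameters -> roughly {q4_0_weight_gb(size):.2f} GB of weights")
```

By this estimate a 9B model needs about 5GB for weights alone, which is why 12GB devices are treated as the floor once the system and context are accounted for.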
3) Prerequisites checklist (don’t skip these)
Status check: Local inference is sensitive to storage, thermals, and background memory pressure. Prepare the device first to reduce false failures.
- Storage: ensure free space for model + cache (plan 3–10GB depending on model/quant).
- Thermals: remove the case for testing; expect throttling during long runs.
- Battery: plug in and enable performance mode if available.
- For ADB agent: enable Developer Options → USB debugging; install platform-tools on the host machine.
- Permissions: only grant screen capture or storage access to apps you trust.
Forward plan: Once this is green, on-device chat is typically under an hour. Agent control usually takes longer due to ADB and resolution calibration.
4) On-device setup (offline): GGUF runner + safe defaults
Status check: The most common slow-TPS and crash issues come from oversized context and uncapped generation. Make these limits explicit.
Step 1: Download the right artifact (GGUF + quantization)
- Pick a Qwen3/Qwen3.5 model size from the table above.
- Prefer GGUF quantizations (example: q4_0) for llama.cpp-class runners.
- Use instruct variants for chat; base variants are better suited to fine-tuning workflows.
Step 2: Import into your Android runner
Use a reputable Android app that loads GGUF models locally (often llama.cpp-based). Import the GGUF, select the Qwen chat template if available, and run a short smoke test.
Step 3: Apply safe context defaults (prevents instability)
Command/check equivalents (your UI may use different names):
- Context window: set num_ctx to 2048 (or 4096 on 12GB+ only if stable).
- Generation cap: set num_predict (max new tokens) to 256–512.
- Threads: start on auto; reduce threads if heat or throttling is severe.
Operational note: large contexts plus unlimited generation can cause memory spikes and lockups. Keep num_ctx and num_predict explicit.
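If your runner exposes the llama.cpp command line (for example in a terminal app), the same limits map to flags: num_ctx and num_predict are Ollama-style names, while llama.cpp's CLI uses -c for context size, -n for tokens to predict, and -t for threads. The sketch below only builds the argument list; the helper name and the model path are hypothetical, and GUI runners will expose these settings under their own labels.

```python
def llama_cli_args(model_path, num_ctx=2048, num_predict=256, threads=None):
    """Build a llama-cli argument list with explicit context and generation caps."""
    if num_ctx <= 0 or num_predict <= 0:
        # llama.cpp treats a negative -n as "no limit"; reject it so the cap stays explicit.
        raise ValueError("num_ctx and num_predict must be positive")
    args = ["llama-cli", "-m", model_path, "-c", str(num_ctx), "-n", str(num_predict)]
    if threads is not None:
        args += ["-t", str(threads)]  # omit to leave thread selection to the runner
    return args


# Hypothetical model file name, shown only to illustrate the resulting command.
print(" ".join(llama_cli_args("qwen-model-q4_0.gguf")))
```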
Step 4: Quick mini-benchmark (5 prompts, log TPS)
Run these prompts back-to-back and record tokens/sec (TPS), temperature behavior, and any OOM. Keep the test consistent so you can compare changes.
- “Summarize this in 5 bullets: <paste 1–2 paragraphs>”
- “Write a short email asking for a meeting next week.”
- “Explain DNS caching like I’m new to networking.”
- “Generate a regex to match ISO 8601 dates.”
- “Translate this to Spanish and back: <one sentence>”
Forward plan: Increase num_ctx only in small steps. If TPS collapses or you OOM, revert to the last known-good settings.
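To keep the benchmark comparable between runs, it helps to compute TPS the same way every time. This is a minimal sketch, assuming a hypothetical generate callable that sends one prompt to your runner and returns the number of tokens it produced; how you obtain that count depends on the app or API you use.

```python
import time
from typing import Callable, Iterable


def measure_tps(generate: Callable[[str], int], prompts: Iterable[str]) -> list:
    """Run each prompt once and record token count, wall time, and tokens/sec."""
    rows = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)          # hypothetical hook into your runner
        seconds = time.perf_counter() - start
        tps = n_tokens / seconds if seconds > 0 else 0.0
        rows.append({"prompt": prompt, "tokens": n_tokens, "seconds": seconds, "tps": tps})
    return rows
```

Run the same five prompts in the same order after every settings change, and note device temperature alongside each row so that throttling shows up in the log.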
5) Agent setup (PhoneDriver-style): Qwen3-VL + screenshots + ADB
Status check: If you want actions (taps, swipes, navigation), on-device text chat is not enough. You typically need a VLM loop plus ADB execution.
Step 1: Install and verify ADB (host machine)
On your host (Linux/macOS/Windows), install Android platform-tools, then run:
adb version
adb devices
Expected: your device shows as device (not unauthorized).
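An agent script can make the same check before it starts. The sketch below parses the text printed by adb devices into a serial-to-state mapping; the function name and the serial numbers in the sample are made up for illustration.

```python
def parse_adb_devices(output: str) -> dict:
    """Map each device serial to its state ('device', 'unauthorized', 'offline', ...)."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip the header, blank lines, and daemon start-up messages (prefixed with '*').
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


sample = "List of devices attached\nABC123\tdevice\nXYZ789\tunauthorized\n"
ready = [s for s, state in parse_adb_devices(sample).items() if state == "device"]
print(ready)
```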
Step 2: Connect the phone and approve debugging
- Enable Developer Options → USB debugging.
- Plug in USB and accept the “Allow USB debugging” prompt.
- Re-run adb devices to confirm it’s authorized.
Step 3: Run the agent UI and calibrate resolution
Many PhoneDriver-like agents provide a web UI and a config file (for example config.json) with parameters like step_delay and max_retries. Calibrate before real tasks.
- Check resolution: adb shell wm size
- Validate taps: run a single tap test and confirm it lands correctly.
- Set conservative retries: start with max_retries=3 and step_delay=0.5–1.0s.
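The calibration steps can be scripted on the host. This sketch parses the output of adb shell wm size (which prints a "Physical size" line and, if the resolution was changed, an "Override size" line) and builds a bounds-checked adb shell input tap command. The helper names are hypothetical, and it assumes taps should use the override size when one is set.

```python
import re


def parse_wm_size(output: str) -> tuple:
    """Return (width, height) from `adb shell wm size` output; an override size wins."""
    sizes = dict(re.findall(r"(Physical|Override) size:\s*(\d+x\d+)", output))
    chosen = sizes.get("Override") or sizes.get("Physical")
    if chosen is None:
        raise ValueError("no display size found in wm size output")
    width, height = (int(v) for v in chosen.split("x"))
    return width, height


def tap_command(x: int, y: int, size: tuple) -> list:
    """Build an `adb shell input tap` command, refusing off-screen coordinates."""
    width, height = size
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"tap ({x}, {y}) is outside {width}x{height}")
    return ["adb", "shell", "input", "tap", str(x), str(y)]


size = parse_wm_size("Physical size: 1080x2400\n")
print(tap_command(size[0] // 2, size[1] // 2, size))  # a single centre tap for testing
```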
Step 4: Run a safe first task (low risk)
Start with non-destructive tasks while you validate screenshot freshness and coordinate mapping.
- “Open Settings and search for Bluetooth. Do not toggle anything.”
- “Open the browser and navigate to example.com.”
Logging requirement: save screenshots and ADB command logs per step. This makes failures reproducible and debuggable.
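One way to meet the logging requirement is a per-run folder with numbered screenshots and a JSON Lines command log. This is a sketch under assumptions: the file layout and function name are illustrative, and the screenshot bytes are expected to come from something like adb exec-out screencap -p captured by your agent.

```python
import json
import time
from pathlib import Path


def log_step(run_dir, step: int, command: list, screenshot_png: bytes = None) -> dict:
    """Save the step's screenshot (if any) and append the command to commands.jsonl."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    entry = {"step": step, "time": time.time(), "command": command, "screenshot": None}
    if screenshot_png is not None:
        shot = run_dir / f"step_{step:03d}.png"
        shot.write_bytes(screenshot_png)
        entry["screenshot"] = shot.name
    with (run_dir / "commands.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Because these files can contain sensitive screen content, pair this with a retention rule rather than keeping run folders indefinitely.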
Agent prompt library (15 tasks with guardrails)
- “Turn on Do Not Disturb for 30 minutes.”
- “Open Calendar and create an event titled ‘Gym’ tomorrow 7pm. Stop before saving and ask for confirmation.”
- “Find the latest OTP in Notifications and copy it to clipboard (do not send it).”
- “Open Messages and draft a reply: ‘Running 10 minutes late’. Do not press send.”
- “Open the camera app and switch to photo mode.”
- “Open Wi‑Fi settings and show the current network name.”
- “Open your notes app and create a note with today’s checklist.”
- “Search the Play Store for ‘password manager’. Do not install anything.”
- “Open Maps and search for ‘coffee’. Do not start navigation.”
- “Open Photos and find screenshots from today.”
- “Increase screen brightness to 60%.”
- “Enable airplane mode, then disable it after 10 seconds.”
- “Check battery percentage and report it.”
- “Open the dialer and type the number; do not place the call.”
- “List the top 5 apps in recent tasks (no interaction beyond viewing).”
Rollback plan: If the agent loops or taps randomly, stop the run, disconnect ADB, and recalibrate resolution/taps before continuing.
6) Troubleshooting decision tree (fast diagnosis + recovery)
Status check: Treat failures like incidents. Stabilize first, then optimize one variable at a time.
A) Out of memory (OOM) on Android
Current status: the app crashes, the model fails to load, or generation stops abruptly.
- Check free RAM: close apps, reboot once, then retry.
- Reduce context: set num_ctx=2048.
- Cap output: set num_predict=256.
- Drop model size: 9B → 4B → 2B → 0.8B.
- Change quantization: if you must keep size, use a smaller-footprint quant (start with 4-bit like q4_0).
Rollback/forward plan: Roll back to the last model that loads reliably. Move forward by changing only one knob at a time.
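The "one knob at a time" rule can be written down as a fallback ladder. The sketch below follows the order of the steps above (context, then output cap, then model size); the config keys and function name are hypothetical and not tied to any particular runner.

```python
from typing import Optional

MODEL_LADDER = ["9B", "4B", "2B", "0.8B"]  # largest to smallest


def next_oom_fallback(config: dict) -> Optional[dict]:
    """Return a copy of config with exactly one setting reduced, or None if none remain."""
    new = dict(config)
    if new["num_ctx"] > 2048:
        new["num_ctx"] = 2048
    elif new["num_predict"] > 256:
        new["num_predict"] = 256
    else:
        idx = MODEL_LADDER.index(new["model"])
        if idx + 1 >= len(MODEL_LADDER):
            return None  # already at the smallest model with minimal settings
        new["model"] = MODEL_LADDER[idx + 1]
    return new


config = {"model": "9B", "num_ctx": 4096, "num_predict": 512}
while config is not None:
    print(config)
    config = next_oom_fallback(config)
```

In practice you would stop at the first configuration that loads and generates reliably, and save it as your known-good baseline.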
B) Slow tokens/sec (TPS)
Current status: generation works but is too slow or degrades over time.
- Check thermals: if the phone is hot, you are throttling.
- Reduce num_ctx: large contexts can tank TPS.
- Reduce threads: fewer threads can stabilize clocks and reduce sustained throttling.
- Shorten prompts: long chat histories inflate compute; summarize and restart sessions.
- Try a smaller model: 4B → 2B is often the best speed/quality trade.
Rollback/forward plan: Return to a smaller model with 2K context until TPS is stable. Increase quality gradually after that baseline holds.
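Degradation over time is easier to spot from logged TPS values than by feel. A minimal sketch: compare the average of the last few readings with the first few. The window of 3 and the 70% threshold are illustrative assumptions, not tuned values.

```python
from statistics import mean


def tps_degraded(samples: list, window: int = 3, drop_ratio: float = 0.7) -> bool:
    """True if the last `window` TPS readings average below drop_ratio x the first `window`."""
    if len(samples) < 2 * window:
        return False  # not enough data to compare start and end of the run
    return mean(samples[-window:]) < drop_ratio * mean(samples[:window])


print(tps_degraded([10.0, 10.2, 9.8, 6.1, 5.9, 6.0]))
```

A sustained drop on a hot phone usually points to throttling, so let the device cool before concluding that a settings change made things slower.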
C) ADB device not detected
Current status: the agent can’t see the phone, and adb devices is empty or shows unauthorized.
- Run adb devices and confirm the state (device expected).
- Toggle USB debugging off/on, replug USB, and approve the prompt.
- Restart ADB: adb kill-server then adb start-server.
- Swap cable/port and avoid charge-only cables.
- Close other tools that may hold the device (Android Studio, other ADB sessions).
Rollback/forward plan: Return to a clean ADB session (one host, one cable) before debugging the agent stack.
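The restart sequence can be wrapped in one host-side helper so it runs the same way each time. This is a sketch: the function name is hypothetical, and the run parameter exists only so the sequence can be exercised without a phone attached.

```python
import subprocess


def restart_adb(run=subprocess.run) -> str:
    """Restart the ADB server and return the fresh `adb devices` output."""
    # kill-server may fail harmlessly if no server is running, so do not raise on it.
    run(["adb", "kill-server"], check=False)
    run(["adb", "start-server"], check=True)
    result = run(["adb", "devices"], check=True, capture_output=True, text=True)
    return result.stdout
```

If the returned text still lists the phone as unauthorized, the fix is on the device (approve the debugging prompt), not on the host.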
D) Wrong tap coordinates
Current status: the agent taps the wrong UI element or misses buttons.
- Verify resolution: run adb shell wm size and match it in agent config.
- Check display scaling: non-default display size or font can shift coordinate mapping.
- Recalibrate: re-run the agent’s calibration or auto-detect if supported.
- Slow it down: increase step_delay so the UI settles before taps.
- Confirm screenshot freshness: stale screenshots cause correct actions on the wrong screen.
Rollback/forward plan: Use manual coordinate tests until taps land correctly. Re-enable the full loop with limited retries.
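A common cause of offset taps is that the model sees a resized screenshot while ADB expects device pixels. The sketch below rescales a point from screenshot space to device space and clamps it to the screen; it assumes a plain proportional resize with no cropping or letterboxing, and the function name is hypothetical.

```python
def scale_to_device(x: float, y: float, shot_size: tuple, device_size: tuple) -> tuple:
    """Map a point from screenshot pixels to device pixels, clamped to the screen."""
    shot_w, shot_h = shot_size
    dev_w, dev_h = device_size
    dev_x = round(x * dev_w / shot_w)
    dev_y = round(y * dev_h / shot_h)
    # Clamp so rounding at the edges never produces an off-screen coordinate.
    return min(max(dev_x, 0), dev_w - 1), min(max(dev_y, 0), dev_h - 1)


# The model saw a 540x1200 screenshot; the device reports 1080x2400.
print(scale_to_device(270, 600, (540, 1200), (1080, 2400)))
```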
7) Safety and privacy (screenshots are data)
Status check: Agents often capture screenshots and store logs that can include OTPs, messages, and financial data. Handle this as sensitive data.
- Set log retention: delete screenshots after runs unless needed for QA evidence.
- Scope permissions: avoid broad storage access unless required.
- Use confirmation gates: require “ask before send/buy/install” in prompts.
- Separate test accounts: use non-production accounts for QA/RPA flows.
Forward plan: Apply least privilege, short retention, and auditable logs as default operating practice.
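Prompt-level gates are worth backing with a check on the host before an action is executed. This is a deliberately simple sketch that flags the send/buy/install verbs named above; the function name is hypothetical, and keyword matching is only a first line of defence that will miss paraphrases.

```python
import re

RISKY_WORDS = ("send", "buy", "install")  # the verbs from the confirmation-gate rule


def needs_confirmation(action_description: str) -> bool:
    """True if the planned action mentions a risky verb as a whole word."""
    text = action_description.lower()
    return any(re.search(rf"\b{re.escape(word)}\b", text) for word in RISKY_WORDS)


for action in ("Tap the Send button", "Open Settings"):
    print(action, "->", "ask first" if needs_confirmation(action) else "ok")
```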
8) Next steps (upgrade safely)
Status check: After you have a stable baseline, improvements become predictable. Change one variable at a time and compare against your benchmark.
- Upgrade the model: move from 0.8B → 2B → 4B when device headroom is proven.
- Extend automation: add structured tool calling on the host side, but keep ADB execution gated.
- Standardize benchmarks: keep the 5-prompt benchmark and compare TPS and error rate per change.
Rollback/forward plan: Keep one known-good model + config saved. Revert immediately if stability regresses.
FAQ
How to run Qwen3 9B q4_0 on Android?
Use a GGUF-capable Android runner, download Qwen3 9B in q4_0, and start with num_ctx=2048 and num_predict=256. Increase context gradually only after stable runs. If you OOM, drop to 4B.
Can I run Qwen3.5 0.8B on Android 4GB RAM?
Yes. It is the recommended starting point for 4GB devices. Keep context small, close background apps, and expect best results on short tasks.
Is there an LMStudio Qwen3 GGUF Android alternative?
On Android, use a mobile GGUF runner rather than desktop tools like LM Studio. Look for the same controls: num_ctx, max new tokens, threads, and chat template support.
Can I use Ollama Qwen3 num_ctx recommended settings on Android?
The principle carries over: keep num_ctx explicit and modest to avoid instability. Start at 2048 and increase only after stable runs.
Should I use ExecuTorch export or MNN to run a Qwen model on Android?
If you are shipping an embedded app and need tight integration, ExecuTorch/MNN can fit. For fastest setup and iteration, start with GGUF runners (offline chat) or a host-run VLM + ADB agent, then optimize later.
Structured data (copy/paste)
HowTo JSON-LD
{
"@context": "https://schema.org",
"@type": "HowTo",
"name": "Run Qwen on Android / Fast setup and fixes",
"description": "Run Qwen3/Qwen3.5 on Android for on-device chat or set up a Qwen VLM agent to control an Android phone via ADB. Includes model sizing, q4_0 guidance, and troubleshooting.",
"step": [
{
"@type": "HowToStep",
"name": "Choose a path",
"text": "Decide between on-device GGUF chat, cloud client, or host-run VLM + ADB agent."
},
{
"@type": "HowToStep",
"name": "Select model and quantization",
"text": "Pick 0.8B/2B/4B/9B based on RAM. Start with 4-bit (q4_0) and modest context."
},
{
"@type": "HowToStep",
"name": "Configure safe defaults",
"text": "Set num_ctx=2048 and cap generation (num_predict=256–512) to prevent instability."
},
{
"@type": "HowToStep",
"name": "Validate with a mini benchmark",
"text": "Run 5 test prompts and record tokens/sec, thermals, and errors."
},
{
"@type": "HowToStep",
"name": "Troubleshoot and rollback",
"text": "If OOM/slow TPS/wrong taps occur, apply the decision tree and revert to last known-good config."
}
]
}
FAQPage JSON-LD
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How to run Qwen3 9B q4_0 on Android?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Use a GGUF-capable Android runner, start with q4_0, set num_ctx=2048 and num_predict=256, then increase gradually if stable."
}
},
{
"@type": "Question",
"name": "Can I run Qwen3.5 0.8B on Android 4GB RAM?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. It is the best starting point for 4GB devices. Keep context small and close background apps."
}
},
{
"@type": "Question",
"name": "How do I fix ADB device not detected for an AI agent?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Approve USB debugging, replug the cable, and restart ADB with adb kill-server && adb start-server, then re-run adb devices."
}
}
]
}
Internal links to go deeper:
GGUF quantization explained (q4_0 vs q4_K_M), ADB basics for automation, Qwen3 vs Qwen3.5 for agent workflows.


