Gemini Live API Tutorial: Build a Voice AI
Build a real-time voice assistant with Gemini Live API. Follow 5 steps, copy code, and ship with a handy production checklist.

You’ll build a small, real-time voice assistant in your browser. We’ll stream your mic to Gemini and play voice replies back with low delay. You can also interrupt the bot mid-sentence. We’ll keep it simple and focused.
What you’ll build (in 5 steps)
- Open a live session with the Gemini Live API (or through Vertex AI; see the Vertex AI Live guide).
- Capture mic audio with the Web Audio API.
- Stream audio chunks to Gemini over a WebSocket and get audio replies back.
- Let users interrupt the model while it’s speaking.
- Close the session cleanly. Follow a production checklist.
Quick notes and limits
- Low-latency, two-way voice is the point. See the overview: Gemini Live.
- Voice activity detection (VAD) is on by default, per Firebase Live API docs.
- Sessions time out (default ~30 minutes). Token counting isn’t available on Live, as noted in the docs.
- Transcription may not be built-in yet. You can pair with an external STT (see audio transcription on Google Cloud).
- Follow the Gemini API Terms of Service.
How it works (simple view)
| Part | What it does | Why it matters |
| --- | --- | --- |
| Live session | Opens a bidirectional stream | Both sides can talk and interrupt |
| Web Audio API | Grabs mic and plays audio | Low delay, runs in the browser |
| WebSocket/gRPC | Sends and receives chunks | Real-time flow (not request/response) |
| VAD | Auto-detects when you speak | Cleaner turns |
Prerequisites
- Google Cloud project with Gemini access. See Vertex AI setup.
- API key or auth token (server-side recommended).
- Browser with mic permission, Node.js 18+ for local dev.
- Headphones help prevent echo.
Step 1: Project setup
- Create a new folder and a simple static page that serves over HTTPS (for mic access). You can use a tiny Node server or your favorite dev server.
- Store your API key on the server. Don’t expose secrets in the browser; a minimal proxy sketch follows the file layout below.
Minimal file layout
// project
// public/index.html
// public/app.js
// server.js (proxies auth)
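The server piece stays small: it serves the static files, keeps the API key in an environment variable, and hands the page a short-lived token (or proxies the WebSocket itself). Below is a minimal sketch assuming Express and a hypothetical /token route; swap the token-minting placeholder for whatever auth flow the Live API or Vertex docs describe for your setup. For local dev, http://localhost counts as a secure context, so mic access works without HTTPS.
// server.js (sketch only — assumes Express; the /token route and its payload are placeholders)
const express = require('express');
const path = require('path');

const app = express();
app.use(express.static(path.join(__dirname, 'public')));

// Hand the browser a short-lived token instead of the raw API key.
app.get('/token', (req, res) => {
  // TODO: mint or exchange a short-lived token here using process.env.GEMINI_API_KEY,
  // following the auth mechanism described in the Live API / Vertex docs.
  res.json({ token: 'short-lived-token-from-your-auth-flow' });
});

app.listen(3000, () => console.log('Dev server on http://localhost:3000'));
On the client, fetch that route (for example, await (await fetch('/token')).json()) before constructing LiveClient in Step 5.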
Step 2: Open a Live session
We’ll show a simple WebSocket-style client. Endpoints and headers can change. Always check the official Firebase Live API or Vertex guide for the current URL, auth, and audio formats.
// public/app.js
class LiveClient {
  constructor({ url, token, model }) {
    this.url = url;     // e.g., your server proxy that signs and connects
    this.token = token; // fetched from your server
    this.model = model; // e.g., "gemini-2.0" variant
    this.ws = null;
    this.onAudio = () => {};
    this.onText = () => {};
    this.onOpen = () => {};
    this.onClose = () => {};
    this.onError = (e) => console.error(e);
  }

  connect() {
    // Not every browser accepts a relative WebSocket URL, so expand a path like
    // '/live-proxy' into an absolute ws:// or wss:// URL.
    const base = this.url.startsWith('ws')
      ? this.url
      : `${location.protocol === 'https:' ? 'wss:' : 'ws:'}//${location.host}${this.url}`;
    this.ws = new WebSocket(`${base}?model=${encodeURIComponent(this.model)}&token=${encodeURIComponent(this.token)}`);
    this.ws.binaryType = 'arraybuffer';
    this.ws.onopen = () => this.onOpen();
    this.ws.onclose = () => this.onClose();
    this.ws.onerror = (e) => this.onError(e);
    this.ws.onmessage = (evt) => {
      // Example: audio chunks as ArrayBuffer, text as JSON
      if (evt.data instanceof ArrayBuffer) {
        this.onAudio(evt.data);
      } else {
        try {
          const msg = JSON.parse(evt.data);
          if (msg.type === 'text') this.onText(msg.text);
        } catch {}
      }
    };
  }

  sendText(text) {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;
    this.ws.send(JSON.stringify({ type: 'input_text', text }));
  }

  sendAudioChunk(arrayBuffer) {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;
    this.ws.send(arrayBuffer);
  }

  stopResponse() {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;
    this.ws.send(JSON.stringify({ type: 'stop_output' })); // check docs for exact signal
  }

  close() {
    this.ws?.close(1000, 'client_done');
  }
}
Step 3: Capture mic audio (Web Audio API)
We’ll downsample to 16 kHz PCM16. This is a common format for streaming voice. Check audio formats in the docs.
// public/app.js (add)
class AudioRecorder {
  async start(onChunk) {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    const src = this.ctx.createMediaStreamSource(this.stream);
    const processor = this.ctx.createScriptProcessor(4096, 1, 1);
    src.connect(processor);
    processor.connect(this.ctx.destination);
    processor.onaudioprocess = (e) => {
      const input = e.inputBuffer.getChannelData(0);
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        const s = Math.max(-1, Math.min(1, input[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      }
      onChunk(pcm16.buffer);
    };
  }

  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}
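One caveat on the recorder above: createScriptProcessor is deprecated and runs audio processing on the main thread. It still works and keeps the example short, but for production you may want an AudioWorklet. Here is a rough sketch of the same capture path (16 kHz mono, Float32 frames converted to PCM16 on the main thread); the worklet is loaded from a Blob URL so everything stays in one file, and the class and processor names are just placeholders.
// public/app.js (optional AudioWorklet variant — a sketch, names are placeholders)
const workletSource = `
  class PcmCaptureProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0];
      if (channel) this.port.postMessage(channel.slice(0)); // copy; the underlying buffer is reused
      return true; // keep the processor alive
    }
  }
  registerProcessor('pcm-capture', PcmCaptureProcessor);
`;

class WorkletRecorder {
  async start(onChunk) {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.ctx = new AudioContext({ sampleRate: 16000 });
    const moduleUrl = URL.createObjectURL(new Blob([workletSource], { type: 'application/javascript' }));
    await this.ctx.audioWorklet.addModule(moduleUrl);
    const src = this.ctx.createMediaStreamSource(this.stream);
    this.node = new AudioWorkletNode(this.ctx, 'pcm-capture');
    this.node.port.onmessage = (e) => {
      const input = e.data; // Float32Array from the worklet
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        const s = Math.max(-1, Math.min(1, input[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      }
      onChunk(pcm16.buffer);
    };
    src.connect(this.node);
    this.node.connect(this.ctx.destination); // outputs silence; keeps the node in the graph
  }

  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}
WorkletRecorder is a drop-in replacement for AudioRecorder in Step 5; note that worklet frames are tiny (128 samples), so you may want to batch a few before sending.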
Step 4: Play Gemini’s audio reply
We’ll play raw PCM16. Your API may return Opus or other formats. Adjust decode logic to match the docs.
// public/app.js (add)
class AudioPlayer {
  constructor() {
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }

  async enqueuePcm16(arrayBuffer) {
    const pcm16 = new Int16Array(arrayBuffer);
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) float32[i] = Math.max(-1, Math.min(1, pcm16[i] / 0x7FFF));
    const buf = this.ctx.createBuffer(1, float32.length, 16000);
    buf.copyToChannel(float32, 0);
    this.queue.push(buf);
    if (!this.playing) this._drain();
  }

  _drain() {
    const next = this.queue.shift();
    if (!next) { this.playing = false; return; }
    this.playing = true;
    const src = this.ctx.createBufferSource();
    src.buffer = next;
    src.connect(this.ctx.destination);
    src.onended = () => this._drain();
    src.start();
  }

  stop() {
    this.ctx.close();
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }
}
Step 5: Wire it together with interrupt
When the user starts speaking, stop the bot’s audio and send mic chunks. This creates a natural back-and-forth feel (barge-in).
// public/app.js (usage)
const client = new LiveClient({ url: '/live-proxy', token: '<fetched-from-server>', model: 'gemini-live' });
const recorder = new AudioRecorder();
const player = new AudioPlayer();
client.onOpen = () => console.log('Live connected');
client.onAudio = (ab) => player.enqueuePcm16(ab);
client.onText = (t) => console.log('Gemini:', t);
client.connect();
// Start talking to the model
async function startTalking() {
  // interrupt: stop model audio output before sending new input
  player.stop();
  client.stopResponse();
  await recorder.start((chunk) => client.sendAudioChunk(chunk));
}

// Stop sending audio
async function stopTalking() {
  await recorder.stop();
}

// Optional: send text too
function sendTextPrompt(text) {
  player.stop();
  client.stopResponse();
  client.sendText(text);
}
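To exercise startTalking and stopTalking quickly, a hold-to-talk button is enough. The markup and element ID below are placeholders — use whatever you already have in index.html.
// public/index.html could contain: <button id="talk">Hold to talk</button>
// Hypothetical wiring — adjust element IDs and events to your own markup.
const talkBtn = document.getElementById('talk');
talkBtn.addEventListener('pointerdown', () => startTalking());
talkBtn.addEventListener('pointerup', () => stopTalking());
talkBtn.addEventListener('pointerleave', () => stopTalking());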
Test plan
- Open the page over HTTPS. Allow mic.
- Say a short question. You should hear a quick voice reply.
- Start talking again while the bot is speaking. The bot should pause, then answer your new question.
Troubleshooting
- Mic not working in the browser? Check permissions and see this thread on mic issues.
- Want to stop voice output on device? See how to turn off voice.
- Slow or missed wake-ups? The community has reported a Gemini Live issue on some devices. Try wired headphones and reduce background noise.
- Session ends early? Keep inputs small. The Live API docs warn about large chunks and session time limits.
Production-Readiness Checklist
- Auth: Keep API keys server-side. Mint short-lived tokens. Enforce CORS.
- Latency: Use small audio chunks (100–200 ms). Keep WebSocket alive with pings.
- Interrupts: Always stop playback before sending new input. Clear audio queues.
- Resilience: Auto-retry on transient drops with backoff (see the sketch after this checklist). Recreate the AudioContext on unlock.
- Limits: Respect session length and rate limits (see Live API docs).
- Costs: Log session start/stop and audio seconds. Sample usage in staging.
- Privacy: Don’t send PII unless users consent. Follow the Terms of Service.
- Formats: Match the model’s required audio format (PCM/Opus). Test end-to-end on real devices.
- Transcription: If needed, pair a third-party STT as Live may not expose transcripts yet. See this transcription article.
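The resilience and latency items mostly come down to a little glue around LiveClient. A sketch is below; it assumes the LiveClient from Step 2, and the 'ping' message type is a placeholder — check the docs for whatever keepalive mechanism (if any) the Live endpoint actually expects.
// public/app.js (optional — reconnect with exponential backoff plus a keepalive)
// Assumes the LiveClient from Step 2; fold your existing onOpen/onClose handlers
// into these callbacks. The 'ping' message type is a placeholder.
function connectWithRetry(client, { maxDelayMs = 30000 } = {}) {
  let attempt = 0;
  let pingTimer = null;

  client.onOpen = () => {
    attempt = 0; // healthy again, reset the backoff
    pingTimer = setInterval(() => {
      if (client.ws && client.ws.readyState === WebSocket.OPEN) {
        client.ws.send(JSON.stringify({ type: 'ping' })); // placeholder keepalive
      }
    }, 15000);
  };

  client.onClose = () => {
    clearInterval(pingTimer);
    const delay = Math.min(maxDelayMs, 1000 * 2 ** attempt); // 1 s, 2 s, 4 s, ... capped
    attempt += 1;
    setTimeout(() => client.connect(), delay);
  };

  client.connect();
}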
FAQ
Is this REST?
No. It’s a stream. Unlike REST, both sides can send data anytime. For a good explainer, read this post on WebSockets and audio.
Can I count tokens?
The Live API doesn’t support CountTokens. Track usage by session time and audio seconds.
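Audio seconds are easy to derive from the bytes you stream: at 16 kHz mono PCM16, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes. A small counter is all you need:
// public/app.js (optional — rough usage tracking by audio seconds)
const usage = { bytesSent: 0, bytesReceived: 0 };

function trackSent(arrayBuffer) { usage.bytesSent += arrayBuffer.byteLength; }
function trackReceived(arrayBuffer) { usage.bytesReceived += arrayBuffer.byteLength; }

function usageSeconds() {
  const toSeconds = (bytes) => bytes / (16000 * 2); // 32,000 bytes per second of PCM16 mono
  return { sent: toSeconds(usage.bytesSent), received: toSeconds(usage.bytesReceived) };
}
Call trackSent in the recorder callback before client.sendAudioChunk, call trackReceived inside client.onAudio, and log usageSeconds() when the session closes.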
Does it do transcription?
Live may not expose transcripts yet. If you need text, add an STT layer. You can build a small service to send mic audio to STT, then forward text to Gemini.
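If you go that route, the server-side half can be small. The sketch below uses the Google Cloud Speech-to-Text Node.js client (@google-cloud/speech) with the streaming config from its published samples; verify the encoding, sample rate, and auth against the current docs before relying on it.
// server-side sketch: stream 16 kHz PCM16 chunks to Cloud Speech-to-Text
// and forward each transcript (e.g., as a text prompt to Gemini).
const speech = require('@google-cloud/speech');

function createTranscriber(onTranscript) {
  const client = new speech.SpeechClient();
  const recognizeStream = client
    .streamingRecognize({
      config: {
        encoding: 'LINEAR16', // raw PCM16
        sampleRateHertz: 16000,
        languageCode: 'en-US',
      },
      interimResults: false,
    })
    .on('error', console.error)
    .on('data', (data) => {
      const alt = data.results?.[0]?.alternatives?.[0];
      if (alt) onTranscript(alt.transcript);
    });

  return {
    write: (pcmChunk) => recognizeStream.write(pcmChunk), // raw audio bytes
    end: () => recognizeStream.end(),
  };
}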
Which audio format?
Check the docs for supported formats and sample rates. 16 kHz mono PCM16 is a safe starting point. Some setups return Opus. Adjust your decoder.
Why Live beats simple REST for voice
| Feature | Live API | REST |
| --- | --- | --- |
| Turn-taking | Natural, interruptible | Strict request/response |
| Latency | Low, chunked | Higher, full request needed |
| Audio I/O | Built for streams | Extra steps |
Next steps
- Prototype prompts in Vertex AI Studio and review the Live guide.
- Add barge-in polish: stop TTS, send new input fast, then resume playback.
- Ship with the checklist above. Start small, measure latency, then iterate.
Recap
You set up a live connection, streamed mic audio, got voice replies, and added interrupt. With a few guards and tests, this can go to production. Keep docs handy: Firebase Live API and Vertex AI Live.