
Gemini Live API Tutorial: Build a Voice AI

Build a real-time voice assistant with Gemini Live API. Follow 5 steps, copy code, and ship with a handy production checklist.

You’ll build a small, real-time voice assistant in your browser. We’ll stream your mic to Gemini and play voice replies back with low delay. You can also interrupt the bot mid-sentence. We’ll keep it simple and focused.

What you’ll build (in 5 steps)

  1. Project setup: a static page served over HTTPS plus a small server that keeps your API key.
  2. Open a Live session over a WebSocket-style connection.
  3. Capture mic audio with the Web Audio API and stream it as 16 kHz PCM16.
  4. Play Gemini’s audio replies as they stream in.
  5. Wire it together with interrupt (barge-in) so you can cut the bot off mid-sentence.

Quick notes and limits

  • Endpoints, auth, and audio formats can change; always check the current Live API docs.
  • Sessions have time limits, and the Live API doesn’t support CountTokens.
  • Live may not expose transcripts yet; pair an STT service if you need text.

How it works (simple view)

Part | What it does | Why it matters
--- | --- | ---
Live session | Opens a bidirectional stream | Both sides can talk and interrupt
Web Audio API | Grabs mic and plays audio | Low delay, runs in the browser
WebSocket/gRPC | Sends and receives chunks | Real-time flow (not request-response)
VAD | Auto-detects when you speak | Cleaner turns

Prerequisites

  • Google Cloud project with Gemini access. See Vertex AI setup.
  • API key or auth token (server-side recommended).
  • Browser with mic permission, Node.js 18+ for local dev.
  • Headphones help prevent echo.

Step 1: Project setup

  1. Create a new folder and a simple static page that serves over HTTPS (for mic access). You can use a tiny Node server or your favorite dev server.
  2. Store your API key on the server. Don’t expose secrets in the browser.

Minimal file layout

// project
//   public/index.html
//   public/app.js
//   server.js (proxies auth)
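
server.js can stay tiny. Here is a minimal sketch, assuming Express; the /token route and its response shape are placeholders for whatever short-lived credential flow you use. Note that localhost counts as a secure context, so mic access works there without HTTPS.

// server.js — minimal sketch, assuming Express.
// The /token route and its response shape are placeholders for your own auth flow.
const express = require('express');
const path = require('path');

const app = express();
app.use(express.static(path.join(__dirname, 'public')));

// Hand the browser a short-lived token instead of your real API key.
app.get('/token', async (req, res) => {
  // TODO: mint a short-lived token here (e.g., a signed JWT or a cloud SDK call).
  res.json({ token: 'short-lived-token', expiresIn: 300 });
});

// Keep a reference to the HTTP server so the Live proxy in Step 2 can hook into it.
const server = app.listen(3000, () => console.log('Dev server on http://localhost:3000'));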

Step 2: Open a Live session

We’ll show a simple WebSocket-style client. Endpoints and headers can change. Always check the official Firebase Live API or Vertex guide for the current URL, auth, and audio formats.

// public/app.js
class LiveClient {
  constructor({ url, token, model }) {
    this.url = url; // e.g., your server proxy that signs and connects
    this.token = token; // fetched from your server
    this.model = model; // e.g., "gemini-2.0" variant
    this.ws = null;
    this.onAudio = () => {};
    this.onText = () => {};
    this.onOpen = () => {};
    this.onClose = () => {};
    this.onError = (e) => console.error(e);
  }
  connect() {
    this.ws = new WebSocket(`${this.url}?model=${encodeURIComponent(this.model)}&token=${encodeURIComponent(this.token)}`);
    this.ws.binaryType = 'arraybuffer';
    this.ws.onopen = () => this.onOpen();
    this.ws.onclose = () => this.onClose();
    this.ws.onerror = (e) => this.onError(e);
    this.ws.onmessage = (evt) => {
      // Example: audio chunks as ArrayBuffer, text as JSON
      if (evt.data instanceof ArrayBuffer) {
        this.onAudio(evt.data);
      } else {
        try {
          const msg = JSON.parse(evt.data);
          if (msg.type === 'text') this.onText(msg.text);
        } catch {
          // Ignore frames that are not JSON
        }
      }
    };
  }
  sendText(text) {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(JSON.stringify({ type: 'input_text', text }));
  }
  sendAudioChunk(arrayBuffer) {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(arrayBuffer);
  }
  stopResponse() {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(JSON.stringify({ type: 'stop_output' })); // check docs for exact signal
  }
  close() {
    this.ws?.close(1000, 'client_done');
  }
}
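
The client above points at /live-proxy on your own server. Below is a rough proxy sketch using the `ws` package on top of the `server` from the Step 1 sketch. The upstream URL, auth header, and message framing are placeholders, not the real Live API schema, so take the actual values from the docs.

// server.js (add) — rough WebSocket proxy sketch using the `ws` package.
// LIVE_API_URL, the Authorization header, and the framing are placeholders;
// use the real endpoint and message schema from the Live API docs.
const { WebSocketServer, WebSocket } = require('ws');

const wss = new WebSocketServer({ noServer: true });

server.on('upgrade', (req, socket, head) => {
  if (!req.url.startsWith('/live-proxy')) return socket.destroy();
  wss.handleUpgrade(req, socket, head, (client) => {
    // Open the upstream connection with credentials kept server-side.
    const upstream = new WebSocket(process.env.LIVE_API_URL, {
      headers: { Authorization: `Bearer ${process.env.API_KEY}` },
    });
    // Pipe frames both ways; real code should also translate message formats.
    client.on('message', (data, isBinary) => {
      if (upstream.readyState === WebSocket.OPEN) upstream.send(data, { binary: isBinary });
    });
    upstream.on('message', (data, isBinary) => {
      if (client.readyState === WebSocket.OPEN) client.send(data, { binary: isBinary });
    });
    client.on('close', () => upstream.close());
    upstream.on('close', () => client.close());
    upstream.on('error', (e) => { console.error(e); client.close(); });
  });
});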

Step 3: Capture mic audio (Web Audio API)

We’ll downsample to 16 kHz PCM16, a common format for streaming voice (check the exact audio formats in the docs). We use ScriptProcessorNode to keep the demo short; it’s deprecated in favor of AudioWorklet (sketched after the recorder below), but it still works in current browsers.

// public/app.js (add)
class AudioRecorder {
  async start(onChunk) {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    const src = this.ctx.createMediaStreamSource(this.stream);
    const processor = this.ctx.createScriptProcessor(4096, 1, 1);
    src.connect(processor);
    processor.connect(this.ctx.destination);
    processor.onaudioprocess = (e) => {
      const input = e.inputBuffer.getChannelData(0);
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        const s = Math.max(-1, Math.min(1, input[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      }
      onChunk(pcm16.buffer);
    };
  }
  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}
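
If you want the modern path instead of the deprecated ScriptProcessorNode, here is a hedged AudioWorklet sketch. The processor name and the posted message shape are our own, not part of any Gemini SDK, and each chunk is one 128-sample render quantum (about 8 ms at 16 kHz), so you may want to batch chunks before sending.

// public/app.js (optional) — AudioWorklet variant of AudioRecorder.
// 'pcm16-capture' and the posted message shape are our own names, not part of any SDK.
class WorkletRecorder {
  async start(onChunk) {
    this.ctx = new AudioContext({ sampleRate: 16000 });
    // Define the worklet inline via a Blob URL so we don't need an extra file.
    const workletSrc = `
      class Pcm16Capture extends AudioWorkletProcessor {
        process(inputs) {
          const input = inputs[0][0];
          if (input) {
            const pcm16 = new Int16Array(input.length);
            for (let i = 0; i < input.length; i++) {
              const s = Math.max(-1, Math.min(1, input[i]));
              pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
            }
            this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
          }
          return true; // keep the processor alive
        }
      }
      registerProcessor('pcm16-capture', Pcm16Capture);
    `;
    const url = URL.createObjectURL(new Blob([workletSrc], { type: 'application/javascript' }));
    await this.ctx.audioWorklet.addModule(url);
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const src = this.ctx.createMediaStreamSource(this.stream);
    this.node = new AudioWorkletNode(this.ctx, 'pcm16-capture');
    this.node.port.onmessage = (e) => onChunk(e.data);
    src.connect(this.node);
    this.node.connect(this.ctx.destination); // outputs silence; keeps the graph pulling data
  }
  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}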

Step 4: Play Gemini’s audio reply

We’ll play raw PCM16. Your API may return Opus or other formats. Adjust decode logic to match the docs.

// public/app.js (add)
class AudioPlayer {
  constructor() {
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }
  async enqueuePcm16(arrayBuffer) {
    const pcm16 = new Int16Array(arrayBuffer);
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) float32[i] = Math.max(-1, Math.min(1, pcm16[i] / 0x7FFF));
    const buf = this.ctx.createBuffer(1, float32.length, 16000);
    buf.copyToChannel(float32, 0);
    this.queue.push(buf);
    if (!this.playing) this._drain();
  }
  _drain() {
    const next = this.queue.shift();
    if (!next) { this.playing = false; return; }
    this.playing = true;
    const src = this.ctx.createBufferSource();
    src.buffer = next;
    src.connect(this.ctx.destination);
    src.onended = () => this._drain();
    src.start();
  }
  stop() {
    this.ctx.close();
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }
}
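
If your setup returns compressed audio rather than raw PCM, and each message is a self-contained encoded clip (for example a small WAV or Ogg file), you can let the browser decode it; raw streamed Opus frames would instead need WebCodecs or a WASM decoder. A hedged sketch of an extra AudioPlayer method:

// public/app.js (optional) — add to AudioPlayer. Only works when each chunk is a
// complete encoded clip (e.g., a small WAV/Ogg file), not raw streamed frames.
async enqueueEncoded(arrayBuffer) {
  // decodeAudioData detaches the buffer, so pass a copy if you still need the original.
  const buf = await this.ctx.decodeAudioData(arrayBuffer.slice(0));
  this.queue.push(buf);
  if (!this.playing) this._drain();
}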

Step 5: Wire it together with interrupt

When the user starts speaking, stop the bot’s audio and send mic chunks. This creates a natural back-and-forth feel (barge-in).

// public/app.js (usage)
const client = new LiveClient({ url: '/live-proxy', token: '<fetched-from-server>', model: 'gemini-live' });
const recorder = new AudioRecorder();
const player = new AudioPlayer();

client.onOpen = () => console.log('Live connected');
client.onAudio = (ab) => player.enqueuePcm16(ab);
client.onText = (t) => console.log('Gemini:', t);

client.connect();

// Start talking to the model
async function startTalking() {
  // interrupt: stop model audio output before sending new input
  player.stop();
  client.stopResponse();
  await recorder.start((chunk) => client.sendAudioChunk(chunk));
}

// Stop sending audio
async function stopTalking() {
  await recorder.stop();
}

// Optional: send text too
function sendTextPrompt(text) {
  player.stop();
  client.stopResponse();
  client.sendText(text);
}
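
The table near the top lists VAD. Server-side turn detection may handle this for you, but a crude client-side energy check is enough to trigger barge-in automatically the moment you start talking. The threshold and hold time below are rough guesses, not values from the docs; tune them for your mic and room.

// public/app.js (optional) — crude energy-based VAD to trigger barge-in automatically.
// The threshold and hold time are rough guesses; tune them for your mic and room.
function createVad({ threshold = 0.02, holdMs = 300 } = {}) {
  let speaking = false;
  let lastVoiceAt = 0;
  return function onChunk(pcm16Buffer) {
    const pcm16 = new Int16Array(pcm16Buffer);
    let sum = 0;
    for (let i = 0; i < pcm16.length; i++) {
      const s = pcm16[i] / 0x8000;
      sum += s * s;
    }
    const rms = Math.sqrt(sum / pcm16.length);
    const now = Date.now();
    if (rms > threshold) {
      lastVoiceAt = now;
      if (!speaking) {
        speaking = true;
        // User just started talking: cut the bot off before streaming new audio.
        player.stop();
        client.stopResponse();
      }
    } else if (speaking && now - lastVoiceAt > holdMs) {
      speaking = false;
    }
    client.sendAudioChunk(pcm16Buffer);
  };
}

// Use it as the recorder callback instead of sending every chunk directly:
// await recorder.start(createVad());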

Test plan

  1. Open the page over HTTPS. Allow mic.
  2. Say a short question. You should hear a quick voice reply.
  3. Start talking again while the bot is speaking. The bot should pause, then answer your new question.

Troubleshooting

  • Mic not working in the browser? Check permissions and see this thread on mic issues.
  • Want to stop voice output on your device? See how to turn off voice.
  • Slow or missed wake-ups? The community has reported a Gemini Live issue on some devices. Try wired headphones and reduce background noise.
  • Session ends early? Keep inputs small. The docs note chunk-size and session-time limits on the Live API page.

Production-Readiness Checklist

  • Auth: Keep API keys server-side. Mint short-lived tokens. Enforce CORS.
  • Latency: Use small audio chunks (100–200 ms). Keep WebSocket alive with pings.
  • Interrupts: Always stop playback before sending new input. Clear audio queues.
  • Resilience: Auto-retry transient drops with backoff (see the reconnect sketch after this checklist). Recreate the AudioContext on unlock.
  • Limits: Respect session length and rate limits (see Live API docs).
  • Costs: Log session start/stop and audio seconds. Sample usage in staging.
  • Privacy: Don’t send PII unless users consent. Follow the Terms of Service.
  • Formats: Match the model’s required audio format (PCM/Opus). Test end-to-end on real devices.
  • Transcription: If needed, pair a third-party STT as Live may not expose transcripts yet. See this transcription article.
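
For the resilience item, here is a rough reconnect-with-backoff sketch; in real code you would merge these handlers with the ones from Step 5 and stop retrying once the user ends the session.

// public/app.js (optional) — rough sketch: reconnect with exponential backoff and jitter.
function connectWithBackoff(client, { baseMs = 500, maxMs = 10000 } = {}) {
  let attempt = 0;
  const retry = () => {
    const delay = Math.min(maxMs, baseMs * 2 ** attempt) * (0.5 + Math.random() / 2);
    attempt += 1;
    console.log(`Reconnecting in ${Math.round(delay)} ms`);
    setTimeout(() => client.connect(), delay);
  };
  client.onOpen = () => { attempt = 0; console.log('Live connected'); };
  client.onClose = retry; // treat any close as transient; add a "user ended" flag in real code
  client.onError = (e) => console.error('Live error', e);
  client.connect();
}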

FAQ

Is this REST?

No. It’s a stream. Unlike REST, both sides can send data anytime. For a good explainer, read this post on WebSockets and audio.

Can I count tokens?

The Live API doesn’t support CountTokens. Track usage by session time and audio seconds.
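
A tiny meter covers that: track wall-clock session time plus audio seconds sent, assuming 16 kHz mono PCM16 (32,000 bytes per second); adjust the rate if your format differs.

// public/app.js (optional) — rough usage meter: session time plus audio seconds sent.
// Assumes 16 kHz mono PCM16, i.e., 32,000 bytes of audio per second.
class UsageMeter {
  constructor({ bytesPerSecond = 16000 * 2 } = {}) {
    this.bytesPerSecond = bytesPerSecond;
    this.startedAt = Date.now();
    this.audioBytesSent = 0;
  }
  trackChunk(arrayBuffer) { this.audioBytesSent += arrayBuffer.byteLength; }
  report() {
    return {
      sessionSeconds: Math.round((Date.now() - this.startedAt) / 1000),
      audioSecondsSent: Math.round(this.audioBytesSent / this.bytesPerSecond),
    };
  }
}

// Example: wrap the chunk callback.
// const meter = new UsageMeter();
// await recorder.start((chunk) => { meter.trackChunk(chunk); client.sendAudioChunk(chunk); });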

Does it do transcription?

Live may not expose transcripts yet. If you need text, add an STT layer. You can build a small service to send mic audio to STT, then forward text to Gemini.

Which audio format?

Check the docs for supported formats and sample rates. 16 kHz mono PCM16 is a safe starting point. Some setups return Opus. Adjust your decoder.

Why Live beats simple REST for voice

Feature | Live API | REST
--- | --- | ---
Turn-taking | Natural, interruptible | Strict request-response
Latency | Low, chunked | Higher, full request needed
Audio I/O | Built for streams | Extra steps

Next steps

  • Prototype prompts in Vertex AI Studio and review the Live guide.
  • Add barge-in polish: stop TTS, send new input fast, then resume playback.
  • Ship with the checklist above. Start small, measure latency, then iterate.

Recap

You set up a live connection, streamed mic audio, got voice replies, and added interrupt. With a few guards and tests, this can go to production. Keep docs handy: Firebase Live API and Vertex AI Live.
