
Gemini Live API Tutorial: Build a Voice AI

Build a real-time voice assistant with Gemini Live API. Follow 5 steps, copy code, and ship with a handy production checklist.

You’ll build a small, real-time voice assistant in your browser. We’ll stream your mic to Gemini and play voice replies back with low delay. You can also interrupt the bot mid-sentence. We’ll keep it simple and focused.

What you’ll build (in 5 steps)

  1. Project setup: a static page served over HTTPS plus a small server that keeps your API key.
  2. Open a Live session over a WebSocket-style connection.
  3. Capture mic audio with the Web Audio API and stream it as 16 kHz PCM16.
  4. Play Gemini’s audio replies as they stream in.
  5. Wire it together with interrupt (barge-in) so you can cut the bot off mid-sentence.

Quick notes and limits

  • Endpoints, auth, and audio formats can change; always check the current Live API docs.
  • Sessions have time limits, and the Live API doesn’t support CountTokens.
  • Live may not expose transcripts yet; pair an STT service if you need text.

How it works (simple view)

Part | What it does | Why it matters
--- | --- | ---
Live session | Opens a bidirectional stream | Both sides can talk and interrupt
Web Audio API | Grabs mic and plays audio | Low delay, runs in the browser
WebSocket/gRPC | Sends and receives chunks | Real-time flow (not request-response)
VAD | Auto-detects when you speak | Cleaner turns

Prerequisites

  • Google Cloud project with Gemini access. See Vertex AI setup.
  • API key or auth token (server-side recommended).
  • Browser with mic permission, Node.js 18+ for local dev.
  • Headphones help prevent echo.

Step 1: Project setup

  1. Create a new folder and a simple static page that serves over HTTPS (for mic access). You can use a tiny Node server or your favorite dev server.
  2. Store your API key on the server. Don’t expose secrets in the browser.

Minimal file layout

// project
//   public/index.html
//   public/app.js
//   server.js (proxies auth)
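
server.js can stay tiny. Here is a minimal sketch, assuming Express; the /token route and its response shape are placeholders for whatever short-lived credential flow you use. Note that localhost counts as a secure context, so mic access works there without HTTPS.

// server.js — minimal sketch, assuming Express.
// The /token route and its response shape are placeholders for your own auth flow.
const express = require('express');
const path = require('path');

const app = express();
app.use(express.static(path.join(__dirname, 'public')));

// Hand the browser a short-lived token instead of your real API key.
app.get('/token', async (req, res) => {
  // TODO: mint a short-lived token here (e.g., a signed JWT or a cloud SDK call).
  res.json({ token: 'short-lived-token', expiresIn: 300 });
});

// Keep a reference to the HTTP server so the Live proxy in Step 2 can hook into it.
const server = app.listen(3000, () => console.log('Dev server on http://localhost:3000'));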

Step 2: Open a Live session

We’ll show a simple WebSocket-style client. Endpoints and headers can change. Always check the official Firebase Live API or Vertex guide for the current URL, auth, and audio formats.

// public/app.js
class LiveClient {
  constructor({ url, token, model }) {
    this.url = url; // e.g., your server proxy that signs and connects
    this.token = token; // fetched from your server
    this.model = model; // e.g., "gemini-2.0" variant
    this.ws = null;
    this.onAudio = () => {};
    this.onText = () => {};
    this.onOpen = () => {};
    this.onClose = () => {};
    this.onError = (e) => console.error(e);
  }
  connect() {
    this.ws = new WebSocket(`${this.url}?model=${encodeURIComponent(this.model)}&token=${encodeURIComponent(this.token)}`);
    this.ws.binaryType = 'arraybuffer';
    this.ws.onopen = () => this.onOpen();
    this.ws.onclose = () => this.onClose();
    this.ws.onerror = (e) => this.onError(e);
    this.ws.onmessage = (evt) => {
      // Example: audio chunks as ArrayBuffer, text as JSON
      if (evt.data instanceof ArrayBuffer) {
        this.onAudio(evt.data);
      } else {
        try {
          const msg = JSON.parse(evt.data);
          if (msg.type === 'text') this.onText(msg.text);
        } catch {
          // Ignore frames that are not JSON
        }
      }
    };
  }
  sendText(text) {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(JSON.stringify({ type: 'input_text', text }));
  }
  sendAudioChunk(arrayBuffer) {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(arrayBuffer);
  }
  stopResponse() {
    if (!this.ws || this.ws.readyState !== 1) return;
    this.ws.send(JSON.stringify({ type: 'stop_output' })); // check docs for exact signal
  }
  close() {
    this.ws?.close(1000, 'client_done');
  }
}
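
The client above points at /live-proxy on your own server. Below is a rough proxy sketch using the `ws` package on top of the `server` from the Step 1 sketch. The upstream URL, auth header, and message framing are placeholders, not the real Live API schema, so take the actual values from the docs.

// server.js (add) — rough WebSocket proxy sketch using the `ws` package.
// LIVE_API_URL, the Authorization header, and the framing are placeholders;
// use the real endpoint and message schema from the Live API docs.
const { WebSocketServer, WebSocket } = require('ws');

const wss = new WebSocketServer({ noServer: true });

server.on('upgrade', (req, socket, head) => {
  if (!req.url.startsWith('/live-proxy')) return socket.destroy();
  wss.handleUpgrade(req, socket, head, (client) => {
    // Open the upstream connection with credentials kept server-side.
    const upstream = new WebSocket(process.env.LIVE_API_URL, {
      headers: { Authorization: `Bearer ${process.env.API_KEY}` },
    });
    // Pipe frames both ways; real code should also translate message formats.
    client.on('message', (data, isBinary) => {
      if (upstream.readyState === WebSocket.OPEN) upstream.send(data, { binary: isBinary });
    });
    upstream.on('message', (data, isBinary) => {
      if (client.readyState === WebSocket.OPEN) client.send(data, { binary: isBinary });
    });
    client.on('close', () => upstream.close());
    upstream.on('close', () => client.close());
    upstream.on('error', (e) => { console.error(e); client.close(); });
  });
});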

Step 3: Capture mic audio (Web Audio API)

We’ll downsample to 16 kHz PCM16, a common format for streaming voice (check the exact audio formats in the docs). We use ScriptProcessorNode to keep the demo short; it’s deprecated in favor of AudioWorklet (sketched after the recorder below), but it still works in current browsers.

// public/app.js (add)
class AudioRecorder {
  async start(onChunk) {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    const src = this.ctx.createMediaStreamSource(this.stream);
    const processor = this.ctx.createScriptProcessor(4096, 1, 1);
    src.connect(processor);
    processor.connect(this.ctx.destination);
    processor.onaudioprocess = (e) => {
      const input = e.inputBuffer.getChannelData(0);
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        const s = Math.max(-1, Math.min(1, input[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
      }
      onChunk(pcm16.buffer);
    };
  }
  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}
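
If you want the modern path instead of the deprecated ScriptProcessorNode, here is a hedged AudioWorklet sketch. The processor name and the posted message shape are our own, not part of any Gemini SDK, and each chunk is one 128-sample render quantum (about 8 ms at 16 kHz), so you may want to batch chunks before sending.

// public/app.js (optional) — AudioWorklet variant of AudioRecorder.
// 'pcm16-capture' and the posted message shape are our own names, not part of any SDK.
class WorkletRecorder {
  async start(onChunk) {
    this.ctx = new AudioContext({ sampleRate: 16000 });
    // Define the worklet inline via a Blob URL so we don't need an extra file.
    const workletSrc = `
      class Pcm16Capture extends AudioWorkletProcessor {
        process(inputs) {
          const input = inputs[0][0];
          if (input) {
            const pcm16 = new Int16Array(input.length);
            for (let i = 0; i < input.length; i++) {
              const s = Math.max(-1, Math.min(1, input[i]));
              pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
            }
            this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
          }
          return true; // keep the processor alive
        }
      }
      registerProcessor('pcm16-capture', Pcm16Capture);
    `;
    const url = URL.createObjectURL(new Blob([workletSrc], { type: 'application/javascript' }));
    await this.ctx.audioWorklet.addModule(url);
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const src = this.ctx.createMediaStreamSource(this.stream);
    this.node = new AudioWorkletNode(this.ctx, 'pcm16-capture');
    this.node.port.onmessage = (e) => onChunk(e.data);
    src.connect(this.node);
    this.node.connect(this.ctx.destination); // outputs silence; keeps the graph pulling data
  }
  async stop() {
    this.stream?.getTracks().forEach(t => t.stop());
    await this.ctx?.close();
  }
}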

Step 4: Play Gemini’s audio reply

We’ll play raw PCM16. Your API may return Opus or other formats. Adjust decode logic to match the docs.

// public/app.js (add)
class AudioPlayer {
  constructor() {
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }
  async enqueuePcm16(arrayBuffer) {
    const pcm16 = new Int16Array(arrayBuffer);
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) float32[i] = Math.max(-1, Math.min(1, pcm16[i] / 0x7FFF));
    const buf = this.ctx.createBuffer(1, float32.length, 16000);
    buf.copyToChannel(float32, 0);
    this.queue.push(buf);
    if (!this.playing) this._drain();
  }
  _drain() {
    const next = this.queue.shift();
    if (!next) { this.playing = false; return; }
    this.playing = true;
    const src = this.ctx.createBufferSource();
    src.buffer = next;
    src.connect(this.ctx.destination);
    src.onended = () => this._drain();
    src.start();
  }
  stop() {
    this.ctx.close();
    this.ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
    this.queue = [];
    this.playing = false;
  }
}
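
If your setup returns compressed audio rather than raw PCM, and each message is a self-contained encoded clip (for example a small WAV or Ogg file), you can let the browser decode it; raw streamed Opus frames would instead need WebCodecs or a WASM decoder. A hedged sketch of an extra AudioPlayer method:

// public/app.js (optional) — add to AudioPlayer. Only works when each chunk is a
// complete encoded clip (e.g., a small WAV/Ogg file), not raw streamed frames.
async enqueueEncoded(arrayBuffer) {
  // decodeAudioData detaches the buffer, so pass a copy if you still need the original.
  const buf = await this.ctx.decodeAudioData(arrayBuffer.slice(0));
  this.queue.push(buf);
  if (!this.playing) this._drain();
}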

Step 5: Wire it together with interrupt

When the user starts speaking, stop the bot’s audio and send mic chunks. This creates a natural back-and-forth feel (barge-in).

// public/app.js (usage)
const client = new LiveClient({ url: '/live-proxy', token: '<fetched-from-server>', model: 'gemini-live' });
const recorder = new AudioRecorder();
const player = new AudioPlayer();

client.onOpen = () => console.log('Live connected');
client.onAudio = (ab) => player.enqueuePcm16(ab);
client.onText = (t) => console.log('Gemini:', t);

client.connect();

// Start talking to the model
async function startTalking() {
  // interrupt: stop model audio output before sending new input
  player.stop();
  client.stopResponse();
  await recorder.start((chunk) => client.sendAudioChunk(chunk));
}

// Stop sending audio
async function stopTalking() {
  await recorder.stop();
}

// Optional: send text too
function sendTextPrompt(text) {
  player.stop();
  client.stopResponse();
  client.sendText(text);
}
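
The table near the top lists VAD. Server-side turn detection may handle this for you, but a crude client-side energy check is enough to trigger barge-in automatically the moment you start talking. The threshold and hold time below are rough guesses, not values from the docs; tune them for your mic and room.

// public/app.js (optional) — crude energy-based VAD to trigger barge-in automatically.
// The threshold and hold time are rough guesses; tune them for your mic and room.
function createVad({ threshold = 0.02, holdMs = 300 } = {}) {
  let speaking = false;
  let lastVoiceAt = 0;
  return function onChunk(pcm16Buffer) {
    const pcm16 = new Int16Array(pcm16Buffer);
    let sum = 0;
    for (let i = 0; i < pcm16.length; i++) {
      const s = pcm16[i] / 0x8000;
      sum += s * s;
    }
    const rms = Math.sqrt(sum / pcm16.length);
    const now = Date.now();
    if (rms > threshold) {
      lastVoiceAt = now;
      if (!speaking) {
        speaking = true;
        // User just started talking: cut the bot off before streaming new audio.
        player.stop();
        client.stopResponse();
      }
    } else if (speaking && now - lastVoiceAt > holdMs) {
      speaking = false;
    }
    client.sendAudioChunk(pcm16Buffer);
  };
}

// Use it as the recorder callback instead of sending every chunk directly:
// await recorder.start(createVad());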

Test plan

  1. Open the page over HTTPS. Allow mic.
  2. Say a short question. You should hear a quick voice reply.
  3. Start talking again while the bot is speaking. The bot should pause, then answer your new question.

Troubleshooting

  • Mic not working in the browser? Check permissions and see this thread on mic issues.
  • Want to stop voice output on your device? See how to turn off voice.
  • Slow or missed wake-ups? The community has reported a Gemini Live issue on some devices. Try wired headphones and reduce background noise.
  • Session ends early? Keep inputs small. The docs note chunk-size and session-time limits on the Live API page.

Production-Readiness Checklist

  • Auth: Keep API keys server-side. Mint short-lived tokens. Enforce CORS.
  • Latency: Use small audio chunks (100–200 ms). Keep WebSocket alive with pings.
  • Interrupts: Always stop playback before sending new input. Clear audio queues.
  • Resilience: Auto-retry transient drops with backoff (see the reconnect sketch after this checklist). Recreate the AudioContext on unlock.
  • Limits: Respect session length and rate limits (see Live API docs).
  • Costs: Log session start/stop and audio seconds. Sample usage in staging.
  • Privacy: Don’t send PII unless users consent. Follow the Terms of Service.
  • Formats: Match the model’s required audio format (PCM/Opus). Test end-to-end on real devices.
  • Transcription: If needed, pair a third-party STT as Live may not expose transcripts yet. See this transcription article.
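
For the resilience item, here is a rough reconnect-with-backoff sketch; in real code you would merge these handlers with the ones from Step 5 and stop retrying once the user ends the session.

// public/app.js (optional) — rough sketch: reconnect with exponential backoff and jitter.
function connectWithBackoff(client, { baseMs = 500, maxMs = 10000 } = {}) {
  let attempt = 0;
  const retry = () => {
    const delay = Math.min(maxMs, baseMs * 2 ** attempt) * (0.5 + Math.random() / 2);
    attempt += 1;
    console.log(`Reconnecting in ${Math.round(delay)} ms`);
    setTimeout(() => client.connect(), delay);
  };
  client.onOpen = () => { attempt = 0; console.log('Live connected'); };
  client.onClose = retry; // treat any close as transient; add a "user ended" flag in real code
  client.onError = (e) => console.error('Live error', e);
  client.connect();
}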

FAQ

Is this REST?

No. It’s a stream. Unlike REST, both sides can send data anytime. For a good explainer, read this post on WebSockets and audio.

Can I count tokens?

The Live API doesn’t support CountTokens. Track usage by session time and audio seconds.
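
A tiny meter covers that: track wall-clock session time plus audio seconds sent, assuming 16 kHz mono PCM16 (32,000 bytes per second); adjust the rate if your format differs.

// public/app.js (optional) — rough usage meter: session time plus audio seconds sent.
// Assumes 16 kHz mono PCM16, i.e., 32,000 bytes of audio per second.
class UsageMeter {
  constructor({ bytesPerSecond = 16000 * 2 } = {}) {
    this.bytesPerSecond = bytesPerSecond;
    this.startedAt = Date.now();
    this.audioBytesSent = 0;
  }
  trackChunk(arrayBuffer) { this.audioBytesSent += arrayBuffer.byteLength; }
  report() {
    return {
      sessionSeconds: Math.round((Date.now() - this.startedAt) / 1000),
      audioSecondsSent: Math.round(this.audioBytesSent / this.bytesPerSecond),
    };
  }
}

// Example: wrap the chunk callback.
// const meter = new UsageMeter();
// await recorder.start((chunk) => { meter.trackChunk(chunk); client.sendAudioChunk(chunk); });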

Does it do transcription?

Live may not expose transcripts yet. If you need text, add an STT layer. You can build a small service to send mic audio to STT, then forward text to Gemini.

Which audio format?

Check the docs for supported formats and sample rates. 16 kHz mono PCM16 is a safe starting point. Some setups return Opus. Adjust your decoder.

Why Live beats simple REST for voice

Feature | Live API | REST
--- | --- | ---
Turn-taking | Natural, interruptible | Strict request-response
Latency | Low, chunked | Higher, full request needed
Audio I/O | Built for streams | Extra steps

Next steps

  • Prototype prompts in Vertex AI Studio and review the Live guide.
  • Add barge-in polish: stop TTS, send new input fast, then resume playback.
  • Ship with the checklist above. Start small, measure latency, then iterate.

Recap

You set up a live connection, streamed mic audio, got voice replies, and added interrupt. With a few guards and tests, this can go to production. Keep docs handy: Firebase Live API and Vertex AI Live.
