Language learning apps often focus on vocabulary drills and flashcards. But real language mastery happens in conversation, in situations like ordering coffee, asking for directions, or buying a bus ticket.

That idea inspired Immergo — an immersive AI language learning platform powered by Google’s Gemini Live API that simulates real-world conversations using real-time voice interaction.

In this article, I’ll walk through how we built Immergo, the architecture behind it, and what we learned during development.


Inspiration

Most language learners struggle with one major problem:

They understand the language but freeze when it’s time to speak.

Traditional apps don’t simulate the pressure and unpredictability of real conversations.

We wanted to create a system where learners could practice with an AI that behaves like a real person in a real scenario — a bus driver, a shopkeeper, or a friendly neighbor.

Using Gemini Live, we realized we could build something closer to real-life immersion than traditional language apps.


What Immergo Does

Immergo is a real-time conversational language simulator.

Instead of memorizing vocabulary, users participate in interactive missions where they must speak the target language to progress.

Core Features

1. Missions & Roleplay

Users choose structured scenarios like:

  • Buying a bus ticket

  • Ordering food

  • Asking for directions

  • Meeting a neighbor

Each mission has a clear objective the learner must accomplish.

The AI adopts a persona appropriate to the scenario.

Examples:

  • Bus Driver

  • Café Barista

  • Hotel Receptionist

  • Friendly Neighbor


2. Two Learning Modes

Teacher Mode

The AI acts as a helpful tutor.

It can:

  • Translate phrases

  • Explain grammar

  • Suggest better sentences

  • Provide hints in the learner’s native language

Perfect for beginners.


Immersive Mode

This is where Immergo becomes powerful.

In immersive mode:

  • The AI refuses to switch languages

  • The learner must speak the target language

  • The scenario continues only when the learner communicates successfully

We call this the “No Free Rides” rule.


3. Native Language Support

Users can select their native language, allowing the AI to provide contextual help and explanations when needed.

This makes learning accessible even for beginners.


4. Real-Time Voice Conversations

Users speak naturally into their microphone.

The AI responds with low-latency audio, creating a conversation that feels natural.

This is powered by Gemini Live streaming.


5. Performance Scoring

After completing a mission, Immergo evaluates the learner.

We introduced three fluency levels:

  • Tiro – Beginner

  • Proficiens – Intermediate

  • Peritus – Advanced

The AI provides actionable feedback to improve pronunciation, fluency, and vocabulary.


Tech Stack

Immergo is built with a lightweight but powerful stack.

Frontend

  • Vanilla JavaScript

  • Vite

  • Web Audio API

  • WebSocket

Backend

  • Python

  • FastAPI

  • Google GenAI SDK

AI Model

  • Gemini Live via Vertex AI

Communication

  • WebSockets for full-duplex audio streaming


System Architecture

The architecture is built around real-time streaming.

User Microphone
       │
       ▼
Web Audio API
       │
       ▼
WebSocket Stream
       │
       ▼
FastAPI Backend
       │
       ▼
Gemini Live API
       │
       ▼
AI Response (Audio + Text)
       │
       ▼
Browser Audio Playback

Key principle:

Audio flows continuously between the user and the AI.

This allows conversations to feel instant and natural.
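The pipeline above boils down to audio chunks moving between stages through async queues. Here is a minimal stdlib-only sketch of that relay pattern; the queue names and the `relay` helper are illustrative, not taken from the Immergo codebase:

```python
import asyncio


async def relay(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward audio chunks to the next pipeline stage until a None sentinel arrives."""
    while True:
        chunk = await src.get()
        if chunk is None:          # sentinel: upstream closed
            await dst.put(None)
            return
        await dst.put(chunk)


async def demo() -> list:
    mic_to_backend = asyncio.Queue()    # WebSocket stream from the browser
    backend_to_model = asyncio.Queue()  # audio forwarded to Gemini Live
    task = asyncio.create_task(relay(mic_to_backend, backend_to_model))
    for chunk in (b"pcm-1", b"pcm-2", None):
        await mic_to_backend.put(chunk)
    await task
    return [backend_to_model.get_nowait() for _ in range(backend_to_model.qsize())]
```

Because each stage only ever touches its own queues, stages can run concurrently and audio keeps flowing in both directions.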


The Core: Gemini Live Integration

The heart of Immergo is the GeminiLive session manager.

It connects the backend to the Gemini Live streaming interface and handles:

  • Audio streaming

  • Transcriptions

  • Tool calls

  • AI responses

The connection is established using the async Gemini client.

async with self.client.aio.live.connect(model=self.model, config=config) as session:

Once connected, we create async tasks that handle the data streams.
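The shape of those tasks looks roughly like this. The sketch below substitutes a `FakeSession` stub for the real Gemini Live session so it can run standalone; the send/receive loop structure is the part that mirrors Immergo:

```python
import asyncio


class FakeSession:
    """Stand-in for a live session: echoes each sent chunk back as a response.
    (Illustrative stub; the real object comes from client.aio.live.connect.)"""
    def __init__(self):
        self._responses = asyncio.Queue()

    async def send(self, chunk):
        await self._responses.put(b"echo:" + chunk)

    async def receive(self):
        while True:
            yield await self._responses.get()


async def run_session(session, audio_in: asyncio.Queue, audio_out: asyncio.Queue):
    """Run the send and receive loops concurrently over one session."""
    async def send_loop():
        while (chunk := await audio_in.get()) is not None:
            await session.send(chunk)

    async def receive_loop():
        async for response in session.receive():
            await audio_out.put(response)

    recv = asyncio.create_task(receive_loop())
    await send_loop()
    await asyncio.sleep(0)  # let pending responses drain
    recv.cancel()


async def demo():
    audio_in, audio_out = asyncio.Queue(), asyncio.Queue()
    for chunk in (b"hi", b"there", None):
        await audio_in.put(chunk)
    await run_session(FakeSession(), audio_in, audio_out)
    return [audio_out.get_nowait() for _ in range(audio_out.qsize())]
```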


Streaming User Audio to Gemini

Audio from the browser arrives through a WebSocket and is placed into a queue.

The backend continuously streams this audio to Gemini.

async def send_audio():
    while True:
        chunk = await audio_input_queue.get()
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type=f"audio/pcm;rate={self.input_sample_rate}")
        )

This enables real-time speech recognition and understanding.


Receiving AI Audio Responses

Gemini Live sends audio responses in chunks.

Our system forwards these chunks back to the browser for playback.

async for response in session.receive():
    for part in response.server_content.model_turn.parts:
        if part.inline_data:
            await audio_output_callback(part.inline_data.data)

This creates a fluid voice conversation between the user and the AI persona.


Real-Time Transcriptions

To help learners understand the conversation, we enable both:

  • Input transcription (what the user said)

  • Output transcription (what the AI said)

config_args["output_audio_transcription"] = types.AudioTranscriptionConfig()
config_args["input_audio_transcription"] = types.AudioTranscriptionConfig()

These transcriptions allow us to:

  • Display subtitles

  • Analyze fluency

  • Provide feedback
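Live transcriptions arrive in fragments rather than whole sentences, so the UI buffers them until a turn completes before rendering a subtitle. A small sketch of that buffering (the `TranscriptBuffer` class is ours for illustration, not part of the SDK):

```python
class TranscriptBuffer:
    """Accumulate incremental transcription fragments into subtitle lines."""

    def __init__(self):
        self.parts = {"input": [], "output": []}

    def add(self, role: str, text: str) -> None:
        """role is 'input' (user speech) or 'output' (AI speech)."""
        self.parts[role].append(text)

    def flush(self, role: str) -> str:
        """Join buffered fragments into one subtitle line and reset the buffer."""
        speaker = "You" if role == "input" else "AI"
        line = f"[{speaker}] {''.join(self.parts[role]).strip()}"
        self.parts[role] = []
        return line
```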


AI Personas

Each mission configures Gemini with a system instruction.

Example:

You are a friendly bus driver in Madrid.
The user wants to buy a ticket.
Only speak Spanish.
Keep responses short and natural.

This gives the AI a consistent personality and context.
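Assembling that instruction from mission data can be as simple as a template function. A sketch, assuming a mission record with role, city, goal, and target-language fields (the field names are illustrative):

```python
def build_persona_prompt(role: str, city: str, goal: str, target_language: str) -> str:
    """Assemble a mission system instruction from its parts."""
    return (
        f"You are a {role} in {city}.\n"
        f"The user wants to {goal}.\n"
        f"Only speak {target_language}.\n"
        "Keep responses short and natural."
    )
```

The resulting string is passed as the session's system instruction, so every mission gets the same consistent prompt structure.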


Tool Calling for Game Logic

We also implemented function calling so the AI can interact with the mission engine.

Example tools:

  • complete_mission()

  • give_hint()

  • evaluate_fluency()

The backend registers tools like this:

def register_tool(self, func: Callable):
    self.tool_mapping[func.__name__] = func

When Gemini calls a function, the backend executes it and returns the result.
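The dispatch step is a dictionary lookup over the registered tools. A self-contained sketch (the `ToolRegistry` name and `dispatch` method are illustrative; the real handler also sends a function response back over the live session):

```python
from typing import Any, Callable


class ToolRegistry:
    """Map tool names to Python functions and dispatch incoming calls."""

    def __init__(self):
        self.tool_mapping: dict[str, Callable] = {}

    def register_tool(self, func: Callable) -> None:
        self.tool_mapping[func.__name__] = func

    def dispatch(self, name: str, args: dict) -> Any:
        """Execute the tool the model requested and return its result."""
        return self.tool_mapping[name](**args)
```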


Handling Interruptions

Natural conversations include interruptions.

Gemini Live signals when the user interrupts the AI.

if server_content.interrupted:
    await event_queue.put({"type": "interrupted"})

The client then stops playback and listens to the user.

This makes the experience feel much more human.
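On an interruption, any AI audio still sitting in the playback queue is stale and must be discarded before listening resumes. A minimal sketch of that drain step (the `drain` helper is illustrative):

```python
import asyncio


def drain(queue: asyncio.Queue) -> int:
    """Drop all queued, not-yet-played audio chunks after an interruption.
    Returns how many chunks were discarded."""
    dropped = 0
    while not queue.empty():
        queue.get_nowait()
        dropped += 1
    return dropped
```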


Challenges We Ran Into

1. Real-Time Audio Latency

Streaming audio while maintaining low latency was tricky.

Solutions included:

  • Smaller audio chunks

  • Efficient WebSocket streaming

  • Async queues


2. Interrupt Handling

Users often speak over the AI.

We had to ensure the system could:

  • Stop playback

  • Immediately switch back to listening


3. Persona Consistency

Without strict instructions, the AI sometimes broke character.

We solved this using structured system prompts and strict mission constraints.


Accomplishments We're Proud Of

We successfully built:

  • A real-time AI conversation engine

  • A roleplay-based language learning system

  • A low-latency voice AI experience

  • A mission-driven learning structure

The biggest achievement was making the AI feel like a real conversational partner.


What We Learned

Building Immergo taught us several lessons.

Voice AI changes everything

When AI speaks instead of typing, the experience becomes far more engaging.


Context matters

Roleplay scenarios dramatically improve language retention.

Learners remember phrases better when they are tied to real situations.


Gemini Live is powerful

The streaming capabilities make it possible to build interactive AI experiences that feel natural.


What’s Next for Immergo

We’re excited about the future roadmap.

Upcoming features include:

  • Multiplayer roleplay scenarios

  • Visual environments (VR / AR)

  • Adaptive difficulty missions

  • Pronunciation scoring

  • AI-driven learning paths

Ultimately, our goal is simple:

Make language learning feel like living in the language.

#GeminiLiveAgentChallenge