Language learning apps often focus on vocabulary drills and flashcards. But real language mastery happens in conversation, in situations like ordering coffee, asking for directions, or buying a bus ticket.
That idea inspired Immergo — an immersive AI language learning platform powered by Google’s Gemini Live API that simulates real-world conversations using real-time voice interaction.
In this article, I’ll walk through how we built Immergo, the architecture behind it, and what we learned during development.
Inspiration
Most language learners struggle with one major problem:
They understand the language but freeze when it’s time to speak.
Traditional apps don’t simulate the pressure and unpredictability of real conversations.
We wanted to create a system where learners could practice with an AI that behaves like a real person in a real scenario — a bus driver, a shopkeeper, or a friendly neighbor.
Using Gemini Live, we realized we could build something closer to real-life immersion than traditional language apps.
What Immergo Does
Immergo is a real-time conversational language simulator.
Instead of memorizing vocabulary, users participate in interactive missions where they must speak the target language to progress.
Core Features
1. Missions & Roleplay
Users choose structured scenarios like:
Buying a bus ticket
Ordering food
Asking for directions
Meeting a neighbor
Each mission has a clear objective the learner must accomplish.
The AI adopts a persona appropriate to the scenario.
Examples:
Bus Driver
Café Barista
Hotel Receptionist
Friendly Neighbor
2. Two Learning Modes
Teacher Mode
The AI acts as a helpful tutor.
It can:
Translate phrases
Explain grammar
Suggest better sentences
Provide hints in the learner’s native language
Perfect for beginners.
Immersive Mode
This is where Immergo becomes powerful.
In immersive mode:
The AI refuses to switch languages
The learner must speak the target language
The scenario continues only when the learner communicates successfully
We call this the “No Free Rides” rule.
3. Native Language Support
Users can select their native language, allowing the AI to provide contextual help and explanations when needed.
This makes learning accessible even for beginners.
4. Real-Time Voice Conversations
Users speak naturally into their microphone.
The AI responds with low-latency streaming audio, so the exchange feels like a natural back-and-forth.
This is powered by Gemini Live streaming.
5. Performance Scoring
After completing a mission, Immergo evaluates the learner.
We introduced three fluency levels:
Tiro – Beginner
Proficiens – Intermediate
Peritus – Advanced
The AI provides actionable feedback to improve pronunciation, fluency, and vocabulary.
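As an illustration, the three levels can be mapped from a numeric score. The `fluency_level` helper and its thresholds below are hypothetical, not Immergo's actual rubric:

```python
# Hypothetical mapping from a raw fluency score (0-100) to Immergo's
# three levels; the thresholds are illustrative, not the real ones.
def fluency_level(score: int) -> str:
    if score >= 80:
        return "Peritus"      # Advanced
    if score >= 50:
        return "Proficiens"   # Intermediate
    return "Tiro"             # Beginner
```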
Tech Stack
Immergo is built with a lightweight but powerful stack.
Frontend
Vanilla JavaScript
Vite
Web Audio API
WebSocket
Backend
Python
FastAPI
Google GenAI SDK
AI Model
Gemini Live via Vertex AI
Communication
WebSockets for full-duplex audio streaming
System Architecture
The architecture is built around real-time streaming.
User Microphone
│
▼
Web Audio API
│
▼
WebSocket Stream
│
▼
FastAPI Backend
│
▼
Gemini Live API
│
▼
AI Response (Audio + Text)
│
▼
Browser Audio Playback
Key principle:
Audio flows continuously between the user and the AI.
This allows conversations to feel instant and natural.
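The pipeline above is essentially two relay loops: one carrying user audio toward Gemini, one carrying AI audio back. A minimal sketch using plain `asyncio` queues to stand in for the WebSocket and Gemini streams (all names and the `None` end-of-stream sentinel are illustrative):

```python
import asyncio

async def relay(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward audio chunks until a None sentinel marks end-of-stream."""
    while True:
        chunk = await source.get()
        await sink.put(chunk)
        if chunk is None:
            return

async def demo() -> list:
    # upstream stands in for the browser WebSocket, downstream for Gemini Live.
    upstream, downstream = asyncio.Queue(), asyncio.Queue()
    for pcm in (b"\x00\x01", b"\x02\x03", None):
        upstream.put_nowait(pcm)
    await relay(upstream, downstream)
    received = []
    while not downstream.empty():
        received.append(downstream.get_nowait())
    return received
```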
The Core: Gemini Live Integration
The heart of Immergo is the GeminiLive session manager.
It connects the backend to the Gemini Live streaming interface and handles:
Audio streaming
Transcriptions
Tool calls
AI responses
The connection is established using the async Gemini client.
```python
async with self.client.aio.live.connect(model=self.model, config=config) as session:
    ...
```
Once connected, we create async tasks that handle the data streams.
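A simplified sketch of that task structure, with placeholder loop bodies standing in for the real send-audio and receive-response logic:

```python
import asyncio

async def run_session() -> list:
    # In Immergo, these would be the send-audio and receive-response loops
    # running concurrently over one Gemini Live session (simplified here).
    results = []

    async def send_loop():
        results.append("sent")       # placeholder for streaming mic audio

    async def receive_loop():
        results.append("received")   # placeholder for handling AI responses

    await asyncio.gather(send_loop(), receive_loop())
    return results
```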
Streaming User Audio to Gemini
Audio from the browser arrives through a WebSocket and is placed into a queue.
The backend continuously streams this audio to Gemini.
```python
async def send_audio():
    while True:
        chunk = await audio_input_queue.get()
        await session.send_realtime_input(
            audio=types.Blob(
                data=chunk,
                mime_type=f"audio/pcm;rate={self.input_sample_rate}",
            )
        )
```
This enables real-time speech recognition and understanding.
Receiving AI Audio Responses
Gemini Live sends audio responses in chunks.
Our system forwards these chunks back to the browser for playback.
```python
if part.inline_data:
    await audio_output_callback(part.inline_data.data)
```
This creates a fluid voice conversation between the user and the AI persona.
Real-Time Transcriptions
To help learners understand the conversation, we enable both:
Input transcription (what the user said)
Output transcription (what the AI said)
```python
config_args["output_audio_transcription"] = types.AudioTranscriptionConfig()
config_args["input_audio_transcription"] = types.AudioTranscriptionConfig()
```
These transcriptions allow us to:
Display subtitles
Analyze fluency
Provide feedback
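A sketch of how the backend might route those transcription events to subtitle consumers. The event dict shape here is illustrative, not the actual Gemini Live message format:

```python
# Hypothetical event shapes; the real Gemini Live messages differ, but the
# routing idea is the same: tag each transcript with its speaker.
def route_transcription(event: dict, subtitles: list) -> None:
    if "input_transcription" in event:
        subtitles.append(("user", event["input_transcription"]))
    if "output_transcription" in event:
        subtitles.append(("ai", event["output_transcription"]))
```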
AI Personas
Each mission configures Gemini with a system instruction.
Example:
You are a friendly bus driver in Madrid.
The user wants to buy a ticket.
Only speak Spanish.
Keep responses short and natural.
This gives the AI a consistent personality and context.
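A persona prompt like that can be assembled from mission data. `build_persona_prompt` is a hypothetical helper showing the template idea, not Immergo's actual code:

```python
def build_persona_prompt(role: str, city: str, language: str, goal: str) -> str:
    # Illustrative template; mission-specific details are filled in per scenario.
    return (
        f"You are a friendly {role} in {city}. "
        f"The user wants to {goal}. "
        f"Only speak {language}. "
        "Keep responses short and natural."
    )
```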
Tool Calling for Game Logic
We also implemented function calling so the AI can interact with the mission engine.
Example tools:
complete_mission()
give_hint()
evaluate_fluency()
The backend registers tools like this:
```python
def register_tool(self, func: Callable):
    self.tool_mapping[func.__name__] = func
```
When Gemini calls a function, the backend executes it and returns the result.
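A minimal sketch of that registration-and-dispatch pattern. The `ToolRegistry` class and its `dispatch` method are illustrative wrappers; only the `register_tool` logic comes from Immergo's code:

```python
from typing import Callable

class ToolRegistry:
    def __init__(self) -> None:
        self.tool_mapping: dict[str, Callable] = {}

    def register_tool(self, func: Callable) -> None:
        self.tool_mapping[func.__name__] = func

    def dispatch(self, name: str, **kwargs):
        # Execute the tool Gemini asked for and return its result,
        # which the backend then sends back as the function response.
        return self.tool_mapping[name](**kwargs)
```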
Handling Interruptions
Natural conversations include interruptions.
Gemini Live signals when the user interrupts the AI.
```python
if server_content.interrupted:
    await event_queue.put({"type": "interrupted"})
```
The client then stops playback and listens to the user.
This makes the experience feel much more human.
Challenges We Ran Into
1. Real-Time Audio Latency
Streaming audio while maintaining low latency was tricky.
Solutions included:
Smaller audio chunks
Efficient WebSocket streaming
Async queues
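One illustrative tactic for keeping latency bounded is to drop the oldest queued chunk when the consumer falls behind. `put_latest` below is a sketch of that idea, not necessarily the exact mechanism Immergo uses:

```python
import asyncio

def put_latest(queue: asyncio.Queue, chunk: bytes) -> None:
    # If the consumer has fallen behind, evict the oldest chunk so
    # playback stays close to real time (illustrative tactic only).
    if queue.full():
        queue.get_nowait()
    queue.put_nowait(chunk)
```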
2. Interrupt Handling
Users often speak over the AI.
We had to ensure the system could:
Stop playback
Immediately switch back to listening
3. Persona Consistency
Without strict instructions, the AI sometimes broke character.
We solved this using structured system prompts and strict mission constraints.
Accomplishments We’re Proud Of
We successfully built:
A real-time AI conversation engine
A roleplay-based language learning system
A low-latency voice AI experience
A mission-driven learning structure
The biggest achievement was making the AI feel like a real conversational partner.
What We Learned
Building Immergo taught us several lessons.
Voice AI changes everything
When AI speaks instead of typing, the experience becomes far more engaging.
Context matters
Roleplay scenarios dramatically improve language retention.
Learners remember phrases better when they are tied to real situations.
Gemini Live is powerful
The streaming capabilities make it possible to build interactive AI experiences that feel natural.
What’s Next for Immergo
We’re excited about the future roadmap.
Upcoming features include:
Multiplayer roleplay scenarios
Visual environments (VR / AR)
Adaptive difficulty missions
Pronunciation scoring
AI-driven learning paths
Ultimately, our goal is simple:
Make language learning feel like living in the language.
#GeminiLiveAgentChallenge