Earlier today, I shared a demo using Gemini Live and Stream’s real-time Video API to help users improve their golf game. In this blog post, we’ll look at how that integration came to life and the steps developers can take to build similar Video AI applications using LLMs and WebRTC-based Video APIs.
To get started, we will use Stream’s free Video API as the underlying platform for low-latency video streaming and the latest version of Google’s Gemini Live API. For this demo, we will use a simple React frontend to capture the video from our device’s camera, but you can follow along in any of the client-side SDKs Stream offers. The magic for this integration happens on the backend.
First, let’s create our various accounts. As mentioned, we will need a free account from Stream to access their Video API. Stream offers real-time APIs across Chat, Activity Feeds, Moderation, and Video. A single account can be used across all of these products, but we are only interested in the Video part for this demo.
Once you have a Stream account, you will then create a project. Feel free to name this whatever you like, and for the server location, pick the one closest to your city 🙂.
Getstream.io dashboard for video and voice applications
Next, we need to create an account on Google’s AI Studio to get an API key. You’ll need to add a credit card to the account for billing. The rates can be found here.
For this project, we will be using Python version 3.12.11 with uv as our package manager of choice.
To install Python 3.12 with uv, you can run the following:
uv python install 3.12
Next, we can create a new project for us to work in:
uv init <your-project-name>
With our project created, we can install the Stream Python SDK and the required dependencies:
uv add "getstream[webrtc]" --prerelease=allow
uv add "python-dotenv" "aiortc" "numpy" "pillow>=11.3.0" "opencv-python>=4.12.0.88" "google-genai>=1.20.0" "pyaudio>=0.2.14"
Finally, we can create a .env for the project and move on to integrating Stream Video.
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GOOGLE_API_KEY=your_gemini_api_key
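Since we installed python-dotenv, here is a minimal sketch of loading these values before any client is created (this assumes everything lives in a single script; the assert messages are just illustrative guard rails):
import os
from dotenv import load_dotenv

# Load STREAM_API_KEY, STREAM_API_SECRET, and GOOGLE_API_KEY from .env
load_dotenv()
assert os.getenv("STREAM_API_KEY"), "Missing STREAM_API_KEY in .env"
assert os.getenv("STREAM_API_SECRET"), "Missing STREAM_API_SECRET in .env"
assert os.getenv("GOOGLE_API_KEY"), "Missing GOOGLE_API_KEY in .env"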
We’ll create a call with two participants: the player and the AI bot.
from getstream.stream import Stream
from uuid import uuid4

# Reads STREAM_API_KEY and STREAM_API_SECRET from the environment
client = Stream.from_env()

# Create (or fetch) a call for this session
call_id = f"video-ai-example-{str(uuid4())}"
call = client.video.call("default", call_id)
call.get_or_create(data={"created_by_id": "ai-example"})

# Two participants: the human player and the AI bot
player_user_id = f"player-{uuid4().hex[:8]}"
ai_user_id = f"ai-{uuid4().hex[:8]}"
We use rtc.join() to join the call, then set up listeners for new tracks. The track_added callback tells us when the player publishes a video (or audio) track.
async with await rtc.join(call, ai_user_id) as ai_connection:
    ai_connection.on(
        "track_added",
        lambda track_id, track_type, user: asyncio.create_task(
            on_track_added(track_id, track_type, user, player_user_id, ai_connection, audio_in_queue)
        ),
    )
When a video track comes in, we receive frames at 2 fps and send them to Gemini:
video_frame: aiortc.mediastreams.VideoFrame = await track.recv()
img = video_frame.to_image()
await session.send_realtime_input(media=img)
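The snippet above handles a single frame. A rough sketch of the full receive loop with the 2 fps pacing (the forward_frames name and the throttling approach are my own; the original may do this differently):
import time

async def forward_frames(track, session, fps: float = 2.0):
    # Continuously pull frames from the player's track, but only forward ~fps of them
    interval = 1.0 / fps
    last_sent = 0.0
    while True:
        video_frame = await track.recv()  # keep the decoder drained even when we skip frames
        now = time.monotonic()
        if now - last_sent < interval:
            continue
        await session.send_realtime_input(media=video_frame.to_image())
        last_sent = now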
The Gemini session is configured with a specific prompt and response modality:
from google.genai import types
from google.genai.types import MediaResolution, Modality

gemini_config = types.LiveConnectConfig(
    response_modalities=[Modality.AUDIO],
    system_instruction="You are Jeff, a mini-golf expert...",
    media_resolution=MediaResolution.MEDIA_RESOLUTION_MEDIUM,
    temperature=0.1,
)
You connect with:
import os
from google import genai

gemini_client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
# live.connect() is an async context manager; the rest of the pipeline runs inside this block
async with gemini_client.aio.live.connect(
    model="models/gemini-live-2.5-flash-preview",
    config=gemini_config,
) as session:
    ...
Gemini streams audio responses back. We feed them into an audio track published on the call so the player hears the bot’s voice:
async def play_audio(audio_in_queue, ai_connection):
    # Publish a 24 kHz, mono, 16-bit PCM track for the bot's voice
    audio = AudioStreamTrack(framerate=24000, stereo=False, format="s16")
    await ai_connection.add_tracks(audio=audio)
    while True:
        bytestream = await audio_in_queue.get()
        await audio.write(bytestream)
The session receives chunks in a loop:
async for response in session.receive():
    if data := response.data:
        audio_in_queue.put_nowait(data)
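These pieces run concurrently on the same event loop. One way to tie the receive loop and play_audio together inside the Gemini session block (run_session and the use of asyncio.TaskGroup are my own sketch, not the original code):
import asyncio

async def run_session(session, ai_connection):
    audio_in_queue = asyncio.Queue()

    async def receive_audio():
        # Push Gemini's audio chunks onto the queue that play_audio drains
        async for response in session.receive():
            if data := response.data:
                audio_in_queue.put_nowait(data)

    # Run the producer and consumer side by side until the session ends
    # (the frame forwarder from earlier would be started the same way once the track arrives)
    async with asyncio.TaskGroup() as tg:
        tg.create_task(receive_audio())
        tg.create_task(play_audio(audio_in_queue, ai_connection))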
If you want the AI to also respond to the player's voice:
if user.user_id == player_user_id and g_session:
    await g_session.send_realtime_input(
        audio=types.Blob(
            data=pcm.samples.astype(np.int16).tobytes(),
            mime_type="audio/pcm;rate=48000",
        )
    )
You can hook this into on_audio and on_pcm from your Stream RTC connection.
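As a sketch, that hookup might look like the following; the "audio" event name and the (pcm, user) callback signature are assumptions on my part, modeled on the track_added listener from earlier:
import asyncio
import numpy as np
from google.genai import types

async def on_pcm(pcm, user):
    # Forward the player's microphone audio to Gemini as 16-bit PCM at 48 kHz
    if user.user_id == player_user_id and g_session:
        await g_session.send_realtime_input(
            audio=types.Blob(
                data=pcm.samples.astype(np.int16).tobytes(),
                mime_type="audio/pcm;rate=48000",
            )
        )

# Event name is an assumption; registration follows the same .on(...) pattern as track_added
ai_connection.on("audio", lambda pcm, user: asyncio.create_task(on_pcm(pcm, user)))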
Once the session ends, clean up:
client.delete_users([player_user_id, ai_user_id])
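To make sure this happens even if the session errors out, the cleanup can live in a finally block around the main coroutine (main and run_demo are illustrative names, not part of the SDK):
import asyncio

async def main():
    try:
        await run_demo()  # illustrative: whatever wires the call, the Gemini session, and the tasks together
    finally:
        # Remove the temporary users created for this session
        client.delete_users([player_user_id, ai_user_id])

if __name__ == "__main__":
    asyncio.run(main())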
This was a fun exploration into fusing real-time video, LLMs, and synthetic audio. It’s not production-grade coaching yet, but it’s surprisingly responsive and useful for prototyping sport/exercise feedback tools.
There are some limitations and quirks to be aware of before taking this further.
I am personally very excited about video AI. Text and voice only capture part of the story; AI that can see and react to the world around you is the foundation for some very cool use cases, such as robotics, wearables, and avatars.
The example in this blog post is meant to be modular; you can easily swap out Gemini for another model or replace the mini-golf prompt with one for yoga, boxing, or even ASL recognition. If you decide to build something similar, please share it. I am @Nash0x7e2 in most places. :)