Earlier today, I shared a demo using Gemini Live and Stream’s real-time Video API to help users improve their golf game. In this blog post, we’ll look at how that integration came to life and the steps developers can take to build similar Video AI applications using LLMs and WebRTC-based Video APIs.
To get started, we will use Stream’s free Video API as the underlying platform for low-latency video streaming and the latest version of Google’s Gemini Live API. For this demo, we will use a simple React frontend to capture the video from our device’s camera, but you can follow along in any of the client-side SDKs Stream offers. The magic for this integration happens on the backend.
First, let’s create our various accounts. As mentioned, we will need a free account from Stream to access their Video API. Stream offers real-time APIs across Chat, Activity Feeds, Moderation, and Video. A single account can be used across all of these products, but we are only interested in the Video part for this demo.
Once you have a Stream account, you will then create a project. Feel free to name this whatever you like, and for the server location, pick the one closest to your city 🙂.
Getstream.io dashboard for video and voice applications
Next, we need to create an account on Google’s AI Studio to get an API key. You’ll need to add a credit card to the account for billing. The rates can be found here.
For this project, we will be using Python version 3.12.11 with uv as our package manager of choice.
To install Python 3.12 with uv, you can run the following:
uv python install 3.12
Next, we can create a new project for us to work in:
uv init <your-project-name>
With our project created, we can install the Stream Python SDK and the required dependencies:
uv add "getstream[webrtc]" --prerelease=allow
uv add "python-dotenv" "aiortc" "numpy" "pillow>=11.3.0" "opencv-python>=4.12.0.88" "google-genai>=1.20.0" "pyaudio>=0.2.14"
Finally, we can create a .env for the project and move on to integrating Stream Video.
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GOOGLE_API_KEY=your_gemini_api_key
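Since we installed python-dotenv, here is a minimal sketch of loading these values before any client is created (this assumes everything lives in a single script; the assert messages are just illustrative guard rails):
import os
from dotenv import load_dotenv

# Load STREAM_API_KEY, STREAM_API_SECRET, and GOOGLE_API_KEY from .env
load_dotenv()
assert os.getenv("STREAM_API_KEY"), "Missing STREAM_API_KEY in .env"
assert os.getenv("STREAM_API_SECRET"), "Missing STREAM_API_SECRET in .env"
assert os.getenv("GOOGLE_API_KEY"), "Missing GOOGLE_API_KEY in .env"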
We’ll create a call with two participants: the player and the AI bot.
from getstream.stream import Stream
from uuid import uuid4

# Reads STREAM_API_KEY and STREAM_API_SECRET from the environment
client = Stream.from_env()

# Create (or fetch) a call for this session
call_id = f"video-ai-example-{str(uuid4())}"
call = client.video.call("default", call_id)
call.get_or_create(data={"created_by_id": "ai-example"})

# Two participants: the human player and the AI bot
player_user_id = f"player-{uuid4().hex[:8]}"
ai_user_id = f"ai-{uuid4().hex[:8]}"
We use rtc.join() to join the call, then set up listeners for new tracks. The track_added callback tells us when the player publishes a video (or audio) track.
async with await rtc.join(call, ai_user_id) as ai_connection:
    ai_connection.on(
        "track_added",
        lambda track_id, track_type, user: asyncio.create_task(
            on_track_added(track_id, track_type, user, player_user_id, ai_connection, audio_in_queue)
        ),
    )
When a video track comes in, we receive frames at 2 fps and send them to Gemini:
video_frame: aiortc.mediastreams.VideoFrame = await track.recv()
img = video_frame.to_image()
await session.send_realtime_input(media=img)
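The snippet above handles a single frame. A rough sketch of the full receive loop with the 2 fps pacing (the forward_frames name and the throttling approach are my own; the original may do this differently):
import time

async def forward_frames(track, session, fps: float = 2.0):
    # Continuously pull frames from the player's track, but only forward ~fps of them
    interval = 1.0 / fps
    last_sent = 0.0
    while True:
        video_frame = await track.recv()  # keep the decoder drained even when we skip frames
        now = time.monotonic()
        if now - last_sent < interval:
            continue
        await session.send_realtime_input(media=video_frame.to_image())
        last_sent = now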
The Gemini session is configured with a specific prompt and response modality:
from google.genai import types
from google.genai.types import MediaResolution, Modality

gemini_config = types.LiveConnectConfig(
    response_modalities=[Modality.AUDIO],
    system_instruction="You are Jeff, a mini-golf expert...",
    media_resolution=MediaResolution.MEDIA_RESOLUTION_MEDIUM,
    temperature=0.1,
)
You connect with:
import os
from google import genai

gemini_client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
# live.connect() is an async context manager; the rest of the pipeline runs inside this block
async with gemini_client.aio.live.connect(
    model="models/gemini-live-2.5-flash-preview",
    config=gemini_config,
) as session:
    ...
Gemini streams audio responses back. We feed them into an audio track published on the call so the player hears the bot’s voice:
async def play_audio(audio_in_queue, ai_connection):
    # Publish a 24 kHz, mono, 16-bit PCM track for the bot's voice
    audio = AudioStreamTrack(framerate=24000, stereo=False, format="s16")
    await ai_connection.add_tracks(audio=audio)
    while True:
        bytestream = await audio_in_queue.get()
        await audio.write(bytestream)
The session receives chunks in a loop:
async for response in session.receive():
    if data := response.data:
        audio_in_queue.put_nowait(data)
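These pieces run concurrently on the same event loop. One way to tie the receive loop and play_audio together inside the Gemini session block (run_session and the use of asyncio.TaskGroup are my own sketch, not the original code):
import asyncio

async def run_session(session, ai_connection):
    audio_in_queue = asyncio.Queue()

    async def receive_audio():
        # Push Gemini's audio chunks onto the queue that play_audio drains
        async for response in session.receive():
            if data := response.data:
                audio_in_queue.put_nowait(data)

    # Run the producer and consumer side by side until the session ends
    # (the frame forwarder from earlier would be started the same way once the track arrives)
    async with asyncio.TaskGroup() as tg:
        tg.create_task(receive_audio())
        tg.create_task(play_audio(audio_in_queue, ai_connection))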
If you want the AI to also respond to the player's voice:
if user.user_id == player_user_id and g_session:
    await g_session.send_realtime_input(
        audio=types.Blob(
            data=pcm.samples.astype(np.int16).tobytes(),
            mime_type="audio/pcm;rate=48000",
        )
    )
You can hook this into on_audio and on_pcm from your Stream RTC connection.
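As a sketch, that hookup might look like the following; the "audio" event name and the (pcm, user) callback signature are assumptions on my part, modeled on the track_added listener from earlier:
import asyncio
import numpy as np
from google.genai import types

async def on_pcm(pcm, user):
    # Forward the player's microphone audio to Gemini as 16-bit PCM at 48 kHz
    if user.user_id == player_user_id and g_session:
        await g_session.send_realtime_input(
            audio=types.Blob(
                data=pcm.samples.astype(np.int16).tobytes(),
                mime_type="audio/pcm;rate=48000",
            )
        )

# Event name is an assumption; registration follows the same .on(...) pattern as track_added
ai_connection.on("audio", lambda pcm, user: asyncio.create_task(on_pcm(pcm, user)))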
Once the session ends, clean up:
client.delete_users([player_user_id, ai_user_id])
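To make sure this happens even if the session errors out, the cleanup can live in a finally block around the main coroutine (main and run_demo are illustrative names, not part of the SDK):
import asyncio

async def main():
    try:
        await run_demo()  # illustrative: whatever wires the call, the Gemini session, and the tasks together
    finally:
        # Remove the temporary users created for this session
        client.delete_users([player_user_id, ai_user_id])

if __name__ == "__main__":
    asyncio.run(main())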
This was a fun exploration into fusing real-time video, LLMs, and synthetic audio. It’s not production-grade coaching yet, but it’s surprisingly responsive and useful for prototyping sport/exercise feedback tools.
There are some limitations and quirks to be aware of before taking this further.
I am personally very excited about video AI. Text and voice only capture part of the story; AI that can see and react to the world around you is the foundation for some very cool use cases, such as robotics, wearables, and avatars.
The example in this blog post is meant to be modular; you can easily swap out Gemini for another model or replace the mini-golf prompt with one for yoga, boxing, or even ASL recognition. If you decide to build something similar, please share it. I am @Nash0x7e2 in most places. :)