Real-time lip sync for AI avatars: the viseme pipeline explained

How real-time lip sync works in the browser: visemes, phoneme mapping, character-level TTS timestamps, and where the naive approaches break.

When an AI avatar speaks, its mouth has to move with the audio. Drift by 60 milliseconds and the uncanny-valley effect hits. Users look away.

Getting this right in real time is harder than it looks. You're synchronizing three clocks that don't share a source: text from an LLM, audio from a TTS model, and an animation loop running at 60 frames per second in a browser. Each stage adds latency. Each introduces jitter. Browsers don't guarantee frame timing.

Below: how visemes work, the usual approaches, where they break, and how to think about putting the pipeline together so timing feels right.

What is a viseme?

A viseme is the visual mouth shape that corresponds to a phoneme. A phoneme is what you hear (/p/, /æ/, /t/). A viseme is what you see: lips pressed together, mouth open wide, tongue behind teeth.

Multiple phonemes share one viseme. The phonemes /p/, /b/, and /m/ all produce the same closed-lip shape, so they map to a single viseme. English usually uses somewhere between 15 and 20 visemes. The Disney and Preston Blair canonical charts use 10 to 12.

You can sync lips convincingly with 15 good blend shapes and transitions between them. You don't need 40 targets.
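
A minimal TypeScript sketch of that many-to-one mapping. The viseme labels borrow the names of the Oculus/Meta 15-viseme set; the ARPAbet phoneme groupings and the `visemeFor` helper are illustrative assumptions rather than a canonical chart, and the table is deliberately partial.

```typescript
// Viseme labels named after the Oculus/Meta 15-viseme set.
type Viseme =
  | "sil" | "PP" | "FF" | "TH" | "DD" | "kk" | "CH" | "SS"
  | "nn" | "RR" | "aa" | "E" | "ih" | "oh" | "ou";

// Illustrative, partial grouping of ARPAbet phonemes (assumption, not a standard).
// Several phonemes collapse onto one mouth shape.
const PHONEME_TO_VISEME: Record<string, Viseme> = {
  P: "PP", B: "PP", M: "PP",   // closed lips
  F: "FF", V: "FF",            // lower lip against upper teeth
  TH: "TH", DH: "TH",
  T: "DD", D: "DD",
  K: "kk", G: "kk",
  CH: "CH", JH: "CH", SH: "CH",
  S: "SS", Z: "SS",
  N: "nn", L: "nn",
  R: "RR",
  AA: "aa", AE: "aa",
  EH: "E",
  IH: "ih", IY: "ih",
  OW: "oh", AO: "oh",
  UW: "ou",
};

function visemeFor(phoneme: string): Viseme {
  // Strip ARPAbet stress digits ("AE1" -> "AE"); unknown symbols fall back to silence.
  const base = phoneme.toUpperCase().replace(/[0-9]/g, "");
  return PHONEME_TO_VISEME[base] ?? "sil";
}
```

With this table, `visemeFor("P")`, `visemeFor("B")`, and `visemeFor("M")` all return the same closed-lip target, which is why a rig with about 15 shapes can cover the full phoneme inventory.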

The classical pipeline

The traditional pipeline looks like this:

1. Audio arrives from a TTS model.
2. A model such as NVIDIA Audio2Face or Oculus OVR LipSync analyzes the waveform and predicts viseme weights frame by frame.
3. The avatar blends its mouth shapes according to those weights.

This approach is called audio-to-face. Most production systems use some version of it. HeyGen, D-ID, and Synthesia all rely on this family of techniques for pre-rendered video.

It has problems when you push it to real time in a browser.

Latency budget is tight. You have roughly 200 to 400 milliseconds between "user stops talking" and "avatar starts responding" before the conversation feels broken. The audio-to-face model adds 20 to 80 ms of inference per chunk. That's on top of TTS generation and network transport.

Model inference in the browser is expensive. Running a viseme predictor via WebGPU or WASM uses CPU and GPU budget you'd rather spend on rendering the avatar itself.

Chunked audio is noisy. When TTS streams audio in small chunks, the model sees short windows and produces unstable predictions at the boundaries. You get visible jitter in the mouth shape.

The character-timestamp approach

Some modern TTS providers emit character-level timestamps alongside the audio stream. Each character in the output text comes with a start time and end time in milliseconds, measured against the audio's own clock.

This changes the problem. Instead of extracting visemes from audio after the fact, you generate them from the text before the audio plays, using a grapheme-to-phoneme-to-viseme mapping.

The flow:

1. LLM produces text, say "Hello".
2. TTS streams audio and per-character timestamps: H: 0 to 50 ms, e: 50 to 100 ms, l: 100 to 200 ms, l: 200 to 300 ms, o: 300 to 450 ms.
3. A lookup table converts each character or grapheme cluster into a phoneme, then into a viseme.
4. The animation layer schedules viseme transitions at the exact millisecond the audio will play them.

Timing comes from the TTS system itself, so it's as precise as the timestamps the provider emits. No inference at playback time. No extra model on the client. And because cues are scheduled on the audio clock, they don't drift.
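
As a rough sketch of that flow, the TypeScript below turns the "Hello" timestamps into a list of viseme cues. The `CharTiming` shape, the descriptive viseme labels, and the letter table are all assumptions made for illustration; real providers each define their own alignment payload. The lookup is intentionally the naive per-letter kind, so it only behaves on words where spelling and sound line up.

```typescript
interface CharTiming { char: string; startMs: number; endMs: number; }
interface VisemeCue { viseme: string; startMs: number; endMs: number; }

// Hypothetical per-letter table with descriptive labels; a real pipeline would
// go through phonemes rather than letters.
const LETTER_TO_VISEME = new Map<string, string>([
  ["h", "open"], ["e", "wide"], ["l", "tongue"], ["o", "round"],
  ["p", "closed"], ["b", "closed"], ["m", "closed"],
]);

function scheduleFromAlignment(chars: CharTiming[]): VisemeCue[] {
  const cues: VisemeCue[] = [];
  for (const c of chars) {
    // Unmapped characters (spaces, punctuation, unknown letters) relax to "rest".
    const viseme = LETTER_TO_VISEME.get(c.char.toLowerCase()) ?? "rest";
    const last = cues[cues.length - 1];
    if (last && last.viseme === viseme && last.endMs === c.startMs) {
      last.endMs = c.endMs; // same shape back to back: extend instead of re-triggering
    } else {
      cues.push({ viseme, startMs: c.startMs, endMs: c.endMs });
    }
  }
  return cues;
}

// The "Hello" example from the flow above.
const hello: CharTiming[] = [
  { char: "H", startMs: 0, endMs: 50 },
  { char: "e", startMs: 50, endMs: 100 },
  { char: "l", startMs: 100, endMs: 200 },
  { char: "l", startMs: 200, endMs: 300 },
  { char: "o", startMs: 300, endMs: 450 },
];

console.log(scheduleFromAlignment(hello));
// Four cues: open 0-50, wide 50-100, tongue 100-300, round 300-450
```

Merging the two l's into one cue is a design choice: it keeps the mouth from re-triggering the same shape mid-hold.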

Where this still breaks

Character-timestamp mapping isn't free. Three real gotchas.

Graphemes vs phonemes. English orthography is unreliable. "Thought" has seven letters and three phonemes. A naive character-to-viseme map produces the wrong shapes. You need a grapheme-to-phoneme model, or access to the TTS engine's own pronunciation data, to map accurately.

Silent letters and elisions. "Knight" has no /k/ sound. "Wednesday" drops its first d. A naive map will try to produce shapes that shouldn't be there.

Co-articulation. Real mouths don't snap between shapes. The lips anticipate the next sound. A good pipeline blends adjacent visemes rather than stepping hard between them.

Solid implementations handle the first two by pulling phoneme alignment from the TTS stream when it's available, and falling back to a public pronouncing dictionary when it isn't. The third is smoothing: interpolate between viseme targets using a short ease curve (around 40 ms works) so the mouth feels continuous instead of stop-motion.
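
One way to sketch the fallback for the first two gotchas in TypeScript: prefer phonemes from the TTS stream, otherwise look the word up, then spread the word's time span across its phonemes. The three-entry `PRONUNCIATIONS` map is a stand-in written in the ARPAbet style of public pronouncing dictionaries, and the equal-duration split is a simplifying assumption (real phonemes are not equally long). The word span itself is assumed to come from the first and last character timestamps of the word.

```typescript
interface WordTiming { word: string; startMs: number; endMs: number; }
interface PhonemeCue { phoneme: string; startMs: number; endMs: number; }

// Tiny stand-in for a pronouncing dictionary (ARPAbet, stress digits removed).
const PRONUNCIATIONS = new Map<string, string[]>([
  ["thought", ["TH", "AO", "T"]],                // seven letters, three phonemes
  ["knight", ["N", "AY", "T"]],                  // no K
  ["wednesday", ["W", "EH", "N", "Z", "D", "EY"]], // first d is silent
]);

// Prefer phoneme alignment from the TTS stream; fall back to the dictionary.
function phonemesFor(word: string, fromTts?: string[]): string[] | undefined {
  if (fromTts && fromTts.length > 0) return fromTts;
  return PRONUNCIATIONS.get(word.toLowerCase());
}

// Assumption: split the word's duration evenly across its phonemes.
function phonemeCues(w: WordTiming, fromTts?: string[]): PhonemeCue[] {
  const phonemes = phonemesFor(w.word, fromTts);
  if (!phonemes) return []; // unknown word: caller decides (e.g. naive letter map)
  const step = (w.endMs - w.startMs) / phonemes.length;
  return phonemes.map((phoneme, i) => ({
    phoneme,
    startMs: w.startMs + i * step,
    endMs: w.startMs + (i + 1) * step,
  }));
}

console.log(phonemeCues({ word: "Thought", startMs: 0, endMs: 300 }));
// Three cues of 100 ms each: TH, AO, T
```

The cues that come out are phonemes, not visemes; they still need a phoneme-to-viseme table before they reach the rig.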

Putting it together

At a high level, a speaking avatar in the browser needs five things running together:

1. End-of-utterance detection that knows when the user has stopped speaking (not just paused).
2. An LLM that streams text as it generates.
3. A TTS that streams audio and emits per-character alignment alongside it.
4. A scheduler that converts alignment events into viseme targets on the audio playback clock.
5. A frame-synced blend loop that drives the avatar's mouth toward the target every frame.

The critical design point is that timing is owned by the audio stream, not inferred after the fact. Every other clock in the system (network transport, animation, rendering) sits under the audio timeline. That's what makes the avatar feel present instead of mechanical.
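
To make "timing is owned by the audio stream" concrete, here is a hedged TypeScript sketch of the last two pieces: a function that computes blend weights for a given audio time, and a loop that asks the audio clock for the time on every frame. The names `weightsAt`, `startBlendLoop`, and the `apply` hook are hypothetical; the 40 ms crossfade uses a smoothstep curve, which is one reasonable ease among many. The clock and frame scheduler are passed in so the logic can run outside a browser.

```typescript
interface VisemeCue { viseme: string; startMs: number; }

// Smoothstep: 0 at t<=0, 1 at t>=1, eased in between.
function ease(t: number): number {
  const x = Math.min(1, Math.max(0, t));
  return x * x * (3 - 2 * x);
}

// Blend weights at audio time tMs. Cues must be sorted by startMs.
// During the first easeMs of a cue, the previous viseme fades out as the new one fades in.
function weightsAt(cues: VisemeCue[], tMs: number, easeMs = 40): Map<string, number> {
  const weights = new Map<string, number>();
  let previous: VisemeCue | undefined;
  let current: VisemeCue | undefined;
  for (const cue of cues) {
    if (cue.startMs > tMs) break;
    previous = current;
    current = cue;
  }
  if (!current) return weights; // nothing has started yet: neutral face
  const w = ease((tMs - current.startMs) / easeMs);
  weights.set(current.viseme, w);
  if (previous && w < 1) {
    weights.set(previous.viseme, (weights.get(previous.viseme) ?? 0) + (1 - w));
  }
  return weights;
}

// The loop never keeps its own timer; it reads the audio clock each frame.
function startBlendLoop(
  cues: VisemeCue[],
  audioTimeMs: () => number,                      // in a browser, derived from AudioContext.currentTime
  apply: (weights: Map<string, number>) => void,  // hypothetical hook into the avatar rig
  nextFrame: (cb: () => void) => void,            // in a browser, requestAnimationFrame
): void {
  const tick = () => {
    apply(weightsAt(cues, audioTimeMs()));
    nextFrame(tick);
  };
  nextFrame(tick);
}
```

Because `weightsAt` is a pure function of audio time, a dropped or late frame only skips ahead on the curve; it cannot accumulate error the way a per-frame timer would.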

Want to see a working version live? The Quickstart gets a companion with voice and lip sync on your site in about five minutes. The Voice Setup guide covers the full voice configuration.

Why browser, not server

Fair question. Why do lip sync in the browser at all? Couldn't you render the whole avatar server-side and stream it as video?

You could. Two reasons the browser wins for this shape of product:

Latency. Every server round-trip adds 50 to 150 ms. That is only about 2 to 5 percent of a three-second response, but measured against the 200 to 400 ms response budget it is a large slice of the time you have. You feel it. WebSockets carrying alignment events cost effectively nothing.

Bandwidth. A 720p avatar video at 30 fps runs 1 to 3 Mbps continuously. Alignment events over a WebSocket are around 1 KB per second.
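
The gap is easier to see as arithmetic. The TypeScript below just restates the two figures above in common units, taking 1 KB as 1000 bytes (an assumption). It compares only the video stream against the alignment events; audio has to travel either way.

```typescript
// Figures from the comparison above.
const VIDEO_BPS_LOW = 1000000;   // 1 Mbps
const VIDEO_BPS_HIGH = 3000000;  // 3 Mbps
const ALIGNMENT_BPS = 1000 * 8;  // about 1 KB/s, expressed in bits per second

function megabytesPerMinute(bitsPerSecond: number): number {
  return (bitsPerSecond * 60) / 8 / 1000000;
}

const ratioLow = VIDEO_BPS_LOW / ALIGNMENT_BPS;   // 125
const ratioHigh = VIDEO_BPS_HIGH / ALIGNMENT_BPS; // 375

console.log(`video uses ${ratioLow}x to ${ratioHigh}x the bandwidth of alignment events`);
console.log(`video per minute: ${megabytesPerMinute(VIDEO_BPS_LOW)} to ${megabytesPerMinute(VIDEO_BPS_HIGH)} MB`);
console.log(`alignment per minute: ${megabytesPerMinute(ALIGNMENT_BPS)} MB`);
```

That works out to roughly 7.5 to 22.5 MB of video per minute against about 0.06 MB of alignment events.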

Server-side rendering wins if the avatar is photoreal 4K on an ancient laptop. Most embedded use cases want stylized 60 fps avatars. For that, the browser wins on every axis that matters. More on why we went this way in the SDK guide.

FAQ

What's the difference between a viseme and a phoneme? A phoneme is an audible unit of speech. A viseme is the visual mouth shape that corresponds to it. Multiple phonemes can share one viseme because a viewer can't tell them apart by sight. /p/ and /b/, for example, both look like closed lips.

Can I use NVIDIA Audio2Face for real-time lip sync? You can, but it adds meaningful inference cost at playback time. For real-time conversational use in the browser, pulling timing from the TTS stream tends to be faster and more accurate than running a separate audio-to-face model.

Does this work with custom avatars? Yes. Any avatar with blend-shape targets that cover a standard viseme set (Oculus/Meta's 15-viseme set, the Disney canonical set, or the mouth-related shapes among ARKit's 52 face blend shapes) can be driven by an alignment-event stream.

Which TTS providers offer character-level timestamps? Several. ElevenLabs is one. Azure's Speech service exposes similar alignment data through viseme and word-boundary events. OpenAI's TTS API doesn't provide character-level alignment at the time of writing.

How much latency does a timestamp-based viseme pipeline add? Effectively zero on top of TTS, because timestamps arrive with the audio. Viseme transitions are scheduled against the audio clock so they don't drift.

Is character-timestamp lip sync better than audio-to-face? For real-time browser use, yes. Audio-to-face models are strong for pre-rendered video where inference cost is fine. For live avatars with a tight latency budget, pulling timing from the TTS is faster and more accurate.