End-to-End Latency
The total time from when a caller finishes speaking to when the AI begins its response.
End-to-end latency (also called voice-to-voice latency) is the total elapsed time from the moment a caller stops speaking to the moment they hear the AI begin its reply. It is the single most important quality metric in conversational voice AI — humans perceive pauses longer than roughly 800 ms as awkward, and conversation feels natural below about 400 ms.
The latency budget
The total is the sum of the latencies of every pipeline stage, so each stage gets its own budget:
- Endpointing (detecting the caller finished): ~50–200 ms
- ASR/STT finalizing the transcript: ~50–150 ms
- LLM producing the first token: ~150–300 ms
- TTS producing the first audio chunk: ~75–150 ms
- Network and buffering: ~50–100 ms
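Adding up the ranges above shows how little slack there is. The following is a minimal sketch that sums the low and high ends of each stage budget and flags stages that exceed their allowance. The stage keys, the helper names `total_range` and `over_budget`, and the sample measurements are illustrative assumptions, not part of any particular framework.

```python
# Per-stage budgets in milliseconds as (low, high), copied from the list above.
# The dictionary keys are hypothetical names chosen for this sketch.
BUDGET_MS = {
    "endpointing": (50, 200),
    "asr_finalization": (50, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (75, 150),
    "network_and_buffering": (50, 100),
}


def total_range(budget):
    """Return (best_case, worst_case) by summing the low and high ends."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """Return the stages whose measured latency exceeds the high end of their budget."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]


best, worst = total_range(BUDGET_MS)
print(f"best case: {best} ms, worst case: {worst} ms")

# Made-up measurements for one conversational turn, for illustration only.
sample_turn = {
    "endpointing": 120,
    "asr_finalization": 90,
    "llm_first_token": 450,
    "tts_first_audio": 100,
    "network_and_buffering": 60,
}
print("over budget:", over_budget(sample_turn, BUDGET_MS))
```

With these ranges the best case is 375 ms and the worst case is 900 ms. Only the low end of every range fits under the roughly 400 ms target, and the high end overshoots the roughly 800 ms point at which pauses start to feel awkward, so every stage has to run near the bottom of its range at the same time.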
How low latency is achieved
Production systems reach sub-400 ms latency by combining pipeline parallelism with per-stage tuning: streaming the LLM's tokens into TTS so that audio synthesis starts before the full response is generated (streaming TTS), co-locating models to remove network hops, keeping model instances warm, and tuning jitter buffers. A single slow stage blows the entire budget, so latency is managed end to end, not per component.
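The LLM-to-TTS streaming step can be sketched as a small generator that hands text to the synthesizer at clause boundaries instead of waiting for the whole reply. This is a simplified illustration under stated assumptions: `chunk_for_tts`, `fake_llm_stream`, the `min_chars` threshold, and the punctuation set are all hypothetical, and real streaming TTS engines differ in how they accept incremental text.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to cut; an assumption for this sketch.
CLAUSE_END = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield a speakable chunk as soon as a clause boundary arrives.

    Tokens are buffered until the text ends in clause punctuation and is at
    least `min_chars` long, so very short fragments are not synthesized alone.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text.endswith(CLAUSE_END) and len(text) >= min_chars:
            yield text
            buffer = ""
    # Flush whatever is left once the token stream ends.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical tokens)."""
    yield from [
        "Sure", ",", " I", " can", " help", " with", " that", ".",
        " What", " is", " your", " order", " number", "?",
    ]


for chunk in chunk_for_tts(fake_llm_stream()):
    # In a real pipeline this is where the chunk would be sent to TTS.
    print(chunk)
```

In this example the first chunk, "Sure, I can help with that.", is ready after 8 of the 14 tokens, so synthesis can begin while the model is still generating the second sentence. The chunk size is a trade-off: smaller chunks start audio sooner, while larger chunks give the synthesizer more context for prosody.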