Definition

End-to-End Latency

The total time from when a caller finishes speaking to when the AI begins its response.

End-to-end latency (also called voice-to-voice latency) is the total elapsed time from the moment a caller stops speaking to the moment they hear the AI begin its reply. It is the single most important quality metric in conversational voice AI — humans perceive pauses longer than roughly 800 ms as awkward, and conversation feels natural below about 400 ms.
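As a rough illustration of the definition, the metric can be computed from two timestamps: when the caller stopped speaking and when the first reply audio was heard. The function names and the "noticeable" label for the 400 to 800 ms band are assumptions made for this sketch; the thresholds are the approximate figures quoted above.

```python
# Approximate perceptual thresholds quoted in the definition above.
NATURAL_MS = 400   # below this, conversation tends to feel natural
AWKWARD_MS = 800   # above this, pauses tend to feel awkward


def voice_to_voice_ms(speech_end_s: float, first_audio_s: float) -> float:
    """Milliseconds between the caller's end of speech and the first AI audio."""
    if first_audio_s < speech_end_s:
        raise ValueError("first audio cannot precede end of speech")
    return (first_audio_s - speech_end_s) * 1000.0


def classify(latency_ms: float) -> str:
    """Bucket a latency value; the middle label is an assumption of this sketch."""
    if latency_ms < NATURAL_MS:
        return "natural"
    if latency_ms > AWKWARD_MS:
        return "awkward"
    return "noticeable"


latency = voice_to_voice_ms(speech_end_s=10.0, first_audio_s=10.25)
print(latency, classify(latency))
```

In practice both timestamps would need to come from the same clock, ideally measured where the caller actually hears the audio rather than on the server.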

The latency budget

The total is the sum of every pipeline stage, so each stage is assigned its own budget:

Endpointing (detecting the caller finished): ~50–200 ms
ASR/STT finalizing the transcript: ~50–150 ms
LLM producing the first token: ~150–300 ms
TTS producing the first audio chunk: ~75–150 ms
Network and buffering: ~50–100 ms
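Adding up the ranges above makes the constraint concrete. The sketch below sums the low and high ends of each stage and flags stages that exceed their range; the dictionary keys and helper names are hypothetical and chosen only for illustration.

```python
# Stage budgets in milliseconds as (low, high), copied from the list above.
BUDGET_MS = {
    "endpointing": (50, 200),
    "asr_finalization": (50, 150),
    "llm_first_token": (150, 300),
    "tts_first_audio": (75, 150),
    "network_buffering": (50, 100),
}


def total_budget(budget):
    """Return (best_case, worst_case) by summing the low and high ends."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """Return the stages whose measured latency exceeds the high end of their range."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]


best, worst = total_budget(BUDGET_MS)
print(best, worst)  # 375 900
```

The best case of about 375 ms sits just under the 400 ms mark and the worst case of about 900 ms is past the 800 ms mark, so a sub-400 ms target leaves almost no slack: nearly every stage has to run near the low end of its range.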

How low latency is achieved

Production systems hit sub-400 ms mainly through pipeline parallelism: streaming the LLM's tokens into TTS so that audio synthesis starts before the full response is generated (streaming TTS). They also co-locate models to remove network hops, keep model instances warm, and tune jitter buffers. A single slow stage blows the entire budget, so latency is managed end to end, not per component.
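The streaming idea can be sketched as a generator that groups LLM tokens into speakable chunks and hands each chunk to TTS as soon as a sentence boundary arrives, instead of waiting for the whole reply. This is a minimal sketch, not any particular vendor's implementation: `fake_llm_stream` is a hypothetical stand-in for a streaming LLM client, and splitting on end punctuation is a deliberate simplification (it would split after abbreviations such as "Dr.").

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a speakable chunk as soon as a sentence boundary arrives."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(SENTENCE_END):
            yield "".join(buffer).strip()
            buffer = []
    # Flush any trailing text that lacks end punctuation.
    tail = "".join(buffer).strip()
    if tail:
        yield tail


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    yield from ["Sure", ",", " I", " can", " help", ".",
                " What", " time", " works", "?"]


for chunk in chunk_for_tts(fake_llm_stream()):
    # A real pipeline would send each chunk to a streaming TTS engine here.
    print(chunk)
```

Because the first chunk is emitted after six tokens rather than ten, synthesis of "Sure, I can help." can begin while the model is still generating the second sentence. Smaller chunks (clauses or fixed token counts) start audio sooner, at some cost to prosody.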

Related Resources

What Is an AI Voice Agent? (Full Guide)
TurboCall AI Voice Agent
