Definition

Streaming TTS

A technique where audio synthesis begins before the full text is generated, reducing response latency.

Streaming TTS (text-to-speech) is a synthesis technique where the engine begins producing audio from the first words of a response before the complete text is available. It is one of the most important levers for low-latency voice AI because the caller no longer has to wait for the full response to be generated before hearing anything.

Streaming vs. batch synthesis

Batch TTS takes a finished block of text and returns one complete audio file. This is simple, but the caller waits for the entire response to be generated and synthesized before playback starts. Streaming TTS accepts text incrementally and emits audio chunks continuously, so the time to first audio (the delay between the first text arriving and the first audio chunk leaving the engine) can drop below 150 ms.
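To make the difference concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the per-token timings, the token counts, and the function names are illustrative assumptions rather than measurements from any real engine, and it ignores network and audio buffering delays.

```python
from dataclasses import dataclass


@dataclass
class PipelineTimings:
    """Assumed, illustrative timings; not measured from a real system."""
    token_ms: float            # time for the LLM to emit one token
    synth_ms_per_token: float  # TTS compute time per token of input text


def batch_first_audio_ms(t: PipelineTimings, total_tokens: int) -> float:
    # Batch: wait for every token, then synthesize the whole text
    # before any audio can be played.
    return total_tokens * t.token_ms + total_tokens * t.synth_ms_per_token


def streaming_first_audio_ms(t: PipelineTimings, first_chunk_tokens: int) -> float:
    # Streaming: wait only for the first chunk of text, then synthesize
    # just that chunk; the rest is produced while audio is playing.
    return first_chunk_tokens * t.token_ms + first_chunk_tokens * t.synth_ms_per_token


timings = PipelineTimings(token_ms=20.0, synth_ms_per_token=5.0)
batch_ms = batch_first_audio_ms(timings, total_tokens=60)
streaming_ms = streaming_first_audio_ms(timings, first_chunk_tokens=5)

print(f"batch: {batch_ms} ms, streaming: {streaming_ms} ms")
# prints "batch: 1500.0 ms, streaming: 125.0 ms"
```

In this model, batch latency grows with the length of the whole response, while streaming latency depends only on the size of the first chunk.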

Pairing with a streaming LLM

The real gain comes from chaining streaming components: as the LLM emits tokens, they are fed into the TTS engine, which starts speaking the beginning of the answer while the LLM is still writing the end. This pipeline parallelism is how production agents hit sub-400 ms end-to-end latency. The two main engineering challenges are chunk boundary handling (avoiding clipped or mispronounced words at chunk seams) and keeping prosody consistent across chunks so the speech does not sound stitched together.
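One common way to approach chunk boundaries is to buffer LLM tokens and hand text to the TTS engine only at punctuation, so that no word is split across a seam. The Python sketch below assumes this strategy. The names chunk_for_tts, speak_stream, and fake_synthesize are hypothetical, and fake_synthesize is a stand-in for a real TTS call, not any vendor's API.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to cut (an assumption; tune per language).
BOUNDARY_CHARS = ".!?,;:"


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into chunks that end on punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.rstrip()
        # Flush only at a boundary, and only once the chunk is long enough
        # to give the TTS engine some context for prosody.
        if stripped and len(stripped) >= min_chars and stripped[-1] in BOUNDARY_CHARS:
            yield stripped.strip()
            buffer = ""
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


def fake_synthesize(text: str) -> bytes:
    # Hypothetical stand-in for a real TTS call; returns placeholder bytes.
    return text.encode("utf-8")


def speak_stream(tokens: Iterable[str]) -> Iterator[bytes]:
    """Yield audio for each chunk as soon as that chunk is complete."""
    for chunk in chunk_for_tts(tokens):
        yield fake_synthesize(chunk)


llm_tokens = ["Sure", ",", " I", " can", " help", " with", " that", ".",
              " What", " time", " works", "?"]
for text_chunk in chunk_for_tts(llm_tokens):
    print(repr(text_chunk))
# prints 'Sure, I can help with that.' then 'What time works?'
```

The min_chars threshold trades latency for prosody: a smaller value gets audio out sooner, but gives the engine less context per chunk. The sketch also cuts naively at any trailing punctuation, so a decimal such as "3.5" arriving as separate tokens could be split. A production chunker would need extra rules for numbers and abbreviations.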