Definition

ASR (Automatic Speech Recognition)

Technology that converts spoken audio into written text in real time.

ASR (Automatic Speech Recognition) is the technology that converts spoken audio into written text. In an AI voice agent it is the first stage of the pipeline — the caller speaks, and ASR transcribes that audio into text the language model can reason over. The terms ASR and STT (speech-to-text) are used interchangeably; ASR is the older academic term, STT the more common product term.

Streaming vs. batch ASR

Voice agents require streaming ASR, which returns partial (interim) transcripts as the caller is still speaking, rather than waiting for them to finish. This is what allows the agent to detect when a turn is complete and respond within a few hundred milliseconds. Batch ASR — transcribing a complete recording after the fact — is used for call analytics, QA scoring, and compliance archives, where latency does not matter.

What separates good ASR for phone calls

  • Telephony audio handling: phone audio is narrowband (8 kHz) and noisy; models tuned on clean 16 kHz audio degrade badly without phone-specific training.
  • Endpointing: accurately detecting when the caller has stopped speaking, which drives turn-taking.
  • Custom vocabulary: boosting domain terms — drug names, SKUs, property names — to cut word error rate (WER).
  • Accent and language coverage: robustness across regional accents and code-switching.

Accuracy is measured in word error rate (WER) — the percentage of words inserted, deleted, or substituted versus a human transcript. Lower is better.