AI Voice
What is Latency?
Definition
The delay between a user speaking and the AI voice agent responding — the single most critical performance metric in voice AI, where delays above ~1.5 seconds feel unnatural.
In more detail
Voice AI latency is the sum of three pipeline stages: speech-to-text (STT) transcription time, LLM inference time (processing the transcript and generating a response), and text-to-speech (TTS) synthesis time. Each stage adds delay. End-to-end latency is what the user experiences — and anything above 1.5 seconds starts to feel like the system is struggling.
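The three-stage sum can be sketched in a few lines. The per-stage timings below are illustrative assumptions, not measured vendor figures:

```python
# A minimal sketch of the three-stage latency sum described above.
# Stage timings are illustrative assumptions, not measured values.

def end_to_end_latency_ms(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """Sum the three pipeline stages the user waits through."""
    return stt_ms + llm_ms + tts_ms

# Example: a non-streaming pipeline with plausible per-stage delays.
total = end_to_end_latency_ms(stt_ms=300, llm_ms=800, tts_ms=500)
print(f"{total:.0f} ms")   # 1600 ms, above the ~1500 ms threshold
print(total > 1500)        # True: this pipeline would feel unnatural
```

Even individually reasonable stage times compound past the threshold, which is why streaming (below) matters more than shaving any single stage.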
Streaming is the key optimisation at every stage. Streaming STT begins transcribing as the user speaks, sending partial transcripts before the sentence is finished. Streaming LLM output hands tokens to the next stage as they are generated, rather than waiting for the complete response. Streaming TTS begins playing audio as soon as the first sentence of the response is synthesised — before the full response is ready.
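The streaming TTS effect can be illustrated with a toy model: the user hears audio after the first sentence is synthesised, not after the whole response. The naive sentence split and the per-sentence timing are assumptions made for the sketch:

```python
# Hedged sketch of why streaming TTS cuts *perceived* latency: playback can
# start after one sentence, while non-streaming waits for every sentence.

def streaming_tts_schedule(reply: str, tts_ms_per_sentence: float) -> tuple[float, float]:
    """Return (time_to_first_audio_ms, time_to_full_audio_ms).

    Uses a naive ". " split as a stand-in for real sentence detection.
    """
    sentence_count = len([s for s in reply.split(". ") if s])
    time_to_first_audio = tts_ms_per_sentence            # one sentence ready
    time_to_full_audio = sentence_count * tts_ms_per_sentence
    return time_to_first_audio, time_to_full_audio

reply = "Sure, I can help. Your order shipped yesterday. It arrives Friday."
first, full = streaming_tts_schedule(reply, tts_ms_per_sentence=250)
print(first)  # 250.0 ms before the user hears anything (streaming)
print(full)   # 750.0 ms if playback waits for the whole response
```

The same principle applies upstream: partial STT transcripts and streamed LLM tokens let each stage start before the previous one finishes.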
Infrastructure choices also matter: LLM inference on GPUs with low-latency APIs, STT models optimised for real-time (Deepgram Nova over batch Whisper), TTS with streaming endpoints (ElevenLabs Turbo), and server regions geographically close to users. Achieving sub-700ms end-to-end latency requires optimisation at every layer simultaneously.
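One way to reason about "optimisation at every layer" is as a per-stage budget that must sum to under 700ms. The allocations below are assumptions chosen to illustrate the exercise, not vendor benchmarks:

```python
# An illustrative sub-700 ms latency budget across the layers named above.
# Per-stage allocations are assumptions for this sketch, not measured figures.
budget_ms = {
    "network round trip (nearby region)": 50,
    "streaming STT (final partial transcript)": 150,
    "LLM time-to-first-token (GPU, low-latency API)": 300,
    "streaming TTS time-to-first-audio": 150,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:48s} {ms:>4d} ms")
print(f"{'total':48s} {total:>4d} ms")  # 650 ms, inside the 700 ms target
```

A budget like this makes trade-offs explicit: a slower LLM time-to-first-token has to be bought back from STT, TTS, or network placement.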
Why it matters
High latency breaks the illusion of natural conversation. Users interpret pauses as confusion, system failure, or incompetence. Latency optimisation is where the majority of production voice AI engineering effort goes — it's the difference between a demo and a deployable system.
Related service
Working with Latency?
I offer AI Integration & Agentic Workflows for businesses ready to move from understanding to implementation.
Learn about AI Integration & Agentic Workflows →