
What is Speech-to-Text (STT)?

Definition

Technology that converts spoken audio into written text, often in real time. It is the listening layer of AI voice agents, delivered by providers and models such as Deepgram and OpenAI Whisper.

In more detail

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), is the technology that converts audio input into a text transcript. In AI voice agent pipelines, STT is the first stage: the caller speaks, the audio is streamed to the STT model, and the transcript is passed to the language model for understanding and response generation.
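As a rough illustration of that hand-off, the sketch below wires a single conversational turn together. It is a minimal sketch, not any provider's API: `run_turn`, `fake_transcribe` and `fake_reply` are hypothetical names, and the "audio" is just UTF-8 text in bytes so the example runs without a real STT or language model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str  # what the STT stage heard
    reply: str       # what the language model stage answered


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
) -> VoiceTurn:
    """Run one caller turn: audio -> transcript -> reply."""
    transcript = transcribe(audio)       # stage 1: speech-to-text
    reply = generate_reply(transcript)   # stage 2: language model
    return VoiceTurn(transcript=transcript, reply=reply)


# Hypothetical stand-ins so the sketch is runnable without external services.
def fake_transcribe(audio: bytes) -> str:
    # Assumption: the "audio" is really UTF-8 text, purely for demonstration.
    return audio.decode("utf-8")


def fake_reply(transcript: str) -> str:
    return f"You said: {transcript}"


turn = run_turn(b"book a table for two", fake_transcribe, fake_reply)
print(turn.transcript)
print(turn.reply)
```

Passing the two stages in as functions keeps the pipeline independent of any one provider, so a different STT service could be swapped in without touching `run_turn`.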

The key providers used in production voice AI include Deepgram (optimised for streaming, low-latency transcription — ideal for real-time voice agents), OpenAI Whisper (highly accurate, open-source, but better suited to batch processing than live streaming), and AssemblyAI (offers additional features like speaker diarisation and topic detection). Provider choice significantly affects end-to-end call latency.

Streaming transcription is critical for low-latency voice agents. Rather than waiting for the caller to finish speaking before transcribing, streaming STT returns partial transcripts as words are spoken — allowing the system to begin processing earlier and reducing the perceived delay in the agent's response. This is one of the key architectural decisions in building a natural-feeling voice AI system.
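The difference between partial and final transcripts can be sketched with a simulated recogniser. This is an assumption-laden toy rather than a real streaming client: `TranscriptEvent`, `stream_transcripts` and `first_event_mentioning` are hypothetical names, and the input is a list of already-recognised words instead of live audio.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class TranscriptEvent:
    text: str       # transcript so far
    is_final: bool  # True only once the caller has finished speaking


def stream_transcripts(words: Iterable[str]) -> Iterator[TranscriptEvent]:
    """Simulate streaming STT: emit a growing partial after every word."""
    heard: List[str] = []
    for word in words:
        heard.append(word)
        yield TranscriptEvent(" ".join(heard), is_final=False)
    # End of speech: emit the complete transcript once, marked final.
    yield TranscriptEvent(" ".join(heard), is_final=True)


def first_event_mentioning(
    events: Iterable[TranscriptEvent], keyword: str
) -> Optional[int]:
    """Return the index of the first event whose text contains the keyword."""
    for index, event in enumerate(events):
        if keyword in event.text.split():
            return index
    return None


events = list(stream_transcripts(["cancel", "my", "order"]))
for event in events:
    print("final  " if event.is_final else "partial", event.text)

# Downstream logic can react to "cancel" on the very first partial,
# well before the final transcript arrives.
early_index = first_event_mentioning(events, "cancel")
```

In this toy run the keyword is visible at event 0, while a batch approach would only expose it at the last event, which is the latency gap streaming is meant to close.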

Why it matters

The quality and latency of STT directly determine how natural an AI voice agent feels to callers. Poor transcription accuracy leads to misunderstood queries; high latency leads to awkward pauses. Getting STT right is foundational to the entire voice AI system.
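Transcription accuracy is commonly quantified as word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The function below is a minimal sketch of that calculation; the name `word_error_rate` and the sample sentences are illustrative, and it lowercases and splits on whitespace without any further text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                       # deletion
                    curr[j - 1] + 1,                   # insertion
                    prev[j - 1] + substitution_cost,   # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Two of the five reference words are misheard ("table", "two").
wer = word_error_rate("book a table for two", "book a cable for you")
print(f"WER: {wer:.2f}")
```

Lower is better, and because insertions count as errors the value can exceed 1.0 when the output contains many extra words.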

Related service

Working with Speech-to-Text?

I offer AI Integration & Agentic Workflows for businesses ready to move from understanding to implementation.
