
Utterance detection

Detect end of speech utterances by configuring the utterance_end_ms parameter.

Utterance detection identifies when a speaker has finished a turn by monitoring for a period of silence. When the silence period exceeds a threshold you specify, the server sends an UtteranceEnd message.

Why do you need utterance detection?

An utterance is a continuous segment of speech: everything a person says before they stop talking. It might be a single sentence, several sentences, or just a few words. The key distinction is the silence that follows — when a speaker pauses long enough, that silence marks the boundary between one utterance and the next.

In a conversation, utterances roughly correspond to "turns." For example, in a voice assistant interaction, the user's question is one utterance and the assistant's response is another. You can use utterance detection to segment conversations, trigger processing between turns, or build turn-based interaction flows.

Word timestamps must be enabled for UtteranceEnd events to fire. They are included by default in Results messages, so no extra configuration is usually needed.

Configure the silence threshold

To enable utterance detection, add the utterance_end_ms parameter to the WebSocket query string, setting it to the silence duration, in milliseconds, that should trigger an utterance end:

wss://stt-api.subq.ai/v1/listen?utterance_end_ms=1000&encoding=mp3

This example fires an UtteranceEnd event after 1 second of silence. Choose a value based on your use case: a shorter threshold such as 500 ms responds quickly but might trigger during natural mid-thought pauses, while a longer threshold such as 1500 ms is more reliable for detecting true turn endings but adds delay.
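Building the connection URL programmatically keeps the threshold configurable. A minimal sketch in Python, using the endpoint and parameters from the example above (adjust the base URL for your deployment):

```python
from urllib.parse import urlencode

def build_listen_url(utterance_end_ms: int, encoding: str = "mp3") -> str:
    """Build the streaming endpoint URL with utterance detection enabled."""
    base = "wss://stt-api.subq.ai/v1/listen"
    # Query parameters are taken from the example above.
    params = urlencode({"utterance_end_ms": utterance_end_ms, "encoding": encoding})
    return f"{base}?{params}"

print(build_listen_url(1000))
# → wss://stt-api.subq.ai/v1/listen?utterance_end_ms=1000&encoding=mp3
```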

UtteranceEnd message

When the silence threshold is reached, the server sends a message like this:

{
  "type": "UtteranceEnd",
  "channel": [0],
  "last_word_end": 2.5
}
| Field | Description |
| --- | --- |
| type | Always "UtteranceEnd" |
| channel | Array indicating which channel detected the utterance end |
| last_word_end | Timestamp (in seconds) of the last detected word |
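A client can dispatch on the type field to distinguish UtteranceEnd messages from transcript messages. A sketch of such a handler, assuming messages arrive as JSON text frames (the return values here are illustrative, not part of the API):

```python
import json

def on_message(raw: str):
    """Dispatch a server message by its "type" field.

    Returns a tuple describing what was seen; a real client would
    trigger turn processing here instead.
    """
    msg = json.loads(raw)
    if msg.get("type") == "UtteranceEnd":
        # The speaker's turn ended; last_word_end marks the final word.
        return ("utterance_end", msg["channel"][0], msg["last_word_end"])
    if msg.get("type") == "Results":
        # Interim or final transcript segment; handled elsewhere.
        return ("results", None, None)
    return ("unknown", None, None)

event = on_message('{"type": "UtteranceEnd", "channel": [0], "last_word_end": 2.5}')
# → ("utterance_end", 0, 2.5)
```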

Utterance detection vs. endpointing

Both utterance detection and endpointing respond to silence, but they serve different purposes. Endpointing finalizes individual transcript segments or sentences within a turn; utterance detection tells you when the entire turn is over. You can use both together: endpointing finalizes each segment as the speaker talks, and utterance detection signals the end of the turn.

| Feature | Purpose | Output |
| --- | --- | --- |
| Endpointing | Finalizes a transcript segment (is_final: true) | Updated Results message |
| Utterance detection | Signals that a speaker finished a turn | Separate UtteranceEnd message |
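Using both features together can be sketched as a small accumulator: finalized segments from endpointing are buffered, and an UtteranceEnd flushes them as one complete turn. The Results payload shape below (is_final, channel.alternatives[0].transcript) is an assumption for illustration; check the Results message reference for the exact fields:

```python
import json

class TurnAssembler:
    """Buffer finalized segments and emit a full turn on UtteranceEnd."""

    def __init__(self):
        self.segments = []  # finalized segments within the current turn
        self.turns = []     # completed turns, one string each

    def feed(self, raw: str):
        msg = json.loads(raw)
        if msg.get("type") == "Results" and msg.get("is_final"):
            # Endpointing finalized a segment; keep it for the current turn.
            text = msg["channel"]["alternatives"][0]["transcript"]
            if text:
                self.segments.append(text)
        elif msg.get("type") == "UtteranceEnd" and self.segments:
            # The turn is over; join its segments into one utterance.
            self.turns.append(" ".join(self.segments))
            self.segments = []
```

Interim results (is_final absent or false) are ignored here, so each turn contains every sentence exactly once.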