
WebSocket protocol

Connect to the streaming API and exchange audio, control messages, and transcript results over WebSocket.

The SubQ streaming API uses WebSocket for bidirectional communication. Your client sends binary audio frames and JSON control messages. The server responds with JSON messages that contain transcripts, metadata, and events.

Why WebSocket?

A traditional REST API follows a request-response pattern: you send a request, wait for the response, and the connection closes. This works well for batch transcription where you upload a complete audio file, but it doesn't work for real-time streaming.

Streaming speech-to-text requires a persistent, two-way connection. Your application needs to send audio continuously while receiving transcript updates at the same time. WebSocket provides this bidirectional channel over a single connection: your client pushes audio frames in one direction while the server streams transcript results back in the other.

Connect to the API

To get started, you connect to one of the following endpoints:

| Endpoint | Accepts |
| --- | --- |
| wss://stt-api.subq.ai/v1/listen | Encoded audio (MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A) and raw PCM |
| wss://stt-api.subq.ai/v1/listen/pcm | Raw PCM only (optimized for low-latency PCM streams) |

To establish a secure connection, authenticate with the Sec-WebSocket-Protocol header:

Sec-WebSocket-Protocol: token, YOUR_SUBQ_API_KEY

After a successful handshake, the server returns 101 Switching Protocols.
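Most WebSocket clients let you set handshake headers directly. As a minimal sketch (the auth_headers helper is illustrative, not part of any official SDK), the header can be built like this:

```python
# Sketch: build the authentication header for the SubQ WebSocket handshake.
# auth_headers is an illustrative helper, not part of any official SDK.

def auth_headers(api_key: str) -> dict:
    """Return the handshake header that authenticates the connection."""
    return {"Sec-WebSocket-Protocol": f"token, {api_key}"}

headers = auth_headers("YOUR_SUBQ_API_KEY")
print(headers["Sec-WebSocket-Protocol"])  # token, YOUR_SUBQ_API_KEY
```

Clients that expose a subprotocol list (such as the Python websockets package) can instead pass subprotocols=["token", api_key], which serializes to the same header value.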

Send client messages

Your client can send two types of messages: binary audio frames and JSON control messages.

Binary audio frames

Send audio data as binary WebSocket frames. The API supports the following formats:

  • Encoded audio: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A (auto-detected)
  • Raw PCM: 16-bit signed little-endian (s16le), with a configurable sample rate through the sample_rate parameter

There's no required chunk size. You can send audio frames as they become available, such as when your microphone produces them. The server buffers and processes audio continuously.
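For raw PCM, each sample must be 16-bit signed little-endian. A minimal sketch of converting float samples (as many audio libraries produce) to s16le bytes using only the Python standard library (float_to_s16le is an illustrative helper):

```python
import struct

def float_to_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed little-endian PCM."""
    clamped = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(clamped)}h", *clamped)

# Each sample becomes two bytes; send the result as one binary WebSocket frame.
frame = float_to_s16le([0.0, 0.5, -0.5, 1.0])
print(len(frame))  # 8
```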

Control messages

You send JSON control messages to manage the stream:

| Message | Description |
| --- | --- |
| {"type": "KeepAlive"} | Prevents the connection from timing out during periods of silence |
| {"type": "Finalize"} | Flushes the server buffer and returns any remaining results as final |
| {"type": "CloseStream"} | Gracefully closes the connection after processing remaining audio |
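Control messages are plain JSON text frames, so any JSON serializer can produce them. A sketch in Python (control is an illustrative helper):

```python
import json

def control(message_type: str) -> str:
    """Serialize a control message as a JSON text frame."""
    return json.dumps({"type": message_type})

keepalive = control("KeepAlive")    # send periodically during long silences
finalize  = control("Finalize")     # flush the server buffer now
close_msg = control("CloseStream")  # graceful shutdown
print(keepalive)  # {"type": "KeepAlive"}
```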

Receive server messages

The server sends the following JSON message types:

Metadata

The server sends metadata once when the connection opens. It contains session information:

{
  "type": "Metadata",
  "request_id": "abc-123",
  "created": "2026-03-04T12:00:00.000000Z",
  "duration": 0.0,
  "channels": 1,
  "model_info": {
    "name": "<model_id>",
    "version": "",
    "arch": "subq-asr"
  }
}

Results

The server sends transcript data continuously as audio is processed:

{
  "type": "Results",
  "channel_index": [0],
  "duration": 1.98,
  "start": 0.00,
  "is_final": false,
  "speech_final": false,
  "channel": {
    "alternatives": [{
      "transcript": "Hello world",
      "confidence": 0.95,
      "words": [
        ["Hello", 0, 320],
        ["world", 320, 640]
      ]
    }]
  }
}

| Field | Description |
| --- | --- |
| is_final | true when the transcript for this segment is stable |
| speech_final | true when the speaker has finished an utterance |
| words | Array of [word, start_ms, end_ms]. Timestamps are in milliseconds |
| confidence | Confidence score (0–1) |
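A sketch of pulling the transcript and word timings out of a Results message, converting the millisecond timestamps to seconds (parse_results is an illustrative helper):

```python
import json

def parse_results(raw: str):
    """Return (transcript, confidence, words, is_final) from a Results message.

    Word timestamps are converted from milliseconds to seconds.
    """
    msg = json.loads(raw)
    alt = msg["channel"]["alternatives"][0]
    words = [(w, start / 1000.0, end / 1000.0) for w, start, end in alt["words"]]
    return alt["transcript"], alt["confidence"], words, msg["is_final"]

sample = """{"type": "Results", "is_final": false, "channel":
  {"alternatives": [{"transcript": "Hello world", "confidence": 0.95,
   "words": [["Hello", 0, 320], ["world", 320, 640]]}]}}"""
text, conf, words, final = parse_results(sample)
print(text, words[1])  # Hello world ('world', 0.32, 0.64)
```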

SpeechStarted

When the server detects voice activity, it sends SpeechStarted if you have set vad_events=true:

{
  "type": "SpeechStarted",
  "channel": [0],
  "timestamp": 0.0
}

UtteranceEnd

When a silence threshold is reached, the server sends UtteranceEnd. This requires setting the utterance_end_ms parameter and enabling word timestamps:

{
  "type": "UtteranceEnd",
  "channel": [0],
  "last_word_end": 2.5
}
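Because every server message carries a type field, a receive loop can dispatch on it. A minimal sketch (handle_message is illustrative; a real client would update application state rather than return strings):

```python
import json

def handle_message(raw: str) -> str:
    """Dispatch a server message on its `type` field (sketch)."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "Metadata":
        return f"session {msg['request_id']} opened"
    if kind == "Results":
        text = msg["channel"]["alternatives"][0]["transcript"]
        label = "final" if msg["is_final"] else "interim"
        return f"{label}: {text}"
    if kind == "SpeechStarted":
        return f"speech started at {msg['timestamp']}s"
    if kind == "UtteranceEnd":
        return f"utterance ended after word at {msg['last_word_end']}s"
    return f"unhandled message type: {kind}"

print(handle_message('{"type": "UtteranceEnd", "channel": [0], "last_word_end": 2.5}'))
# utterance ended after word at 2.5s
```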

Configure query parameters

You can append these parameters to the WebSocket URL to configure the stream:

| Parameter | Default | Description |
| --- | --- | --- |
| encoding | Auto-detect | Audio format: pcm, mp3, aac, flac, wav, ogg, webm, opus, m4a. |
| sample_rate | 16000 | Sample rate in Hz. Applies to PCM audio only. |
| interim_results | true | Send partial transcripts while audio is streaming. |
| endpointing | Server default | Sentence finalization delay in milliseconds, or false to disable. |
| utterance_end_ms | - | Silence duration (in milliseconds) that triggers an UtteranceEnd event. |
| vad_events | false | Send SpeechStarted events when voice activity is detected. |
| language | en | Language code: en, es, or auto. |
| keywords | - | Keyword boosting. This parameter is repeatable. |
| redact | - | PII redaction mode: pii, pci, numbers, or true. |
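A sketch of assembling the connection URL with query parameters; list values expand into repeated parameters, which covers the repeatable keywords case (build_url is an illustrative helper):

```python
from urllib.parse import urlencode

def build_url(base: str = "wss://stt-api.subq.ai/v1/listen", **params) -> str:
    """Append query parameters to the streaming endpoint URL.

    List values are expanded into repeated parameters (e.g. keywords).
    """
    pairs = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            pairs.extend((key, item) for item in value)
        else:
            pairs.append((key, value))
    return f"{base}?{urlencode(pairs)}"

url = build_url(encoding="pcm", sample_rate=16000, keywords=["SubQ", "Opus"])
print(url)
# wss://stt-api.subq.ai/v1/listen?encoding=pcm&sample_rate=16000&keywords=SubQ&keywords=Opus
```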