Overview
The WebSocket API enables real-time, chunked audio streaming for low-latency TTS generation. Audio data is delivered incrementally as base64-encoded chunks, allowing playback to begin before the full generation is complete.
The WebSocket API uses the same generation parameters as the REST TTS endpoint, but delivers audio as a stream of chunks rather than a single response.
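As a rough illustration of consuming base64-encoded chunks, the sketch below reassembles audio bytes from a sequence of JSON messages. The key names `index`, `audio`, and `final` are placeholders chosen for this example, not confirmed field names; pass whatever keys the actual chunk messages use. It also assumes that decoded chunks can simply be concatenated in index order.

```python
import base64
import json


def assemble_audio(messages, index_key="index", data_key="audio", final_key="final"):
    """Reassemble audio bytes from an iterable of JSON message strings.

    The default key names are hypothetical placeholders, not the API's
    documented field names. Returns (audio_bytes, saw_final_chunk).
    """
    chunks = {}
    saw_final = False
    for raw in messages:
        msg = json.loads(raw)
        if data_key not in msg:
            # Skip messages that carry no audio, e.g. an acknowledgement.
            continue
        chunks[msg[index_key]] = base64.b64decode(msg[data_key])
        if msg.get(final_key):
            saw_final = True
            break
    # Order by chunk index in case messages were collected out of order.
    audio = b"".join(chunks[i] for i in sorted(chunks))
    return audio, saw_final
```

For real-time playback, each decoded chunk would be handed to the audio player as it arrives instead of being buffered until the final chunk.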
Connection
Connect to the WebSocket endpoint and authenticate with your API key, supplied either in the x-api-key header or as a query parameter.
Request format
Send a JSON message on the WebSocket connection:
The type of generation request.
Model ID to use for generation (e.g., dd-etts-3.0).
Text to convert to speech.
Language locale code (e.g., en-US, fr-FR).
ID of the voice prompt to use. Supports asset: prefix for built-in voices.
Optional client-provided ID. Auto-generated if not provided.
Target audio duration in seconds.
Playback speed multiplier (0.5-2.0).
Voice variation level (0.0-1.0).
Random seed for deterministic generation.
Generation temperature (0.0-1.0).
Output sample rate in Hz. Internal generation is 48 kHz, resampled to the requested rate. Defaults to 8000 Hz for mulaw if not specified.
Output audio format: wav (default), mp3, opus, mulaw, or s16le. Streaming input with ctx/isFinal only supports wav, s16le, and mulaw.
Enhance voice prompt characteristics.
Enable super stretch mode for longer audio.
Enable real-time priority processing.
Apply audio cleanup processing.
Automatically adjust audio gain levels.
Accent blending parameters. See Accent control below.
ID of a performance reference prompt to guide delivery style.
Example request
Response format
Audio chunks
Audio is delivered as a series of JSON messages. Each chunk contains a portion of the audio data:
Sequential chunk index starting from 0.
The generation ID for this request. Use this to correlate chunks with requests when running multiple generations on the same connection.
Base64-encoded audio data for this chunk.
true when this is the final chunk of the generation.
Example response stream
Initial acknowledgement:
Error responses
When an error occurs, the WebSocket sends a JSON error message:
Human-readable error description.
Error category. One of: RateLimit, MaxExceeded, InsufficientCredits, InvalidInput.
The generation ID, if available.
| Error type | Description |
|---|---|
| RateLimit | Too many concurrent requests. Reduce request frequency. |
| MaxExceeded | Maximum generation minutes reached for your plan. |
| InsufficientCredits | Account has insufficient credits. Top up your balance. |
| InvalidInput | Invalid request parameters. Check your request body. |
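One way a client might react to these categories is sketched below. The mapping from error type to action and the backoff parameters are assumptions made for illustration; the documentation only states what each error means, not how clients should respond.

```python
# Error categories from the table above, mapped to a suggested client action.
# The action names are hypothetical labels chosen for this sketch.
ERROR_ACTIONS = {
    "RateLimit": "retry_later",      # reduce request frequency, then retry
    "MaxExceeded": "stop",           # plan limit reached; retrying will not help
    "InsufficientCredits": "stop",   # top up the balance before retrying
    "InvalidInput": "fix_request",   # correct the request body
}


def action_for_error(error_type: str) -> str:
    """Return the suggested action, treating unknown categories as fatal."""
    return ERROR_ACTIONS.get(error_type, "stop")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff in seconds for RateLimit retries (assumed policy)."""
    return min(cap, base * (2 ** attempt))
```

A client could call `action_for_error` with the error category from the message and sleep for `backoff_delay(attempt)` seconds before resending when the result is `retry_later`.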
Accent control
Blend accents between two locales using the accentControl object:
| Field | Type | Description |
|---|---|---|
| accentBaseLocale | string | Base accent locale (e.g., en-US) |
| accentLocale | string | Target accent to blend (e.g., fr-FR) |
| accentRatio | number | Blend ratio from 0.0 (base only) to 1.0 (target only) |
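A small helper for building this object might look like the following sketch. The function name `make_accent_control` is hypothetical, and the client-side range check is an assumption; the field names and the 0.0 to 1.0 range come from the table above.

```python
def make_accent_control(base_locale: str, target_locale: str, ratio: float) -> dict:
    """Build an accentControl object, validating the blend ratio locally."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("accentRatio must be between 0.0 and 1.0")
    return {
        "accentBaseLocale": base_locale,  # base accent, e.g. "en-US"
        "accentLocale": target_locale,    # target accent, e.g. "fr-FR"
        "accentRatio": ratio,             # 0.0 = base only, 1.0 = target only
    }
```

For example, `make_accent_control("en-US", "fr-FR", 0.3)` describes a mostly en-US voice with a light fr-FR blend.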
Supported output formats
Audio chunks are delivered as base64-encoded data in JSON messages.
| Format | Standard requests | Streaming input (ctx/isFinal) |
|---|---|---|
| wav | Yes (default) | Yes |
| mp3 | Yes | No |
| opus | Yes | No |
| mulaw | Yes | Yes |
| s16le | Yes | Yes |
Streaming input with ctx/isFinal only supports wav, s16le, and mulaw formats.
Sample rates
The internal generation runs at 48 kHz and is resampled to the requested rate. If no sample rate is specified, mulaw defaults to 8000 Hz.
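The format and sample-rate rules above can be checked on the client before sending a request. The sketch below is one possible pre-flight check; the function name `resolve_output` is hypothetical, and it returns `None` as the sample rate for formats other than mulaw because no default is stated for them here.

```python
ALL_FORMATS = {"wav", "mp3", "opus", "mulaw", "s16le"}
STREAMING_INPUT_FORMATS = {"wav", "s16le", "mulaw"}  # usable with ctx/isFinal
DEFAULT_FORMAT = "wav"
MULAW_DEFAULT_RATE = 8000  # Hz


def resolve_output(fmt=None, sample_rate=None, streaming_input=False):
    """Return (format, sample_rate) after applying the documented rules."""
    fmt = fmt or DEFAULT_FORMAT
    if fmt not in ALL_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if streaming_input and fmt not in STREAMING_INPUT_FORMATS:
        raise ValueError(f"{fmt} is not supported with streaming input")
    if sample_rate is None and fmt == "mulaw":
        sample_rate = MULAW_DEFAULT_RATE
    # For other formats the default rate is left unspecified (None).
    return fmt, sample_rate
```

Rejecting an unsupported combination locally avoids a round trip that would presumably end in an InvalidInput error.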
REST vs WebSocket comparison
| Feature | REST API | WebSocket API |
|---|---|---|
| Delivery | Streaming HTTP response (chunked audio bytes) | Chunked audio delivered incrementally as base64-encoded JSON messages |
| Formats | mp3, opus, mulaw | wav (default), mp3, opus, mulaw, s16le |
| Streaming input (ctx/isFinal) | Not supported | wav, s16le, mulaw only |
| Default format | mp3 | wav |
| Default mulaw sample rate | 8000 Hz | 8000 Hz |
| Best for | Simple integrations, file generation | Real-time playback, low-latency applications |
