Overview

The WebSocket API enables real-time, chunked audio streaming for low-latency TTS generation. Audio is delivered incrementally as base64-encoded chunks, so playback can begin before generation completes. It accepts the same generation parameters as the REST TTS endpoint but returns a stream of chunks rather than a single response.

Connection

Connect to the WebSocket endpoint with your API key:
wss://wsapi.deepdub.ai/open
Authentication is handled during the WebSocket handshake via the x-api-key header or query parameter.
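As a sketch, the handshake can be authenticated with the third-party `websockets` package (an assumption; any WebSocket client works). The query-parameter name is assumed here to match the header, and the keyword for passing handshake headers varies between `websockets` releases:

```python
# Sketch: authenticating the WebSocket handshake (assumes the third-party
# `websockets` package; nothing here is prescribed by the API docs beyond
# the URL and the x-api-key header).
import json

WS_URL = "wss://wsapi.deepdub.ai/open"
API_KEY = "dd-00000000000000000000000065c9cbfe"  # placeholder key

# Option 1: send the key as a handshake header.
auth_headers = {"x-api-key": API_KEY}

# Option 2: pass the key as a query parameter (param name assumed to
# match the header name).
ws_url_with_key = f"{WS_URL}?x-api-key={API_KEY}"

async def send_request(payload: dict) -> None:
    import websockets  # pip install websockets
    # Recent websockets releases take `additional_headers`;
    # older releases used `extra_headers`.
    async with websockets.connect(WS_URL, additional_headers=auth_headers) as ws:
        await ws.send(json.dumps(payload))
```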

Request format

Send a JSON message over the WebSocket connection with the following fields:
action
string
default:"text-to-speech"
The type of generation request.
model
string
required
Model ID to use for generation (e.g., dd-etts-3.0).
targetText
string
required
Text to convert to speech.
locale
string
required
Language locale code (e.g., en-US, fr-FR).
voicePromptId
string
required
ID of the voice prompt to use. Supports asset: prefix for built-in voices.
generationId
string
Optional client-provided ID. Auto-generated if not provided.
targetDuration
number
Target audio duration in seconds.
tempo
number
Playback speed multiplier (0.5-2.0).
variance
number
Voice variation level (0.0-1.0).
seed
integer
Random seed for deterministic generation.
temperature
number
Generation temperature (0.0-1.0).
sampleRate
integer
Output sample rate in Hz. Internal generation is 48 kHz, resampled to the requested rate. Defaults to 8000 Hz for mulaw if not specified.
format
string
default:"wav"
Output audio format: wav (default), mp3, opus, mulaw, or s16le. Streaming input with ctx/isFinal only supports wav, s16le, and mulaw.
promptBoost
boolean
Enhance voice prompt characteristics.
superStretch
boolean
Enable super stretch mode for longer audio.
realtime
boolean
Enable real-time priority processing.
cleanAudio
boolean
default:"true"
Apply audio cleanup processing.
autoGain
boolean
Automatically adjust audio gain levels.
accentControl
object
Accent blending parameters. See AccentControl below.
performanceReferencePromptId
string
ID of a performance reference prompt to guide delivery style.

Example request

{
  "action": "text-to-speech",
  "model": "dd-etts-3.0",
  "targetText": "Welcome to Deepdub's real-time text to speech API.",
  "locale": "en-US",
  "voicePromptId": "bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
  "format": "wav",
  "sampleRate": 16000
}

Response format

Audio chunks

Audio is delivered as a series of JSON messages. Each chunk contains a portion of the audio data:
index
integer
Sequential chunk index starting from 0.
generationId
string
The generation ID for this request. Use this to correlate chunks with requests when running multiple generations on the same connection.
data
string
Base64-encoded audio data for this chunk.
isFinished
boolean
true when this is the final chunk of the generation.

Example response stream

Initial acknowledgement:
{
  "data": "",
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "isFinished": false
}
Audio chunks:
{
  "index": 0,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVVVVVVVVVV...",
  "isFinished": false
}
{
  "index": 1,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "HAAYABgAGAAgACAA...",
  "isFinished": false
}
Final chunk:
{
  "index": 2,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "AAAAAAAAAA==",
  "isFinished": true
}
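A minimal sketch of assembling a stream like the one above on the client side: base64-decode each chunk's `data`, keep chunks ordered by `index`, and stop at `isFinished`. The helper name and the tolerance for out-of-order delivery are illustrative, not part of the API:

```python
import base64
import json

def assemble_audio(messages):
    """Collect one generation's chunks into raw audio bytes.

    `messages` is an iterable of JSON strings as received from the socket.
    Chunks are keyed by `index`, so out-of-order delivery is tolerated
    (in practice chunks arrive in order on a single connection). The
    initial acknowledgement has empty `data` and is skipped.
    """
    chunks = {}
    for raw in messages:
        msg = json.loads(raw)
        if "error" in msg:
            raise RuntimeError(f"{msg.get('errorType')}: {msg['error']}")
        if msg.get("data"):
            chunks[msg.get("index", -1)] = base64.b64decode(msg["data"])
        if msg.get("isFinished"):
            break
    return b"".join(chunks[i] for i in sorted(chunks))
```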

Error responses

When an error occurs, the WebSocket sends a JSON error message:
error
string
Human-readable error description.
errorType
string
Error category. One of: RateLimit, MaxExceeded, InsufficientCredits, InvalidInput.
generationId
string
The generation ID, if available.
{
  "error": "Rate limit exceeded",
  "errorType": "RateLimit",
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9"
}
Error type           Description
RateLimit            Too many concurrent requests. Reduce request frequency.
MaxExceeded          Maximum generation minutes reached for your plan.
InsufficientCredits  Account has insufficient credits. Top up your balance.
InvalidInput         Invalid request parameters. Check your request body.
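These categories suggest different recovery strategies. A hedged sketch (the retry/abort split below is an assumption, not prescribed by the API):

```python
# Sketch: reacting to an error message by category. Which errors are
# worth retrying is this example's assumption, not an API guarantee.
import json

RETRYABLE = {"RateLimit"}  # back off, then retry the request
FATAL = {"MaxExceeded", "InsufficientCredits", "InvalidInput"}  # needs user action

def classify_error(raw: str) -> str:
    """Return 'retry', 'abort', or 'unknown' for an error message."""
    msg = json.loads(raw)
    kind = msg.get("errorType", "")
    if kind in RETRYABLE:
        return "retry"
    if kind in FATAL:
        return "abort"
    return "unknown"
```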

Accent control

Blend accents between two locales using the accentControl object:
{
  "accentControl": {
    "accentBaseLocale": "en-US",
    "accentLocale": "fr-FR",
    "accentRatio": 0.75
  }
}
Field             Type    Description
accentBaseLocale  string  Base accent locale (e.g., en-US)
accentLocale      string  Target accent to blend (e.g., fr-FR)
accentRatio       number  Blend ratio from 0.0 (base only) to 1.0 (target only)
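As an illustration, a hypothetical helper that attaches accent blending to a request payload and validates the ratio range (`with_accent` is not part of any SDK):

```python
# Sketch: merging an accentControl object into a request payload.
# The helper and its validation are illustrative only.
def with_accent(payload: dict, base: str, target: str, ratio: float) -> dict:
    """Return a copy of `payload` with accent blending attached."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("accentRatio must be between 0.0 and 1.0")
    return {
        **payload,
        "accentControl": {
            "accentBaseLocale": base,
            "accentLocale": target,
            "accentRatio": ratio,
        },
    }
```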

Supported output formats

Audio chunks are delivered as base64-encoded data in JSON messages.
Format  Standard requests  Streaming input (ctx/isFinal)
wav     Yes (default)      Yes
mp3     Yes                No
opus    Yes                No
mulaw   Yes                Yes
s16le   Yes                Yes
Streaming input with ctx/isFinal only supports wav, s16le, and mulaw formats.
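This constraint can be checked client-side before a request is sent; a small sketch (the helper is illustrative):

```python
# Sketch: client-side check of the format rules described above.
ALL_FORMATS = {"wav", "mp3", "opus", "mulaw", "s16le"}
STREAMING_INPUT_FORMATS = {"wav", "s16le", "mulaw"}  # ctx/isFinal input

def validate_format(fmt: str, streaming_input: bool = False) -> None:
    """Raise ValueError if `fmt` is invalid for the request type."""
    if fmt not in ALL_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if streaming_input and fmt not in STREAMING_INPUT_FORMATS:
        raise ValueError(f"{fmt} is not supported with streaming input (ctx/isFinal)")
```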

Sample rates

The internal generation runs at 48 kHz and is resampled to the requested rate. If no sample rate is specified, mulaw defaults to 8000 Hz.

REST vs WebSocket comparison

Feature                        REST API                                       WebSocket API
Delivery                       Streaming HTTP response (chunked audio bytes)  Chunked audio as base64-encoded JSON messages
Formats                        mp3, opus, mulaw                               wav (default), mp3, opus, mulaw, s16le
Streaming input (ctx/isFinal)  Not supported                                  wav, s16le, mulaw only
Default format                 mp3                                            wav
Default mulaw sample rate      8000 Hz                                        8000 Hz
Best for                       Simple integrations, file generation           Real-time playback, low-latency applications

Code examples

Python

import asyncio
from deepdub import DeepdubClient

client = DeepdubClient(api_key="dd-00000000000000000000000065c9cbfe")

async def streaming_tts():
    audio_data = bytearray()
    async with client.async_connect() as conn:
        async for chunk in conn.async_tts(
            text="Hello, this is streamed text input.",
            voice_prompt_id="bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
            model="dd-etts-3.0",
            locale="en-US",
            format="wav",
            sample_rate=16000,
        ):
            audio_data.extend(chunk)
            print(f"Received chunk: {len(chunk)} bytes")

    with open("output.wav", "wb") as f:
        f.write(audio_data)
    print(f"Total audio: {len(audio_data)} bytes")

asyncio.run(streaming_tts())

JavaScript

const { DeepdubClient } = require("@deepdub/node");
const fs = require("fs");

async function streamingTts() {
  const deepdub = new DeepdubClient("dd-00000000000000000000000065c9cbfe");
  await deepdub.connect();

  const chunks = [];
  for await (const chunk of deepdub.streamTts("Hello, this is streamed text input.", {
    locale: "en-US",
    voicePromptId: "bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
    model: "dd-etts-3.0",
    format: "wav",
    sampleRate: 16000,
  })) {
    chunks.push(chunk);
    console.log(`Received chunk: ${chunk.length} bytes`);
  }

  const audio = Buffer.concat(chunks);
  fs.writeFileSync("output.wav", audio);
  console.log(`Total audio: ${audio.length} bytes`);

  deepdub.disconnect();
}

streamingTts();