Overview

The WebSocket API enables real-time, chunked audio streaming for low-latency TTS generation. Audio is delivered incrementally as base64-encoded chunks, so playback can begin before generation completes. It accepts the same generation parameters as the REST TTS endpoint but returns a stream of chunks rather than a single response.

Connection

Connect to the WebSocket endpoint with your API key:
wss://wsapi.deepdub.ai/open
Authentication is handled during the WebSocket handshake via the x-api-key header or query parameter.
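As a sketch, the handshake can be authenticated with the third-party `websockets` package (an assumption; any WebSocket client works). The query-parameter name is assumed here to match the header, and the keyword for passing handshake headers varies between `websockets` releases:

```python
# Sketch: authenticating the WebSocket handshake (assumes the third-party
# `websockets` package; nothing here is prescribed by the API docs beyond
# the URL and the x-api-key header).
import json

WS_URL = "wss://wsapi.deepdub.ai/open"
API_KEY = "dd-00000000000000000000000065c9cbfe"  # placeholder key

# Option 1: send the key as a handshake header.
auth_headers = {"x-api-key": API_KEY}

# Option 2: pass the key as a query parameter (param name assumed to
# match the header name).
ws_url_with_key = f"{WS_URL}?x-api-key={API_KEY}"

async def send_request(payload: dict) -> None:
    import websockets  # pip install websockets
    # Recent websockets releases take `additional_headers`;
    # older releases used `extra_headers`.
    async with websockets.connect(WS_URL, additional_headers=auth_headers) as ws:
        await ws.send(json.dumps(payload))
```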

Request format

Send a JSON message over the WebSocket connection with the following fields:
action
string
default:"text-to-speech"
The type of generation request.
model
string
required
Model ID to use for generation (e.g., dd-etts-3.0).
targetText
string
required
Text to convert to speech.
locale
string
required
Language locale code (e.g., en-US, fr-FR).
voicePromptId
string
required
ID of the voice prompt to use. Supports asset: prefix for built-in voices.
generationId
string
Optional client-provided ID. Auto-generated if not provided.
targetDuration
number
Target audio duration in seconds.
tempo
number
Playback speed multiplier (0.5-2.0).
variance
number
Voice variation level (0.0-1.0).
seed
integer
Random seed for deterministic generation.
temperature
number
Generation temperature (0.0-1.0).
sampleRate
integer
Output sample rate in Hz. Internal generation is 48 kHz, resampled to the requested rate. Defaults to 8000 Hz for mulaw if not specified.
format
string
default:"wav"
Output audio format: wav (default), mp3, opus, mulaw, or s16le. Streaming input with ctx/isFinal only supports wav, s16le, and mulaw.
promptBoost
boolean
Enhance voice prompt characteristics.
superStretch
boolean
Enable super stretch mode for longer audio.
realtime
boolean
Enable real-time priority processing.
cleanAudio
boolean
default:"true"
Apply audio cleanup processing.
autoGain
boolean
Automatically adjust audio gain levels.
accentControl
object
Accent blending parameters. See AccentControl below.
performanceReferencePromptId
string
ID of a performance reference prompt to guide delivery style.

Example request

{
  "action": "text-to-speech",
  "model": "dd-etts-3.0",
  "targetText": "Welcome to Deepdub's real-time text to speech API.",
  "locale": "en-US",
  "voicePromptId": "bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
  "format": "wav",
  "sampleRate": 16000
}

Response format

Audio chunks

Audio is delivered as a series of JSON messages. Each chunk contains a portion of the audio data:
index
integer
Sequential chunk index starting from 0.
generationId
string
The generation ID for this request. Use this to correlate chunks with requests when running multiple generations on the same connection.
data
string
Base64-encoded audio data for this chunk.
isFinished
boolean
true when this is the final chunk of the generation.

Example response stream

Initial acknowledgement:
{
  "data": "",
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "isFinished": false
}
Audio chunks:
{
  "index": 0,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVVVVVVVVVV...",
  "isFinished": false
}
{
  "index": 1,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "HAAYABgAGAAgACAA...",
  "isFinished": false
}
Final chunk:
{
  "index": 2,
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9",
  "data": "AAAAAAAAAA==",
  "isFinished": true
}
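A minimal sketch of assembling a stream like the one above on the client side: base64-decode each chunk's `data`, keep chunks ordered by `index`, and stop at `isFinished`. The helper name and the tolerance for out-of-order delivery are illustrative, not part of the API:

```python
import base64
import json

def assemble_audio(messages):
    """Collect one generation's chunks into raw audio bytes.

    `messages` is an iterable of JSON strings as received from the socket.
    Chunks are keyed by `index`, so out-of-order delivery is tolerated
    (in practice chunks arrive in order on a single connection). The
    initial acknowledgement has empty `data` and is skipped.
    """
    chunks = {}
    for raw in messages:
        msg = json.loads(raw)
        if "error" in msg:
            raise RuntimeError(f"{msg.get('errorType')}: {msg['error']}")
        if msg.get("data"):
            chunks[msg.get("index", -1)] = base64.b64decode(msg["data"])
        if msg.get("isFinished"):
            break
    return b"".join(chunks[i] for i in sorted(chunks))
```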

Error responses

When an error occurs, the WebSocket sends a JSON error message:
error
string
Human-readable error description.
errorType
string
Error category. One of: RateLimit, MaxExceeded, InsufficientCredits, InvalidInput.
generationId
string
The generation ID, if available.
{
  "error": "Rate limit exceeded",
  "errorType": "RateLimit",
  "generationId": "4da9902b-9141-4fb7-9efb-d616ce266ed9"
}
Error type           Description
RateLimit            Too many concurrent requests. Reduce request frequency.
MaxExceeded          Maximum generation minutes reached for your plan.
InsufficientCredits  Account has insufficient credits. Top up your balance.
InvalidInput         Invalid request parameters. Check your request body.
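These categories suggest different recovery strategies. A hedged sketch (the retry/abort split below is an assumption, not prescribed by the API):

```python
# Sketch: reacting to an error message by category. Which errors are
# worth retrying is this example's assumption, not an API guarantee.
import json

RETRYABLE = {"RateLimit"}  # back off, then retry the request
FATAL = {"MaxExceeded", "InsufficientCredits", "InvalidInput"}  # needs user action

def classify_error(raw: str) -> str:
    """Return 'retry', 'abort', or 'unknown' for an error message."""
    msg = json.loads(raw)
    kind = msg.get("errorType", "")
    if kind in RETRYABLE:
        return "retry"
    if kind in FATAL:
        return "abort"
    return "unknown"
```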

Accent control

Blend accents between two locales using the accentControl object:
{
  "accentControl": {
    "accentBaseLocale": "en-US",
    "accentLocale": "fr-FR",
    "accentRatio": 0.75
  }
}
Field             Type    Description
accentBaseLocale  string  Base accent locale (e.g., en-US)
accentLocale      string  Target accent to blend (e.g., fr-FR)
accentRatio       number  Blend ratio from 0.0 (base only) to 1.0 (target only)
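As an illustration, a hypothetical helper that attaches accent blending to a request payload and validates the ratio range (`with_accent` is not part of any SDK):

```python
# Sketch: merging an accentControl object into a request payload.
# The helper and its validation are illustrative only.
def with_accent(payload: dict, base: str, target: str, ratio: float) -> dict:
    """Return a copy of `payload` with accent blending attached."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("accentRatio must be between 0.0 and 1.0")
    return {
        **payload,
        "accentControl": {
            "accentBaseLocale": base,
            "accentLocale": target,
            "accentRatio": ratio,
        },
    }
```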

Supported output formats

Audio chunks are delivered as base64-encoded data in JSON messages.
Format  Standard requests  Streaming input (ctx/isFinal)
wav     Yes (default)      Yes
mp3     Yes                No
opus    Yes                No
mulaw   Yes                Yes
s16le   Yes                Yes
Streaming input with ctx/isFinal only supports wav, s16le, and mulaw formats.
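This constraint can be checked client-side before a request is sent; a small sketch (the helper is illustrative):

```python
# Sketch: client-side check of the format rules described above.
ALL_FORMATS = {"wav", "mp3", "opus", "mulaw", "s16le"}
STREAMING_INPUT_FORMATS = {"wav", "s16le", "mulaw"}  # ctx/isFinal input

def validate_format(fmt: str, streaming_input: bool = False) -> None:
    """Raise ValueError if `fmt` is invalid for the request type."""
    if fmt not in ALL_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if streaming_input and fmt not in STREAMING_INPUT_FORMATS:
        raise ValueError(f"{fmt} is not supported with streaming input (ctx/isFinal)")
```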

Sample rates

The internal generation runs at 48 kHz and is resampled to the requested rate. If no sample rate is specified, mulaw defaults to 8000 Hz.

REST vs WebSocket comparison

Feature                        REST API                                       WebSocket API
Delivery                       Streaming HTTP response (chunked audio bytes)  Chunked audio as base64-encoded JSON messages
Formats                        mp3, opus, mulaw                               wav (default), mp3, opus, mulaw, s16le
Streaming input (ctx/isFinal)  Not supported                                  wav, s16le, mulaw only
Default format                 mp3                                            wav
Default mulaw sample rate      8000 Hz                                        8000 Hz
Best for                       Simple integrations, file generation           Real-time playback, low-latency applications

Code examples

Python

import asyncio
from deepdub import DeepdubClient

client = DeepdubClient(api_key="dd-00000000000000000000000065c9cbfe")

async def streaming_tts():
    audio_data = bytearray()
    async with client.async_connect() as conn:
        async for chunk in conn.async_tts(
            text="Hello, this is streamed text input.",
            voice_prompt_id="bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
            model="dd-etts-3.0",
            locale="en-US",
            format="wav",
            sample_rate=16000,
        ):
            audio_data.extend(chunk)
            print(f"Received chunk: {len(chunk)} bytes")

    with open("output.wav", "wb") as f:
        f.write(audio_data)
    print(f"Total audio: {len(audio_data)} bytes")

asyncio.run(streaming_tts())

JavaScript

const { DeepdubClient } = require("@deepdub/node");
const fs = require("fs");

async function streamingTts() {
  const deepdub = new DeepdubClient("dd-00000000000000000000000065c9cbfe");
  await deepdub.connect();

  const chunks = [];
  for await (const chunk of deepdub.streamTts("Hello, this is streamed text input.", {
    locale: "en-US",
    voicePromptId: "bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773",
    model: "dd-etts-3.0",
    format: "wav",
    sampleRate: 16000,
  })) {
    chunks.push(chunk);
    console.log(`Received chunk: ${chunk.length} bytes`);
  }

  const audio = Buffer.concat(chunks);
  fs.writeFileSync("output.wav", audio);
  console.log(`Total audio: ${audio.length} bytes`);

  deepdub.disconnect();
}

streamingTts();