POST /tts
Generate and stream TTS audio
curl --request POST \
  --url https://restapi.deepdub.ai/api/v1/tts \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "model": "dd-etts-3.0",
  "targetText": "Hello world, welcome to Deepdub.",
  "locale": "en-US",
  "voicePromptId": "bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773"
}
'

Example error response (returned when targetText is missing):

{
  "success": false,
  "message": "Invalid request: missing required field 'targetText'"
}

Supported output formats

The REST API streams audio as raw bytes in the HTTP response body. Supported formats:
| Format | Description |
| mp3 | Compressed audio, smallest file size. Default. |
| opus | High-quality compressed audio, efficient for streaming. |
| mulaw | 8-bit µ-law encoding, commonly used in telephony. Defaults to 8000 Hz if no sample rate is specified. |
The REST API supports mp3, opus, and mulaw only. For wav or s16le output, use the WebSocket API.
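As a rough illustration of consuming the streamed response, the following Python sketch builds the same request as the curl example and writes the audio bytes to a file as they arrive. It uses only the standard library. The helper names (`build_tts_request`, `stream_to_file`) and the chunk size are illustrative choices, not part of the API.

```python
import json
import urllib.request

TTS_URL = "https://restapi.deepdub.ai/api/v1/tts"
# Formats the REST endpoint accepts, per the table above.
REST_FORMATS = {"mp3", "opus", "mulaw"}


def build_tts_request(api_key, text, locale, voice_prompt_id, fmt="mp3"):
    """Build (but do not send) a POST request for the REST TTS endpoint."""
    if fmt not in REST_FORMATS:
        raise ValueError(f"REST supports {sorted(REST_FORMATS)}; got {fmt!r}")
    payload = {
        "model": "dd-etts-3.0",
        "targetText": text,
        "locale": locale,
        "voicePromptId": voice_prompt_id,
        "format": fmt,
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def stream_to_file(request, path, chunk_size=8192):
    """Send the request and write audio bytes to disk as they arrive."""
    total = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # read() returns b"" at end of stream, which ends the loop.
        while chunk := response.read(chunk_size):
            out.write(chunk)
            total += len(chunk)
    return total
```

A call such as `stream_to_file(build_tts_request(key, "Hello", "en-US", voice_id), "out.mp3")` would then save the audio. Rejecting wav and s16le on the client side avoids a round trip for a format that only the WebSocket API serves.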

Sample rates

The requested sample rate (the sampleRate parameter) is passed through to the audio conversion layer: audio is generated internally at 48 kHz and then resampled to the requested rate. If no sample rate is specified, mulaw defaults to 8000 Hz.
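Because µ-law stores one byte per sample, the duration of a mulaw response can be estimated from its size alone. The sketch below assumes a raw, headerless, single-channel stream. The channel count is an assumption, since this page does not state it.

```python
def mulaw_duration_seconds(num_bytes, sample_rate=8000, channels=1):
    """Estimate playback time of raw mu-law audio (1 byte per sample)."""
    if sample_rate <= 0 or channels <= 0:
        raise ValueError("sample_rate and channels must be positive")
    # Assumption: no container header and `channels` interleaved channels.
    return num_bytes / (sample_rate * channels)
```

For example, 16000 bytes at the default 8000 Hz corresponds to about two seconds of audio.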

REST vs WebSocket comparison

| Feature | REST API | WebSocket API |
| Delivery | Streaming HTTP response (chunked audio bytes) | Chunked audio delivered incrementally as base64-encoded JSON messages |
| Formats | mp3, opus, mulaw | wav (default), mp3, opus, mulaw, s16le |
| Streaming input (ctx/isFinal) | Not supported | wav, s16le, mulaw only |
| Default format | mp3 | wav |
| Default mulaw sample rate | 8000 Hz | 8000 Hz |
| Best for | Simple integrations, file generation | Real-time playback, low-latency applications |
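The format rules in the comparison can be encoded as a small client-side check. This is only a sketch: `pick_transport` is a hypothetical helper, and preferring REST whenever both APIs support the format is a design assumption, not a recommendation from this page.

```python
REST_FORMATS = frozenset({"mp3", "opus", "mulaw"})
WEBSOCKET_FORMATS = frozenset({"wav", "mp3", "opus", "mulaw", "s16le"})
# Formats usable with streaming input (ctx/isFinal), WebSocket only.
STREAMING_INPUT_FORMATS = frozenset({"wav", "s16le", "mulaw"})


def pick_transport(fmt, streaming_input=False):
    """Return "rest" or "websocket" for the requested output format."""
    if fmt not in WEBSOCKET_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if streaming_input:
        if fmt not in STREAMING_INPUT_FORMATS:
            raise ValueError(f"{fmt!r} is not available with streaming input")
        return "websocket"
    # Assumption: prefer the simpler REST API when it supports the format.
    return "rest" if fmt in REST_FORMATS else "websocket"
```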

Authorizations

x-api-key
string
header
required

API key for authentication. Must start with the dd- prefix.

Headers

x-api-key
string
default:dd-00000000000000000000000065c9cbfe
required

API Key

Body

application/json

Request structure for TTS generation endpoints.

Optional parameters (not shown in playground):

  generationId (string)
  targetDuration (number, seconds)
  tempo (number, 0.5–2.0)
  variance (number, 0.0–1.0)
  seed (integer)
  temperature (number, 0.0–1.0)
  sampleRate (integer)
  format (string: mp3, opus, or mulaw; default mp3)
  promptBoost (boolean)
  superStretch (boolean)
  realtime (boolean)
  cleanAudio (boolean, default true)
  autoGain (boolean)
  publish (boolean)
  accentControl (object with accentBaseLocale, accentLocale, accentRatio)
  performanceReferencePromptId (string)
  voiceReference (string, base64-encoded audio)
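The documented ranges for tempo, variance, and temperature can be checked before a request is sent. The following sketch merges optional parameters into a request body. `with_optional_params` is a hypothetical helper, and treating the bounds as inclusive is an assumption; the server remains the authority on what it accepts.

```python
# Numeric ranges as listed above; inclusive bounds are an assumption.
RANGES = {
    "tempo": (0.5, 2.0),
    "variance": (0.0, 1.0),
    "temperature": (0.0, 1.0),
}


def with_optional_params(payload, **options):
    """Return a copy of payload with range-checked optional parameters."""
    merged = dict(payload)
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be in [{low}, {high}]; got {value}")
        merged[name] = value
    return merged
```

For instance, `with_optional_params(base, tempo=1.2, seed=42, format="opus")` returns a new body and leaves `base` unchanged.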

model
string
default:dd-etts-3.0
required

Model ID to use for generation

Example:

"dd-etts-3.0"

targetText
string
required

Text to be converted to speech

Example:

"Hello world, welcome to Deepdub."

locale
string
required

Language locale code (e.g., en-US, fr-FR)

Example:

"en-US"

voicePromptId
string
required

ID of the voice prompt to use for generation

Example:

"bd1b00bb-be1c-4679-8eaa-0fcbfd4ff773"

Response

Audio stream in the requested format (MP3, Opus, or mulaw, depending on the format parameter). The response body is raw audio bytes.
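Since a successful body is raw audio while a failure carries JSON like the error example near the top of this page, a client may want to tell the two apart. The sketch below is a minimal approach, assuming errors always use the `success`/`message` shape shown in that example. `extract_error_message` is a hypothetical helper name.

```python
import json


def extract_error_message(body):
    """Return the error message from a JSON error body, else None."""
    try:
        parsed = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not UTF-8 JSON, so most likely audio bytes.
        return None
    if isinstance(parsed, dict) and parsed.get("success") is False:
        return str(parsed.get("message", ""))
    return None
```

With `urllib.request`, a non-2xx status raises `urllib.error.HTTPError`, and the error body is available from the exception's `read()` method. Which status codes this endpoint returns is not stated here.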