Audio Generation Node - Builder Studio Docs

Audio generation produces spoken voice on the canvas through ElevenLabs. It covers two voice modes: text-to-speech, which narrates a prompt with a chosen voice, and speech-to-speech, which re-voices an input recording while preserving its delivery. Both run through the shared /api/generate-audio route and store the result as durable media.

Two voice nodes back this route

In the canvas these are the Text to Speech and Speech to Speech nodes. Both are locked to ElevenLabs and share the audio generation route. For non-voice audio, use SFX Generation. To store or replay an existing file, use the Audio node.

What it does

Text-to-speech — converts a prompt into narrated audio with a selected voice and model.
Speech-to-speech — converts an input recording into the same content spoken by a target voice. The Speech to Speech node reads its audio from a connected audio-in input.
Decodes the returned audio, stores it durably, and emits it on the audio-out port.

Provider and defaults

Audio generation is bound to ElevenLabs. When the node does not specify a voice or model, the route applies these defaults.

Setting	Default	Applies to
Voice	`EXAVITQu4vr4xnSDxMaL` (Sarah)	Text-to-speech and speech-to-speech
Text-to-speech model	`eleven_multilingual_v2`	Text-to-speech
Speech-to-speech model	`eleven_multilingual_sts_v2`	Speech-to-speech
Output format	`audio/mpeg` (MP3)	All modes
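
The fallback behavior in the table can be sketched as a small resolver. This is an illustration only, not the route's actual code: the function name `apply_defaults` is hypothetical, and the `mode`, `voice`, and `model` field names are taken from the request bodies shown on this page.

```python
# Defaults copied from the table above.
DEFAULT_VOICE = "EXAVITQu4vr4xnSDxMaL"  # Sarah
DEFAULT_MODELS = {
    "tts": "eleven_multilingual_v2",
    "sts": "eleven_multilingual_sts_v2",
}
OUTPUT_MIME = "audio/mpeg"  # MP3, used for all modes


def apply_defaults(body: dict) -> dict:
    """Return a copy of the request body with missing voice/model filled in.

    Hypothetical helper: how the real route resolves defaults is not documented here.
    """
    mode = body.get("mode")
    if mode not in DEFAULT_MODELS:
        raise ValueError(f"unsupported mode: {mode!r}")
    resolved = dict(body)
    resolved["voice"] = body.get("voice") or DEFAULT_VOICE
    resolved["model"] = body.get("model") or DEFAULT_MODELS[mode]
    return resolved
```

A body that names its own voice or model keeps those values; only missing or empty fields are filled.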

Inputs

Prompt / text — required for text-to-speech. The route accepts either text or prompt.
Audio input — required for speech-to-speech, supplied as inline base64 (audioBase64) or a public HTTPS URL (audioUrl). The Speech to Speech node resolves this from its connected audio input.
Voice — an ElevenLabs voice id. The route enforces voice usage access when shared credentials are used.
Voice settings — optional stability, similarity boost, style, speed, and speaker boost, each clamped to its valid range.
Text-to-speech extras — language code, seed, surrounding text for continuity, request id stitching, pronunciation dictionaries, and text-normalization options.
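
The required-input rules above can be checked before a request is sent. The sketch below is a client-side approximation under stated assumptions: `validate_inputs` is a hypothetical helper, the field names follow the ones named on this page (`text`, `prompt`, `audioBase64`, `audioUrl`), and the error messages are invented for illustration. It checks only the URL scheme and cannot tell whether a URL is publicly reachable.

```python
from urllib.parse import urlparse


def validate_inputs(body: dict) -> list[str]:
    """Collect human-readable problems with a request body (empty list means OK)."""
    errors = []
    mode = body.get("mode")
    if mode == "tts":
        # The route accepts either `text` or `prompt`.
        text = body.get("text") or body.get("prompt")
        if not isinstance(text, str) or not text.strip():
            errors.append("tts requires a non-empty text or prompt")
    elif mode == "sts":
        audio_b64 = body.get("audioBase64")
        audio_url = body.get("audioUrl")
        if audio_b64:
            pass  # inline audio supplied; nothing more to check in this sketch
        elif audio_url:
            parsed = urlparse(audio_url)
            if parsed.scheme != "https" or not parsed.netloc:
                errors.append("audioUrl must be an HTTPS URL")
        else:
            errors.append("sts requires audioBase64 or audioUrl")
    else:
        errors.append("mode must be 'tts' or 'sts'")
    return errors
```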

Voice settings are clamped

Stability, similarity boost, and style are clamped to the 0–1 range; speed is clamped to its supported voice-speed range; seed is clamped to a valid integer range. Out-of-range values are corrected rather than rejected.
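
Clamping, as opposed to rejecting, can be illustrated with a short sketch. Only the 0–1 range comes from this page. The speed and seed bounds below are assumptions supplied as overridable parameters, and the setting keys (`similarityBoost` and so on) are hypothetical names, not the route's confirmed schema.

```python
UNIT_RANGE = (0.0, 1.0)               # stated on this page
ASSUMED_SPEED_RANGE = (0.7, 1.2)      # assumption: not specified on this page
ASSUMED_SEED_RANGE = (0, 4294967295)  # assumption: not specified on this page


def clamp(value, low, high):
    """Pull value into [low, high] instead of raising an error."""
    return max(low, min(high, value))


def clamp_voice_settings(settings: dict,
                         speed_range=ASSUMED_SPEED_RANGE,
                         seed_range=ASSUMED_SEED_RANGE) -> dict:
    """Return a corrected copy of the settings; absent keys stay absent."""
    out = dict(settings)
    for key in ("stability", "similarityBoost", "style"):
        if key in out:
            out[key] = clamp(float(out[key]), *UNIT_RANGE)
    if "speed" in out:
        out["speed"] = clamp(float(out["speed"]), *speed_range)
    if "seed" in out:
        out["seed"] = clamp(int(out["seed"]), *seed_range)
    return out
```

With these assumed bounds, a stability of 1.4 becomes 1.0 and a negative seed becomes 0; the request still goes through.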

Outputs

On success the node emits MP3 audio on the audio-out port and stores it as durable media. Downstream nodes consume it like any other audio asset, and it remains reusable across sessions.
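
The decode-and-store step can be pictured roughly as follows. This is a sketch under several assumptions: that the audio arrives as a base64 string, that a local directory stands in for the durable media store, and that files are named by content hash. None of these details are confirmed by this page, and `store_audio` is a hypothetical name.

```python
import base64
import hashlib
from pathlib import Path


def store_audio(audio_base64: str, media_dir: Path) -> Path:
    """Decode base64 MP3 data and write it under media_dir (stand-in for durable storage)."""
    audio_bytes = base64.b64decode(audio_base64, validate=True)
    if not audio_bytes:
        raise ValueError("empty audio payload")
    # Content-addressed name: storing the same audio twice yields the same path.
    digest = hashlib.sha256(audio_bytes).hexdigest()
    media_dir.mkdir(parents=True, exist_ok=True)
    path = media_dir / f"{digest}.mp3"
    path.write_bytes(audio_bytes)
    return path
```

Naming by content hash is one way to make the stored result reusable across sessions without duplicating identical files.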

Generate voice audio

Add a voice node
Add a Text to Speech or Speech to Speech node. The configuration panel opens automatically so you can choose a voice and model.
Provide the source
For text-to-speech, type a prompt or connect a text source. For speech-to-speech, connect an audio input on audio-in.
Tune voice and settings
Pick a voice and adjust stability, similarity, style, or speed. Speech-to-speech can also remove background noise from the source recording.
Run and reuse
Run the node to generate audio. The stored result flows out of audio-out for downstream steps.

Agent and API notes

Both modes post to /api/generate-audio and select the mode with a mode field: "tts" for text-to-speech, "sts" for speech-to-speech. The route is rate-limited, bound to a canvas for credential scoping, and idempotent per provider operation, so a retried run does not double-charge. The examples below show a text-to-speech and a speech-to-speech body.

{
  "mode": "tts",
  "text": "Welcome to Builder Studio.",
  "voice": "EXAVITQu4vr4xnSDxMaL",
  "model": "eleven_multilingual_v2"
}

{
  "mode": "sts",
  "voice": "EXAVITQu4vr4xnSDxMaL",
  "model": "eleven_multilingual_sts_v2",
  "audioUrl": "https://example.com/source-recording.mp3"
}
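
The idempotency guarantee described in this section can be illustrated with a minimal in-memory sketch. How the route actually derives and stores its keys is not documented here. The key scheme (canvas id plus canonical JSON body), the `IdempotentRunner` class, and the `call_provider` callback are all hypothetical.

```python
import hashlib
import json


def idempotency_key(canvas_id: str, body: dict) -> str:
    """Derive a stable key: same canvas and same body always give the same key."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{canvas_id}:{canonical}".encode("utf-8")).hexdigest()


class IdempotentRunner:
    """Calls the provider at most once per key and replays the stored result on retry."""

    def __init__(self, call_provider):
        self._call_provider = call_provider  # hypothetical billable provider call
        self._results = {}

    def run(self, canvas_id: str, body: dict):
        key = idempotency_key(canvas_id, body)
        if key not in self._results:
            self._results[key] = self._call_provider(body)
        return self._results[key]
```

A retried run with an identical body on the same canvas returns the stored result without a second provider call; a different canvas or body produces a new key and a new call.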
