
Audio Generation Node

Generate speech with ElevenLabs text-to-speech and speech-to-speech.

Audio generation produces spoken voice on the canvas through ElevenLabs. It covers two voice modes: text-to-speech, which narrates a prompt with a chosen voice, and speech-to-speech, which re-voices an input recording while preserving its delivery. Both run through the shared /api/generate-audio route and store the result as durable media.

Two voice nodes back this route
In the canvas these are the Text to Speech and Speech to Speech nodes. Both are locked to ElevenLabs and share the audio generation route. For non-voice audio, use SFX Generation. To store or replay an existing file, use the Audio node.

What it does

  • Text-to-speech — converts a prompt into narrated audio with a selected voice and model.
  • Speech-to-speech — converts an input recording into the same content spoken by a target voice. The Speech to Speech node reads its audio from a connected audio-in input.
  • Decodes the returned audio, stores it durably, and emits it on the audio-out port.

Provider and defaults

Audio generation is bound to ElevenLabs. When the node does not specify a voice or model, the route applies these defaults.

Setting                  Default                        Applies to
Voice                    EXAVITQu4vr4xnSDxMaL (Sarah)   Text-to-speech and speech-to-speech
Text-to-speech model     eleven_multilingual_v2         Text-to-speech
Speech-to-speech model   eleven_multilingual_sts_v2     Speech-to-speech
Output format            audio/mpeg (MP3)               All modes
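
The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the route's actual code: the default values come from the table, while the function name `apply_defaults` and the choice to treat empty values as "not specified" are assumptions.

```python
# Default values copied from the defaults table; names are illustrative.
DEFAULT_VOICE = "EXAVITQu4vr4xnSDxMaL"  # Sarah
DEFAULT_MODELS = {
    "tts": "eleven_multilingual_v2",
    "sts": "eleven_multilingual_sts_v2",
}
OUTPUT_MIME_TYPE = "audio/mpeg"  # MP3, all modes


def apply_defaults(body: dict) -> dict:
    """Return a copy of the request body with missing voice/model filled in."""
    mode = body.get("mode")
    if mode not in DEFAULT_MODELS:
        raise ValueError(f"unsupported mode: {mode!r}")
    resolved = dict(body)
    # Assumption: absent, None, and empty values all count as "not specified".
    if not resolved.get("voice"):
        resolved["voice"] = DEFAULT_VOICE
    if not resolved.get("model"):
        resolved["model"] = DEFAULT_MODELS[mode]
    return resolved
```

For example, `apply_defaults({"mode": "sts", "audioUrl": "https://example.com/a.mp3"})` would fill in the Sarah voice and the speech-to-speech model, while an explicitly supplied voice is left untouched.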

Inputs

  • Prompt / text — required for text-to-speech. The route accepts either text or prompt.
  • Audio input — required for speech-to-speech, supplied as inline base64 (audioBase64) or a public HTTPS URL (audioUrl). The Speech to Speech node resolves this from its connected audio input.
  • Voice — an ElevenLabs voice id. The route enforces voice usage access when shared credentials are used.
  • Voice settings — optional stability, similarity boost, style, speed, and speaker boost, each clamped to its valid range.
  • Text-to-speech extras — language code, seed, surrounding text for continuity, request id stitching, pronunciation dictionaries, and text-normalization options.
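
The per-mode requirements in this list could be checked before sending a request, roughly as below. This is a client-side sketch under stated assumptions: the field names `text`, `prompt`, `audioBase64`, and `audioUrl` and the mode values `tts` and `sts` come from this page, but the function `validate_inputs` and its error messages are hypothetical, and the route's own validation may differ.

```python
from urllib.parse import urlparse


def validate_inputs(body: dict) -> None:
    """Raise ValueError if the body lacks the source its mode requires."""
    mode = body.get("mode")
    if mode == "tts":
        # The route accepts either field name for the script.
        script = body.get("text") or body.get("prompt")
        if not isinstance(script, str) or not script.strip():
            raise ValueError("tts requires a non-empty 'text' or 'prompt'")
    elif mode == "sts":
        if body.get("audioBase64"):
            return  # inline audio supplied; nothing more to check here
        audio_url = body.get("audioUrl")
        if not audio_url:
            raise ValueError("sts requires 'audioBase64' or 'audioUrl'")
        parsed = urlparse(audio_url)
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("'audioUrl' must be an HTTPS URL")
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
```

The check only confirms that the URL uses HTTPS; whether it is publicly reachable can only be determined by the server that fetches it.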

Voice settings are clamped
Stability, similarity boost, and style are clamped to the 0–1 range; speed is clamped to its supported voice-speed range; seed is clamped to a valid integer range. Out-of-range values are corrected rather than rejected.
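
Clamping, as opposed to rejecting, can be illustrated with the sketch below. Only the 0–1 range is stated on this page. The speed and seed bounds, the key names `similarityBoost` and friends, and the function names are assumptions chosen for illustration; substitute the route's real limits and field names.

```python
# Assumed bounds: placeholders, not values confirmed by this page.
SPEED_RANGE = (0.7, 1.2)
SEED_RANGE = (0, 2**32 - 1)
# Hypothetical key names for the settings that share the 0-1 range.
UNIT_FIELDS = ("stability", "similarityBoost", "style")


def clamp(value, low, high):
    """Pull value back inside [low, high] instead of raising an error."""
    return max(low, min(high, value))


def clamp_voice_settings(settings: dict) -> dict:
    """Return a copy of settings with every known field forced into range."""
    out = dict(settings)
    for key in UNIT_FIELDS:
        if key in out:
            out[key] = clamp(float(out[key]), 0.0, 1.0)
    if "speed" in out:
        out["speed"] = clamp(float(out["speed"]), *SPEED_RANGE)
    if "seed" in out:
        out["seed"] = clamp(int(out["seed"]), *SEED_RANGE)
    return out
```

Under these assumed bounds, a stability of 1.4 becomes 1.0 and a negative seed becomes 0, so a slightly out-of-range request still produces audio rather than an error.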

Outputs

On success the node emits MP3 audio on the audio-out port and stores it as durable media. Downstream nodes consume it like any other audio asset, and it remains reusable across sessions.

Generate voice audio

  1. Add a voice node

    Add a Text to Speech or Speech to Speech node. The configuration panel opens automatically so you can choose a voice and model.

  2. Provide the source

    For text-to-speech, type a prompt or connect a text source. For speech-to-speech, connect an audio input on audio-in.

  3. Tune voice and settings

    Pick a voice and adjust stability, similarity, style, or speed. Speech-to-speech can also remove background noise from the source recording.

  4. Run and reuse

    Run the node to generate audio. The stored result flows out of audio-out for downstream steps.

Agent and API notes

Both modes post to /api/generate-audio with a mode field. The route is rate-limited, bound to a canvas for credential scoping, and idempotent per provider operation so a retried run does not double-charge. The examples below show a text-to-speech and a speech-to-speech body.

POST /api/generate-audio, text-to-speech (JSON)

    {
      "mode": "tts",
      "text": "Welcome to Builder Studio.",
      "voice": "EXAVITQu4vr4xnSDxMaL",
      "model": "eleven_multilingual_v2"
    }

POST /api/generate-audio, speech-to-speech (JSON)

    {
      "mode": "sts",
      "voice": "EXAVITQu4vr4xnSDxMaL",
      "model": "eleven_multilingual_sts_v2",
      "audioUrl": "https://example.com/source-recording.mp3"
    }
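
A client could send either body with the Python standard library, along the lines of the sketch below. The host in `BASE_URL` is a placeholder, and the helper names are hypothetical. Authentication, canvas binding, and the response shape are not described on this page, so the sketch returns the raw response bytes and leaves those details to the caller.

```python
import json
import urllib.request

# Hypothetical origin; replace with your deployment's host.
BASE_URL = "https://studio.example.com"


def build_request(body: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST to the shared audio route without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/api/generate-audio",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_audio(body: dict, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body.

    Real calls will likely need credentials and a canvas identifier;
    both are omitted here because their format is not documented above.
    """
    with urllib.request.urlopen(build_request(body), timeout=timeout) as resp:
        return resp.read()
```

Separating `build_request` from `generate_audio` makes it possible to inspect or log the outgoing request without triggering a billable generation.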
