# Speed Control
Slow down or speed up generated audio with the `speed` parameter on `generateSpeech` and `generateConversation`.

Pass `speed` to `generateSpeech` or `generateConversation` to time-stretch the final audio without changing pitch. A value of `1` leaves the audio unchanged, values below `1` slow it down, and values above `1` speed it up.
## Quick Start
```ts
import { generateSpeech } from "@speech-sdk/core"

const result = await generateSpeech({
  model: "openai/gpt-4o-mini-tts",
  text: "Hello!",
  voice: "alloy",
  speed: 1.25,
})

result.audio.mediaType // "audio/mpeg"
```

## Range

`speed` must be a finite number between `0.75` and `1.5`. Passing a value outside that range throws a `RangeError`. `1` (or omitting the parameter) is a no-op.
## How It Works
The SDK uses a WSOLA-based time-stretching step on mono PCM audio. It decodes the provider's audio, stretches it in the time domain, and re-encodes to your requested output format.
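To make the time-domain idea concrete, here is a deliberately simplified overlap-add (OLA) stretch on mono PCM samples. It is not the SDK's implementation: real WSOLA additionally searches each analysis frame for the best-matching offset to keep waveforms aligned and avoid phase artifacts, which this sketch omits.

```ts
// Simplified OLA time stretch on mono PCM (illustrative only, not WSOLA):
// read analysis frames at hop * speed, write them at a fixed synthesis hop.
function olaStretch(input: Float32Array, speed: number, frame = 1024): Float32Array {
  const hopOut = frame / 2                     // synthesis hop, 50% overlap
  const hopIn = Math.round(hopOut * speed)     // analysis hop scales with speed
  const outLen = Math.ceil(input.length / speed)
  const out = new Float32Array(outLen)
  const norm = new Float32Array(outLen)        // tracks summed window weight

  // Hann window so overlapping frames cross-fade smoothly.
  const win = new Float32Array(frame)
  for (let i = 0; i < frame; i++) win[i] = 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / frame)

  for (let o = 0, a = 0; o + frame <= outLen; o += hopOut, a += hopIn) {
    for (let i = 0; i < frame && a + i < input.length; i++) {
      out[o + i] += input[a + i] * win[i]
      norm[o + i] += win[i]
    }
  }
  // Normalize by the accumulated window weight where frames overlapped.
  for (let i = 0; i < outLen; i++) if (norm[i] > 1e-6) out[i] /= norm[i]
  return out
}

const slower = olaStretch(new Float32Array(48000), 0.85) // speed < 1 → longer output
```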
- Direct provider path — time-stretch happens locally in the SDK. To avoid a wasted decode/re-encode round-trip, when `speed` is set the SDK requests a decodable wire format (PCM/WAV) from the provider and applies the final `output` format conversion as part of the stretch step.
- Gateway path — `speed` is forwarded in the wire payload to `api.speechbase.ai`, which applies the stretch server-side. The gateway invariant (one request, one billed call) is preserved.

`timestamps` and `audioDurationMs` on the result are scaled by `1 / speed`, so word alignment and reported duration match the actual stretched audio.
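The scaling itself is simple arithmetic. The sketch below shows it on a word-timestamp shape; `WordTimestamp` and `scaleTimestamps` are illustrative names, not SDK types, and only the `1 / speed` rule comes from the documentation.

```ts
// Illustrative shape for a word-level timestamp (not an SDK type).
interface WordTimestamp { word: string; startMs: number; endMs: number }

// speed > 1 plays faster, so every timestamp shrinks by a factor of 1 / speed.
function scaleTimestamps(words: WordTimestamp[], speed: number): WordTimestamp[] {
  return words.map((w) => ({
    word: w.word,
    startMs: w.startMs / speed,
    endMs: w.endMs / speed,
  }))
}

scaleTimestamps([{ word: "Hello", startMs: 0, endMs: 500 }], 1.25)
// → [{ word: "Hello", startMs: 0, endMs: 400 }]
```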
## Output Format
When `speed` is set without an explicit `output`, the stretched audio is encoded as `mp3` — matching what most providers return natively. Set `output` explicitly to override:
```ts
const result = await generateSpeech({
  model: "openai/gpt-4o-mini-tts",
  text: "Hello!",
  voice: "alloy",
  speed: 0.85,
  output: { format: "wav" },
})

result.audio.mediaType // "audio/wav"
```

## Conversations
`generateConversation` accepts `speed` at the top level and per turn:
```ts
import { generateConversation } from "@speech-sdk/core"

const result = await generateConversation({
  model: "elevenlabs/eleven_v3",
  speed: 1.1, // applies to every turn that doesn't set its own
  turns: [
    { voice: "voice-a", text: "Welcome to the show." },
    { voice: "voice-b", text: "Glad to be here.", speed: 0.9 },
    { voice: "voice-a", text: "Let's get started." },
  ],
})
```

A per-turn `speed` forces the stitch path on direct providers, since a single multi-speaker request can't carry per-turn stretch settings. On the gateway path, both top-level and per-turn `speed` are forwarded as-is.
## Notes
- Mono only. Stereo or multi-channel input is not stretched; pass mono audio (which is what every supported provider returns).
- Stretching happens in the time domain, so pitch is preserved. Extreme values toward the edges of the supported range may introduce mild artifacts; stay close to `1` for the cleanest result.
- The same stretch primitive is exported from `@speech-sdk/core/plugins` as `timeStretch` if you want to apply it outside the SDK pipeline.