Multi-Speaker Conversation

Generate a single audio file from a multi-turn, multi-voice script with generateConversation.

Use generateConversation to turn a script of turns into a single audio file — each turn rendered with its own model and voice, concatenated and volume-leveled. One call, multiple voices, any mix of providers.

Quick Start

import { generateConversation } from "@speech-sdk/core"

const result = await generateConversation({
  turns: [
    {
      model: "inworld/inworld-tts-1.5-max",
      voice: "Ashley",
      text: "Welcome to the Speech SDK — we just shipped conversation mode.",
    },
    {
      model: "elevenlabs/eleven_v3",
      voice: "JBFqnCBsd6RMkjVDRZzb",
      text: "That's right — one call, multiple voices, automatic volume leveling.",
    },
  ],
})

result.audio.uint8Array // Uint8Array
result.audio.mediaType // e.g. "audio/mpeg"
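
If you want the generated audio on disk, the `Uint8Array` writes straight out with Node's `fs`. A small sketch — the `extensionFor` and `saveAudio` helpers are illustrative, not part of the SDK:

```typescript
import { writeFile } from "node:fs/promises"

// Illustrative helper: map the result's mediaType to a file extension.
function extensionFor(mediaType: string): string {
  const known: Record<string, string> = {
    "audio/mpeg": "mp3",
    "audio/wav": "wav",
    "audio/ogg": "ogg",
  }
  return known[mediaType] ?? "bin"
}

// Write the generated conversation to e.g. "conversation.mp3".
async function saveAudio(
  audio: { uint8Array: Uint8Array; mediaType: string },
  base = "conversation",
): Promise<string> {
  const path = `${base}.${extensionFor(audio.mediaType)}`
  await writeFile(path, audio.uint8Array)
  return path
}
```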

Each turn accepts the same model and voice values you'd pass to generateSpeech — a provider/model string, a bare provider name, or a configured ResolvedModel.

How It Works

SpeechSDK picks the cheapest execution path that satisfies your script:

  • Native dialogue — when every turn uses the same model and that model has native multi-speaker support (e.g. some providers' dialogue models), SpeechSDK makes a single API call and lets the provider render the full conversation.
  • Stitch — otherwise SpeechSDK generates each turn independently, transcodes them to a common media type, and concatenates the chunks into one file. This is how you can mix providers in a single conversation.

Either way, the call signature and result shape are identical. You don't need to think about which path is taken.
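
A rough sketch of that decision — illustrative only, since the real capability check is internal to the SDK (here it is passed in as a hypothetical predicate):

```typescript
type Turn = { model: string; voice: string; text: string }

// Pick the execution path: one native multi-speaker call when possible,
// otherwise per-turn generation followed by stitching.
function pickPath(
  turns: Turn[],
  supportsNativeDialogue: (model: string) => boolean, // hypothetical capability check
): "native-dialogue" | "stitch" {
  const models = new Set(turns.map((t) => t.model))
  // Native dialogue requires a single shared model with multi-speaker support.
  if (models.size === 1 && supportsNativeDialogue(turns[0].model)) {
    return "native-dialogue"
  }
  return "stitch"
}
```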

Options

interface GenerateConversationOptions {
  turns: ConversationTurn[]
  normalizeVolume?: boolean // default: true
  timestamps?: boolean // default: false
  abortSignal?: AbortSignal
}

interface ConversationTurn {
  model: string | ResolvedModel
  voice: Voice
  text: string
}

Volume Normalization

Different providers output audio at different loudness levels. By default, SpeechSDK normalizes each turn to a consistent loudness target before concatenating, so no single speaker is noticeably quieter or louder than the others.
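
To see the idea, here is a simplified sketch of leveling a block of PCM samples to a target RMS. The SDK's actual leveling is internal and may use a perceptual loudness measure rather than plain RMS — this is only an illustration:

```typescript
// Scale PCM samples so their RMS level matches a target.
// (Simplified: real loudness normalization is typically perceptual, e.g. LUFS.)
function normalizeRms(samples: Float32Array, targetRms = 0.1): Float32Array {
  let sumSquares = 0
  for (const s of samples) sumSquares += s * s
  const rms = Math.sqrt(sumSquares / samples.length)
  if (rms === 0) return samples // silence: nothing to scale
  const gain = targetRms / rms
  return samples.map((s) => s * gain)
}
```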

Set normalizeVolume: false to skip this step when you want the raw output:

await generateConversation({
  turns,
  normalizeVolume: false,
})

Timestamps

Pass timestamps: true to get a ConversationWordTimestamp[] — every word carries a turnIndex pointing back at the turn that produced it. That makes it trivial to derive per-speaker time ranges.

const result = await generateConversation({
  turns: [
    { model: "elevenlabs/eleven_v3", voice: "rachel", text: "Hi there." },
    { model: "elevenlabs/eleven_v3", voice: "adam",   text: "Hello!" },
    { model: "elevenlabs/eleven_v3", voice: "rachel", text: "How are you?" },
  ],
  timestamps: true,
})

result.timestamps
// [
//   { text: "Hi",       start: 0.00, end: 0.15, turnIndex: 0 },
//   { text: "there.",   start: 0.16, end: 0.42, turnIndex: 0 },
//   { text: "Hello!",   start: 0.72, end: 1.05, turnIndex: 1 },
//   { text: "How",      start: 1.35, end: 1.50, turnIndex: 2 },
//   { text: "are",      start: 1.51, end: 1.66, turnIndex: 2 },
//   { text: "you?",     start: 1.67, end: 1.92, turnIndex: 2 },
// ]

Aggregating words into per-speaker time ranges

Group consecutive same-turnIndex words to get one span per turn, then map each span to the voice that produced it via turns[turnIndex].voice:

// [
//   { turnIndex: 0, voice: "rachel", start: 0.00, end: 0.42, text: "Hi there." },
//   { turnIndex: 1, voice: "adam",   start: 0.72, end: 1.05, text: "Hello!" },
//   { turnIndex: 2, voice: "rachel", start: 1.35, end: 1.92, text: "How are you?" },
// ]
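
A small helper that performs this grouping — a sketch assuming the word objects carry exactly the fields shown above:

```typescript
interface Word {
  text: string
  start: number
  end: number
  turnIndex: number
}

interface Span {
  turnIndex: number
  voice: string
  start: number
  end: number
  text: string
}

// Collapse consecutive same-turnIndex words into one span per turn,
// resolving each span's voice from the original turns array.
function toSpans(words: Word[], turns: { voice: string }[]): Span[] {
  const spans: Span[] = []
  for (const w of words) {
    const last = spans[spans.length - 1]
    if (last && last.turnIndex === w.turnIndex) {
      // Same turn as the previous word: extend the open span.
      last.end = w.end
      last.text += ` ${w.text}`
    } else {
      // New turn: open a fresh span.
      spans.push({
        turnIndex: w.turnIndex,
        voice: turns[w.turnIndex].voice,
        start: w.start,
        end: w.end,
        text: w.text,
      })
    }
  }
  return spans
}
```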

You can now answer "which voice is playing right now?" by binary-searching spans for the current playback time, drive a chat-bubble UI, or render speaker-attributed transcripts. The start of span i+1 is the natural hand-off point — the (spans[i].end, spans[i+1].start) window is the inter-turn silence (controlled by gapMs).
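
The lookup itself can be a plain binary search over the sorted, non-overlapping spans — a sketch, using the span shape derived above:

```typescript
interface Span {
  start: number
  end: number
  voice: string
}

// Return the index of the span containing `time`, or -1 when `time`
// falls in an inter-turn gap (or outside the conversation entirely).
function spanAt(spans: Span[], time: number): number {
  let lo = 0
  let hi = spans.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (time < spans[mid].start) hi = mid - 1
    else if (time > spans[mid].end) lo = mid + 1
    else return mid
  }
  return -1
}
```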

See the timestamps & captions guide for mode semantics, provider support, and converting the result into SRT or WebVTT via timestampsToCaptions.

Aborting

Pass an AbortSignal to cancel an in-flight conversation — any outstanding per-turn requests are aborted too:

const controller = new AbortController()

// Schedule the abort before awaiting — otherwise the call has already
// settled by the time the timer is set
setTimeout(() => controller.abort(), 5000)

const result = await generateConversation({
  turns,
  abortSignal: controller.signal,
})

Mixing Providers

Any combination of providers that each support single-turn generation can be stitched together. This is useful when you want a specific voice from one provider paired with a specific voice from another:

import { generateConversation } from "@speech-sdk/core"
import { createInworld, createElevenLabs } from "@speech-sdk/core/providers"

const inworld = createInworld({ apiKey: process.env.INWORLD_API_KEY })
const eleven = createElevenLabs({ apiKey: process.env.ELEVENLABS_API_KEY })

const result = await generateConversation({
  turns: [
    { model: inworld(), voice: "Ashley", text: "Hi!" },
    { model: eleven("eleven_v3"), voice: "EXAVITQu4vr4xnSDxMaL", text: "Hello!" },
    { model: inworld(), voice: "Ashley", text: "Nice to meet you." },
  ],
})

Limits

  • Every turn must include non-empty text.
  • Every model referenced in the script must support standard single-turn generation — the stitch path depends on it. The native-dialogue path applies only when all turns share a single model with multi-speaker support.

There is no cap on the number of unique voices in a conversation — you can mix as many distinct (provider, voice) pairs as your script needs, and the same voice may reappear in any number of turns.