Auto-Chunking
Generate speech from inputs that exceed a provider's per-request character limit by splitting on sentence boundaries and stitching the audio.
Most TTS providers cap how much text you can send in a single request. When your input is over the cap, SpeechSDK splits it, renders each piece, and stitches the result back into one audio file — with timestamps reconnected end-to-end. The call signature and result shape don't change.
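Conceptually, reconnecting timestamps means shifting each chunk's word timings by the total duration of the audio rendered before it, so the stitched file has one continuous timeline. A minimal sketch (illustrative only, not the SDK's internal code; the `WordTimestamp` shape here is an assumption):

```ts
interface WordTimestamp {
  word: string
  start: number // seconds from the start of this chunk's audio
  end: number
}

// Shift each chunk's timestamps by the cumulative duration of all prior
// chunks, producing one timeline across the stitched audio file.
function stitchTimestamps(
  chunks: { durationSec: number; timestamps: WordTimestamp[] }[],
): WordTimestamp[] {
  const merged: WordTimestamp[] = []
  let offset = 0
  for (const chunk of chunks) {
    for (const t of chunk.timestamps) {
      merged.push({ word: t.word, start: t.start + offset, end: t.end + offset })
    }
    offset += chunk.durationSec
  }
  return merged
}
```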
Quick Start
Pass maxInputChars to override (or supply) a chunking limit on generateSpeech or generateConversation:
```ts
import { generateSpeech } from "@speech-sdk/core"

const longText = "First sentence. Second sentence. ... thousands of words ..."

const result = await generateSpeech({
  model: "openai/gpt-4o-mini-tts",
  text: longText,
  voice: "alloy",
  maxInputChars: 4000, // override the model default
})

result.audio.uint8Array // single stitched audio file
```

If a model declares its own `maxInputChars`, SpeechSDK uses it automatically; you only set the option to override that default or to opt in a model that has no default.
Splitting
Chunks split on sentence boundaries and are balanced so each piece is a similar size. Even splits matter because TTS providers shape prosody — pacing, breath, intonation — across the whole request: a greedy "fill to the cap then break" produces an audible tonal shift at the seam, while balanced cuts blend together. Sentence terminators are detected across ASCII, CJK, Devanagari, and Arabic; paragraph and line breaks are preferred natural break points.
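As an illustration of why balancing matters, here is a minimal sketch of a balanced splitter (not the SDK's internal algorithm, and ASCII-only for brevity): it first estimates how many chunks are needed, then packs whole sentences toward that even target instead of filling each chunk to the cap.

```ts
// Split text into chunks of roughly equal size on sentence boundaries.
// Note: a single sentence longer than the cap is emitted whole in this sketch.
function balancedChunks(text: string, maxChars: number): string[] {
  // Split after ASCII sentence terminators plus trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]+\s*/g) ?? [text]
  // Number of chunks needed, then the even target size per chunk.
  const count = Math.ceil(text.length / maxChars)
  const target = Math.ceil(text.length / count)
  const chunks: string[] = []
  let current = ""
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > target) {
      chunks.push(current.trim())
      current = ""
    }
    current += sentence
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}
```

A greedy splitter would pack the first chunk to the cap and leave a short remainder; aiming at the even target keeps the pieces similar in size, which is what lets the provider's prosody line up at the seams.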
Long-Form generateSpeech
```ts
import { generateSpeech } from "@speech-sdk/core"
import { readFileSync } from "node:fs"

const article = readFileSync("./article.txt", "utf8") // 50 kB of text

const result = await generateSpeech({
  model: "elevenlabs/eleven_v3",
  text: article,
  voice: "JBFqnCBsd6RMkjVDRZzb",
  maxInputChars: 5000,
  output: { format: "mp3" },
  timestamps: true,
})

result.audio.mediaType // "audio/mpeg"
result.timestamps // word-level alignment across the full article
```

generateConversation That Exceeds Limits
If any turn would blow past the limit, SpeechSDK forces the conversation onto the stitch path — even when the chosen model supports native multi-speaker dialogue. Each oversize turn is chunked individually; per-turn timestamps stay tagged with the correct turnIndex.
```ts
import { generateConversation } from "@speech-sdk/core"

// A long monologue that exceeds the per-request cap.
const longMonologue = "Lorem ipsum dolor sit amet… (thousands of words)"

const result = await generateConversation({
  turns: [
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "alloy",
      text: "A short opening line.",
    },
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "verse",
      text: longMonologue,
    },
  ],
  maxInputChars: 4000,
  timestamps: true,
})
```

Gateway Routing
Calls routed through the Speechbase gateway skip client-side chunking. The gateway handles input limits server-side, so `maxInputChars` is ignored on the wire.
Errors
If chunking is required but the selected provider can't produce a format the SDK can stitch, the call throws TextChunkingUnsupportedError. Pick a different model, lower maxInputChars, or shorten the input.
```ts
import { TextChunkingUnsupportedError, generateSpeech } from "@speech-sdk/core"

const veryLongText = "…20 kB of input text…"

try {
  await generateSpeech({
    model: "some-provider/exotic-model",
    text: veryLongText,
    voice: "voice-id",
    maxInputChars: 2000,
  })
} catch (error) {
  if (error instanceof TextChunkingUnsupportedError) {
    // Provider can't produce a decodable format for stitching.
  }
}
```

Notes
- Per-model `maxInputChars` defaults exist for OpenAI, ElevenLabs, fal, Hume, Inworld, Deepgram, and xAI. A caller-supplied value always wins.
- Pronunciation rules and output conversion apply to the final stitched audio just like any other call.
- `streamSpeech` does not chunk.