Timestamps & Captions
Return word-level alignment from generateSpeech and convert it into SRT or WebVTT captions.
Pass `timestamps: true` to `generateSpeech` or `generateConversation` to get word-level alignment alongside the audio. Feed the result to `timestampsToCaptions` for an SRT or WebVTT caption file; no extra API calls are required when the provider returns alignment natively.
Quick Start
```ts
import { generateSpeech } from "@speech-sdk/core"

const result = await generateSpeech({
  model: "elevenlabs/eleven_v3",
  text: "Hello from SpeechSDK!",
  voice: "JBFqnCBsd6RMkjVDRZzb",
  timestamps: true,
})

result.timestamps
// [
//   { text: "Hello", start: 0.00, end: 0.32 },
//   { text: "from", start: 0.36, end: 0.55 },
//   { text: "SpeechSDK!", start: 0.58, end: 1.12 },
// ]
```

Timings are always word-granularity, with `start` and `end` measured in seconds from the beginning of the generated audio. Providers that natively return character- or phoneme-level data are aggregated into words internally.
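As a rough illustration of that word aggregation (a sketch, not the SDK's internal code; the character-level input shape here is an assumption), character alignment can be folded into word entries by closing a word at each whitespace character:

```typescript
interface CharTimestamp {
  char: string
  start: number // seconds
  end: number // seconds
}

interface WordTimestamp {
  text: string
  start: number
  end: number
}

// Fold character-level alignment into word-level entries: whitespace
// closes the current word; each word spans its first character's
// start to its last character's end.
function charsToWords(chars: CharTimestamp[]): WordTimestamp[] {
  const words: WordTimestamp[] = []
  let current: CharTimestamp[] = []
  const flush = () => {
    if (current.length === 0) return
    words.push({
      text: current.map((c) => c.char).join(""),
      start: current[0].start,
      end: current[current.length - 1].end,
    })
    current = []
  }
  for (const c of chars) {
    if (/\s/.test(c.char)) flush()
    else current.push(c)
  }
  flush()
  return words
}
```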
WordTimestamp
```ts
interface WordTimestamp {
  readonly text: string
  readonly start: number // seconds
  readonly end: number // seconds
}
```

How `timestamps` Works
`timestamps` is a boolean. The behavior depends on whether the model has native alignment:

| `timestamps` | Native-alignment model | No native alignment |
|---|---|---|
| `true` | Returned in the TTS response; no extra calls. | SDK transcribes the generated audio via the configured STT fallback to recover timings. |
| `false` (default) | Not returned. | Not returned. |
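The table can be pictured as a small pure function (an illustration only; this is not how the SDK exposes the logic):

```typescript
type TimestampSource = "native" | "stt-fallback" | "none"

// Mirrors the table above: timestamps are only produced when requested,
// and the source depends on whether the model has native alignment.
function resolveTimestampSource(
  requested: boolean,
  hasNativeAlignment: boolean,
): TimestampSource {
  if (!requested) return "none"
  return hasNativeAlignment ? "native" : "stt-fallback"
}
```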
```ts
await generateSpeech({
  model: "openai/gpt-4o-mini-tts",
  text: "Hello!",
  voice: "alloy",
  timestamps: true, // forces STT fallback (OpenAI has no native TTS alignment)
})
```

When `timestamps` is omitted (or set to `false`), `result.timestamps` is `undefined`, even on providers that would have returned alignment for free.
STT Fallback
With `timestamps: true` and a TTS provider that lacks native alignment, SpeechSDK transcribes the generated audio to recover word timings. The default fallback is OpenAI Whisper (`openai/whisper-1`, reads `OPENAI_API_KEY`). Expect extra cost and latency on this path.
Override the fallback by passing `fallbackSTT` to the provider factory when you create the client:
```ts
import { generateSpeech } from "@speech-sdk/core"
import { createCartesia, createOpenAI } from "@speech-sdk/core/providers"

const cartesia = createCartesia({
  apiKey: process.env.CARTESIA_API_KEY,
  fallbackSTT: createOpenAI({ apiKey: process.env.MY_WHISPER_KEY }).stt("whisper-1"),
})

await generateSpeech({
  model: cartesia("sonic-3"),
  text: "Hello!",
  voice: "voice-id",
  timestamps: true,
})
```

Every provider factory (`createElevenLabs`, `createCartesia`, `createOpenAI`, `createInworld`, …) accepts an optional `fallbackSTT: ResolvedSTTModel`. When you reference a model by string (e.g. `"openai/gpt-4o-mini-tts"`), the SDK uses the default Whisper fallback unless you pre-create the client with `fallbackSTT` and pass the resolved model in.
If `timestamps: true` is requested but the fallback API key is missing, SpeechSDK throws `TimestampKeyMissingError` naming the env var you'd need to set.
Provider Support
| Provider | Native timestamps? |
|---|---|
| ElevenLabs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2`, `eleven_flash_v2_5`) | Yes; returned in the TTS response |
| Murf (`GEN2`) | Yes; `wordDurations` from the TTS response (FALCON streaming model excluded) |
| Hume (`octave-2`) | Yes; word alignment from the JSON `/v0/tts` endpoint (`octave-1` not supported) |
| Inworld (`inworld-tts-1.5-max`, `inworld-tts-1.5-mini`) | Yes; `timestampInfo.wordAlignment` (best on English/Spanish) |
| Cartesia (`sonic-3`, `sonic-2`) | Yes; SSE endpoint with `add_timestamps: true` |
| Resemble (default) | Yes; `audio_timestamps` from `/synthesize`, aggregated into words |
| OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI | No native alignment; `timestamps: true` routes through the STT fallback |
Each `ModelInfo` declares its capabilities in a `features` array; models with native alignment include `"timestamps"`. You can inspect the array on a resolved model:
```ts
import { createElevenLabs } from "@speech-sdk/core/providers"

const elevenlabs = createElevenLabs({ apiKey: process.env.ELEVENLABS_API_KEY })
const model = elevenlabs("eleven_v3")
const info = model.provider.models.find((m) => m.id === model.modelId)
const hasNativeTimestamps = info?.features.includes("timestamps") ?? false
```

When `hasNativeTimestamps` is `false`, `timestamps: true` will route through the STT fallback (Whisper by default).
Multi-Speaker Conversations
`generateConversation` accepts the same `timestamps` option. The result is a `ConversationWordTimestamp[]`: every word carries a `turnIndex` pointing back at the turn that produced it, so you can derive per-speaker time ranges without doing any time-bucketing yourself.
- Stitch path: per-turn timings are offset by the cumulative turn duration plus the inter-turn gap; `turnIndex` is set during compose.
- Native dialogue path: provider alignment on the mixed audio, with `turnIndex` attributed via greedy text-matching against `turns[i].text`.
```ts
import { generateConversation } from "@speech-sdk/core"

const result = await generateConversation({
  turns: [
    { model: "elevenlabs/eleven_v3", voice: "JBFqnCBsd6RMkjVDRZzb", text: "Hello!" },
    { model: "elevenlabs/eleven_v3", voice: "EXAVITQu4vr4xnSDxMaL", text: "Hi there." },
  ],
  timestamps: true,
})

result.timestamps
// [
//   { text: "Hello!", start: 0.00, end: 0.42, turnIndex: 0 },
//   { text: "Hi", start: 0.72, end: 0.90, turnIndex: 1 },
//   { text: "there.", start: 0.91, end: 1.18, turnIndex: 1 },
// ]
```

For the full pattern (aggregating words into per-speaker time spans and mapping each span back to its voice), see Multi-Speaker Conversation › Timestamps.
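A minimal sketch of deriving per-turn time ranges from `turnIndex` (illustrative only, assuming the word shape shown in the output above; this is not an SDK helper):

```typescript
interface ConversationWordTimestamp {
  text: string
  start: number
  end: number
  turnIndex: number
}

// Collapse word timestamps into one { start, end } span per turn:
// a turn's span runs from its earliest word start to its latest word end.
function turnSpans(
  words: ConversationWordTimestamp[],
): Map<number, { start: number; end: number }> {
  const spans = new Map<number, { start: number; end: number }>()
  for (const w of words) {
    const span = spans.get(w.turnIndex)
    if (!span) {
      spans.set(w.turnIndex, { start: w.start, end: w.end })
    } else {
      span.start = Math.min(span.start, w.start)
      span.end = Math.max(span.end, w.end)
    }
  }
  return spans
}
```

Mapping each span back to a speaker is then just an index into your original `turns` array.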
Captions (SRT / WebVTT)
Use `timestampsToCaptions` to turn word-level timestamps into a caption file. SRT is the default; pass `format: "vtt"` for WebVTT (required for the HTML `<track>` element).
```ts
import { generateSpeech, timestampsToCaptions } from "@speech-sdk/core"

const { timestamps } = await generateSpeech({
  model: "elevenlabs/eleven_v3",
  text: "Hello world. This is a test.",
  voice: "JBFqnCBsd6RMkjVDRZzb",
  timestamps: true,
})

const srt = timestampsToCaptions(timestamps ?? [])
// 1
// 00:00:00,000 --> 00:00:01,200
// Hello world.
//
// 2
// 00:00:01,300 --> 00:00:02,800
// This is a test.

const vtt = timestampsToCaptions(timestamps ?? [], { format: "vtt" })
// WEBVTT
//
// 1
// 00:00:00.000 --> 00:00:01.200
// Hello world.
//
// 2
// 00:00:01.300 --> 00:00:02.800
// This is a test.
```

Output follows the SubRip and W3C WebVTT conventions: comma-decimal (SRT) vs. period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (`&`, `<`, `>`) on the VTT path.
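The two timestamp notations can be sketched as a small formatter (an illustration of the convention, not the SDK's code):

```typescript
// Format a seconds offset as a cue timestamp: HH:MM:SS,mmm for SRT
// (comma decimal) or HH:MM:SS.mmm for WebVTT (period decimal).
function formatCueTime(seconds: number, format: "srt" | "vtt"): string {
  const totalMs = Math.round(seconds * 1000)
  const ms = totalMs % 1000
  const totalSec = Math.floor(totalMs / 1000)
  const s = totalSec % 60
  const m = Math.floor(totalSec / 60) % 60
  const h = Math.floor(totalSec / 3600)
  const pad = (n: number, width: number) => String(n).padStart(width, "0")
  const sep = format === "srt" ? "," : "."
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(ms, 3)}`
}
```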
Cues break on sentence boundaries (`.`, `!`, `?`, along with CJK, Devanagari, and Arabic equivalents). Long sentences are subdivided by character count, cue duration, and soft comma breaks.
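The sentence-boundary grouping can be pictured like this (a simplified sketch covering only `.`, `!`, `?`; the SDK additionally handles the other scripts and the length-based subdivision described above):

```typescript
interface WordTimestamp {
  text: string
  start: number
  end: number
}

// Group words into cue-sized runs, breaking after any word that
// ends with sentence-final punctuation.
function splitAtSentences(words: WordTimestamp[]): WordTimestamp[][] {
  const cues: WordTimestamp[][] = []
  let current: WordTimestamp[] = []
  for (const w of words) {
    current.push(w)
    if (/[.!?]$/.test(w.text)) {
      cues.push(current)
      current = []
    }
  }
  if (current.length > 0) cues.push(current)
  return cues
}
```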
Options
```ts
interface CaptionsOptions {
  format?: "srt" | "vtt" // default: "srt"
  maxLineLength?: number // default: 42
  maxLinesPerCue?: number // default: 2
  maxCharsPerCue?: number // default: maxLineLength * maxLinesPerCue
  maxCueDurationMs?: number // default: 7000
  longPhraseCommaBreakChars?: number // default: 60
}
```

| Option | Purpose |
|---|---|
| `format` | `"srt"` or `"vtt"`. VTT is required for HTML `<track>`. |
| `maxLineLength` | Characters per line (word-boundary wrap). 42 is the common broadcast convention for Latin-alphabet subtitles; try 16 for CJK content. |
| `maxLinesPerCue` | Hard ceiling on lines in a single cue. |
| `maxCharsPerCue` | Hard ceiling on characters in a single cue before SpeechSDK forces a cue break. |
| `maxCueDurationMs` | Hard ceiling on cue length; a cue that would exceed this is split at the next word boundary. |
| `longPhraseCommaBreakChars` | Minimum cue character count at which a trailing comma triggers a soft cue break. Prevents tiny fragments after every comma. |
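As an illustration of the `maxLineLength` word-boundary wrap (a sketch, not the SDK's implementation):

```typescript
// Wrap cue text at word boundaries so no line exceeds maxLineLength
// characters (a single over-long word still gets its own line).
function wrapLine(text: string, maxLineLength = 42): string[] {
  const lines: string[] = []
  let current = ""
  for (const word of text.split(/\s+/)) {
    if (current === "") {
      current = word
    } else if (current.length + 1 + word.length <= maxLineLength) {
      current += " " + word
    } else {
      lines.push(current)
      current = word
    }
  }
  if (current !== "") lines.push(current)
  return lines
}
```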
Serving as an HTML <track>
```ts
import { generateSpeech, timestampsToCaptions } from "@speech-sdk/core"

export async function GET() {
  const { timestamps } = await generateSpeech({
    model: "elevenlabs/eleven_v3",
    text: "Hello world.",
    voice: "JBFqnCBsd6RMkjVDRZzb",
    timestamps: true,
  })
  const vtt = timestampsToCaptions(timestamps ?? [], { format: "vtt" })
  return new Response(vtt, {
    headers: { "Content-Type": "text/vtt" },
  })
}
```

```html
<video>
  <source src="/audio.mp3" />
  <track default src="/captions.vtt" kind="captions" srclang="en" />
</video>
```

Writing SRT to Disk
```ts
import { writeFileSync } from "node:fs"

writeFileSync("captions.srt", timestampsToCaptions(timestamps ?? []))
```