The Unified Text-to-Speech SDK
SpeechSDK is a free, open-source toolkit for building AI audio applications on top of multiple voice providers.
import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from SpeechSDK!',
  voice: 'alloy',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: 'Hello from SpeechSDK!',
  voice: 'EXAVITQu4vr4xnSDxMaL',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'cartesia/sonic-3',
  text: 'Hello from SpeechSDK!',
  voice: 'a0e99841-438c-4a64-b679-ae501e7d6091',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'google/gemini-3.1-flash-tts-preview',
  text: 'Hello from SpeechSDK!',
  voice: 'Kore',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/wav"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'xai/grok-tts',
  text: 'Hello from SpeechSDK!',
  voice: 'ava',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
  model: 'inworld/inworld-tts-1.5-max',
  text: 'Hello from SpeechSDK!',
  voice: 'Ashley',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

One API, Every Provider
One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.
Multi-Speaker Conversations
Generate a multi-speaker conversation with a single API call. Mix voices across providers in one request, with automatic volume leveling and turn stitching.
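To make "volume leveling and turn stitching" concrete, here is a minimal sketch of one simple way to do it. It is not SpeechSDK's implementation: it assumes every turn has already been decoded to mono Float32 PCM at a shared sample rate, uses plain RMS gain matching, and the names `rmsLevel`, `levelTurn`, and `stitchTurns` are hypothetical.

```typescript
// Hypothetical helpers, not part of @speech-sdk/core.
// Assumes every turn is mono Float32 PCM in [-1, 1] at one shared sample rate.

/** Root-mean-square level of a buffer (0 for an empty buffer). */
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  return Math.sqrt(sumSquares / samples.length);
}

/** Scale a turn so its RMS matches targetRms, clamping samples to avoid clipping. */
function levelTurn(samples: Float32Array, targetRms: number): Float32Array {
  const current = rmsLevel(samples);
  if (current === 0) return samples.slice(); // silence: nothing to scale
  const gain = targetRms / current;
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}

/** Level each turn, then concatenate them with a short silent gap in between. */
function stitchTurns(
  turns: Float32Array[],
  sampleRate: number,
  targetRms = 0.1, // assumed target level
  gapMs = 250,     // assumed pause between speakers
): Float32Array {
  const gap = Math.round((sampleRate * gapMs) / 1000);
  const leveled = turns.map((t) => levelTurn(t, targetRms));
  let total = gap * Math.max(0, leveled.length - 1);
  for (const t of leveled) total += t.length;
  const out = new Float32Array(total); // zero-filled, so gaps are silence
  let offset = 0;
  leveled.forEach((t, i) => {
    out.set(t, offset);
    offset += t.length + (i < leveled.length - 1 ? gap : 0);
  });
  return out;
}
```

RMS matching is the simplest leveling strategy; perceptual loudness measures would need more machinery than this sketch shows.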
Auto-Chunking & Timestamps
Splits long inputs at sentence boundaries for providers with maximum input lengths, stitches the chunks into one audio file, and keeps word-level timestamps continuous across chunks.
Why SpeechSDK?
Locking into a single TTS provider's SDK means rewriting code when a better or less expensive model ships.
The SpeechSDK integrates all major providers into an easy-to-use, unified interface so you can swap models without breaking your application code.
import { generateConversation } from "@speech-sdk/core";
const result = await generateConversation({
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAVITQu4vr4xnSDxMaL",
      text: "Hello from the SDK.",
    },
    {
      model: "google/gemini-3.1-flash-tts-preview",
      voice: "Kore",
      text: "One call. Multiple voices. Auto-leveled.",
    },
  ],
});
result.audio.uint8Array; // Uint8Array
result.audio.mediaType; // "audio/mpeg"

AI Engineering
For Production Voice Applications
Smart retries
Jittered exponential backoff automatically retries 5xx and 429 responses. 429 responses honor the Retry-After header (capped at 60s) and expose the delay via ApiError.retryAfterMs.
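The retry policy described above can be sketched as a pure delay calculation. This is an illustration under assumptions, not the SDK's code: only the 60s Retry-After cap and the 5xx/429 rule come from the text, while the base delay, the ceiling, the "full jitter" strategy, and every name below are hypothetical.

```typescript
// Illustrative only. Option names and defaults are assumptions;
// only the 60s Retry-After cap and the 5xx/429 rule come from the text above.
interface BackoffOptions {
  baseMs: number;          // delay scale for the first retry (assumed)
  maxMs: number;           // ceiling for the exponential term (assumed)
  retryAfterCapMs: number; // cap applied to a server-provided Retry-After
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 500, maxMs: 30000, retryAfterCapMs: 60000 };

/** 5xx and 429 are treated as retryable; everything else fails fast. */
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

/** Parse a Retry-After header given either in seconds or as an HTTP date. */
function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null || header.trim() === "") return undefined;
  const seconds = Number(header);
  if (isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(header);
  return isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

/**
 * Delay before retry number `attempt` (0-based).
 * With no Retry-After, uses full jitter: a random value in
 * [0, min(maxMs, baseMs * 2^attempt)). A Retry-After value takes
 * precedence and is capped at retryAfterCapMs.
 */
function retryDelayMs(
  attempt: number,
  retryAfterMs: number | undefined,
  random: () => number = Math.random,
  opts: BackoffOptions = DEFAULT_BACKOFF,
): number {
  if (retryAfterMs !== undefined) return Math.min(retryAfterMs, opts.retryAfterCapMs);
  const ceiling = Math.min(opts.maxMs, opts.baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Passing the random source as a parameter keeps the calculation deterministic in tests; a real retry loop would sleep for the returned delay before reissuing the request.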
Long inputs, handled
maxInputChars splits at sentence boundaries, stitches chunks into one audio file, and reconnects word-level timestamps end-to-end.
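A rough sketch of the two steps named above, sentence-boundary chunking and timestamp realignment, might look like this. The SDK's own splitter may use different rules; `chunkBySentence`, `mergeChunkTimestamps`, and the `WordStamp` shape are hypothetical, and the regex deliberately ignores edge cases such as decimals and abbreviations.

```typescript
// Hypothetical helpers illustrating the idea; not SpeechSDK internals.

/** Split text into chunks of at most maxChars, breaking after sentence-ending punctuation. */
function chunkBySentence(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  // A sentence ends in ., !, or ? (plus closing quotes/brackets); the tail may lack punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+["')\]]*\s*|[^.!?]+$/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (sentence.length > maxChars) {
      // A single oversized sentence: flush what we have, then hard-split it.
      if (current.trim()) chunks.push(current.trim());
      current = "";
      for (let i = 0; i < sentence.length; i += maxChars) {
        const piece = sentence.slice(i, i + maxChars).trim();
        if (piece) chunks.push(piece);
      }
      continue;
    }
    if ((current + sentence).trim().length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

interface WordStamp { word: string; startSec: number; endSec: number; }
interface ChunkAudio { durationSec: number; words: WordStamp[]; }

/** Shift each chunk's word timings by the total duration of the chunks before it. */
function mergeChunkTimestamps(chunks: ChunkAudio[]): WordStamp[] {
  const merged: WordStamp[] = [];
  let offsetSec = 0;
  for (const chunk of chunks) {
    for (const w of chunk.words) {
      merged.push({ word: w.word, startSec: w.startSec + offsetSec, endSec: w.endSec + offsetSec });
    }
    offsetSec += chunk.durationSec;
  }
  return merged;
}
```

Each chunk would be synthesized separately, its audio appended in order, and its word timings shifted by the running duration so the merged timestamps stay monotonic.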
Format conversions
Render wav, mp3, or pcm from any provider. Native pass-through where supported, lossless local conversion otherwise.
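As one example of a lossless local conversion, raw PCM can be turned into WAV by prepending a 44-byte RIFF header, since the samples themselves are unchanged. The sketch below assumes little-endian integer PCM with a known sample rate; `pcmToWav` is a hypothetical helper, not a documented SDK function, and MP3 output would need an actual encoder that is out of scope here.

```typescript
// Hypothetical helper: wraps raw little-endian integer PCM in a canonical WAV header.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate: number,
  channels = 1,       // assumed mono
  bitsPerSample = 16, // assumed 16-bit samples
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // audio format 1 = integer PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the sample data
  out.set(pcm, 44);                         // samples are copied untouched
  return out;
}
```

The sample rate has to come from the provider or request settings, because raw PCM carries no metadata of its own.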
Custom fetch & Base URL
Every provider accepts a custom fetch and baseURL — point at OpenAI-compatible proxies, Azure, LiteLLM, or local models.
Words and captions
Word-level timestamps from native alignment or a one-shot STT fallback. timestampsToCaptions ships SRT or WebVTT in a single call.
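To show what turning word timings into captions involves, here is a small sketch that groups words into SRT cues. It does not reproduce the signature or behavior of timestampsToCaptions, which this page does not document; `TimedWord`, `formatSrtTime`, and `wordsToSrt` are hypothetical, and the fixed words-per-cue grouping is an assumption.

```typescript
// Hypothetical sketch; the real timestampsToCaptions API may differ.
interface TimedWord { text: string; startSec: number; endSec: number; }

/** Left-pad a non-negative integer with zeros to the given width. */
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

/** Format seconds as an SRT timestamp: HH:MM:SS,mmm. */
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return zeroPad(h, 2) + ":" + zeroPad(m, 2) + ":" + zeroPad(s, 2) + "," + zeroPad(ms, 3);
}

/** Group words into cues of up to maxWordsPerCue and render them as SRT text. */
function wordsToSrt(words: TimedWord[], maxWordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const group = words.slice(i, i + maxWordsPerCue);
    const start = formatSrtTime(group[0].startSec);
    const end = formatSrtTime(group[group.length - 1].endSec);
    const text = group.map((w) => w.text).join(" ");
    // An SRT cue is: index, time range, text, then a blank line.
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}\n`);
  }
  return cues.join("\n");
}
```

WebVTT output follows the same structure with a `WEBVTT` header line and a period instead of a comma before the milliseconds.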
Speechbase ready
Queuing, quality processing, voice management, and analytics — one config change to connect. Coming soon.
PROVIDERS
Every model, one interface
| Provider | Model String | Default* |
|---|---|---|
| OpenAI | openai/gpt-4o-mini-tts | Yes |
| ElevenLabs | elevenlabs/eleven_v3 | Yes |
| ElevenLabs | elevenlabs/eleven_flash_v2_5 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2 | — |
| Deepgram | deepgram/aura-2 | Yes |
| Cartesia | cartesia/sonic-3 | Yes |
| Hume | hume/octave-2 | Yes |
| Google | google/gemini-3.1-flash-tts-preview | Yes |
| Fish Audio | fish-audio/s2-pro | Yes |
| Inworld | inworld/inworld-tts-1.5-max | Yes |
| Murf | murf/GEN2 | Yes |
| Smallest AI | smallest-ai/lightning-v3.1 | Yes |
| Resemble | resemble/default | Yes |
| fal | fal-ai/* | — |
| Mistral | mistral/voxtral-mini-tts-2603 | Yes |
| xAI | xai/grok-tts | Yes |
* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
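The default-model rule in the footnote can be pictured as a small lookup. The defaults below are copied from the table above, but the resolution logic and the names `DEFAULT_MODELS` and `resolveModelString` are assumptions for illustration, not the SDK's actual code.

```typescript
// Defaults copied from the provider table; the lookup itself is a hypothetical sketch.
const DEFAULT_MODELS: Record<string, string> = {
  openai: "gpt-4o-mini-tts",
  elevenlabs: "eleven_v3",
  deepgram: "aura-2",
  cartesia: "sonic-3",
  hume: "octave-2",
  google: "gemini-3.1-flash-tts-preview",
  "fish-audio": "s2-pro",
  inworld: "inworld-tts-1.5-max",
  murf: "GEN2",
  "smallest-ai": "lightning-v3.1",
  resemble: "default",
  mistral: "voxtral-mini-tts-2603",
  xai: "grok-tts",
};

/** Split "provider/model" at the first slash, or fall back to the provider's default model. */
function resolveModelString(model: string): { provider: string; modelId: string } {
  const slash = model.indexOf("/");
  if (slash !== -1) {
    // Everything after the first slash is the model id, so nested ids stay intact.
    return { provider: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  if (!Object.prototype.hasOwnProperty.call(DEFAULT_MODELS, model)) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return { provider: model, modelId: DEFAULT_MODELS[model] };
}
```

In this sketch a provider without a default, such as fal, throws when given a bare provider name, so the caller has to spell out the full model string.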
Frequently asked questions
Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK is one API, every provider — same function call, same result type, same error handling. Switch providers by simply changing a model string.
One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.