Cartesia


Prefix	`cartesia`
Default model	`sonic-3`
Env var	`CARTESIA_API_KEY`
Official docs	docs.cartesia.ai

Models

Model	Streaming	Audio Tags	Voice Cloning	Notes
`sonic-3`	Yes	Yes (via SSML)	Yes	Current flagship; emotion tags
`sonic-2`	Yes	No	No	Previous generation

Usage

import { generateSpeech } from "@speech-sdk/core"

const result = await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello from SpeechSDK!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})

Default output is audio/wav at 44.1 kHz.
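
To confirm the format of the audio you receive, you can inspect the WAV header directly. The sketch below is not part of SpeechSDK: `readWavInfo` is a hypothetical helper, and it assumes a canonical RIFF/WAVE layout in which the `fmt ` chunk immediately follows the 12-byte RIFF header. How you obtain the raw bytes from the `generateSpeech` result is not shown here, since the result shape is not described on this page.

```typescript
// Hypothetical helper, not part of @speech-sdk/core.
interface WavInfo {
  channels: number
  sampleRate: number
  bitsPerSample: number
}

function readWavInfo(bytes: Uint8Array): WavInfo {
  if (bytes.length < 44) {
    throw new Error("Too short to contain a canonical 44-byte WAV header")
  }
  // Read a 4-character ASCII chunk tag at the given byte offset.
  const tag = (o: number): string =>
    String.fromCharCode(bytes[o], bytes[o + 1], bytes[o + 2], bytes[o + 3])

  // Assumption: "fmt " is the first chunk after the RIFF/WAVE header.
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt ") {
    throw new Error("Not a canonical RIFF/WAVE file")
  }

  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  return {
    channels: view.getUint16(22, true), // little-endian
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  }
}
```

With the default output described above, `sampleRate` would be expected to read 44100.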

Audio Tags

`sonic-3` supports audio tags through two paths, and drops any tag it does not recognize:

Emotion tags (`[happy]`, `[sad]`, `[angry]`, `[excited]`, etc.) are converted to Cartesia's SSML `<emotion>` elements.
`[laughter]` is passed through natively.
Unknown tags are stripped with a warning.

await generateSpeech({
  model: "cartesia/sonic-3",
  text: "[happy] What a lovely day! [laughter] I can't believe it.",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})
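
The tag handling described above can be pictured with a small standalone sketch. This is an illustration only, not SpeechSDK's actual implementation: `convertAudioTags`, the `EMOTIONS` set, and the exact `<emotion value="..." />` output form are assumptions made for the example, and the real list of supported emotions may be longer.

```typescript
// Illustrative sketch; names and the emotion list are assumptions.
const EMOTIONS = new Set(["happy", "sad", "angry", "excited"])
const PASSTHROUGH = new Set(["laughter"])

interface TagResult {
  text: string
  warnings: string[]
}

function convertAudioTags(input: string): TagResult {
  const warnings: string[] = []
  // Match bracketed tags such as [happy] or [laughter].
  const text = input.replace(/\[([a-z_]+)\]/gi, (match: string, name: string) => {
    const tag = name.toLowerCase()
    if (EMOTIONS.has(tag)) {
      // Emotion tags become SSML-style elements (assumed attribute form).
      return `<emotion value="${tag}" />`
    }
    if (PASSTHROUGH.has(tag)) {
      // [laughter] is left untouched for the provider to interpret.
      return match
    }
    // Anything else is removed and reported.
    warnings.push(`Unknown audio tag stripped: ${match}`)
    return ""
  })
  return { text, warnings }
}
```

Under these assumptions, the text in the example above would reach Cartesia with `[happy]` replaced by an `<emotion>` element and `[laughter]` left in place.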

Voice Cloning

sonic-3 supports inline voice cloning from reference audio. See Voice Cloning for details.

await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello in a cloned voice!",
  voice: { audio: "base64-encoded-audio..." },
})
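
The `audio` value in the example above is a base64 string. A minimal sketch of producing one in Node.js follows; `toBase64` is a hypothetical helper, and the file name in the usage comment is a placeholder. Which audio formats and durations are accepted, and whether a data-URI prefix is required, are not stated on this page, so check the Voice Cloning documentation before relying on this.

```typescript
import { Buffer } from "node:buffer"

// Hypothetical helper: encode raw audio bytes as a base64 string.
function toBase64(bytes: Uint8Array): string {
  return Buffer.from(bytes).toString("base64")
}

// Hypothetical usage (file name is a placeholder):
//   import { readFile } from "node:fs/promises"
//   const reference = toBase64(await readFile("reference.wav"))
//   await generateSpeech({
//     model: "cartesia/sonic-3",
//     text: "Hello in a cloned voice!",
//     voice: { audio: reference },
//   })
```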

Provider Options

await generateSpeech({
  model: "cartesia/sonic-3",
  text: "Hello!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
  providerOptions: {
    language: "en",
    output_format: {
      container: "wav",
      encoding: "pcm_s16le",
      sample_rate: 44_100,
    },
    speed: "normal",
  },
})

Custom Configuration

import { generateSpeech } from "@speech-sdk/core"
import { createCartesia } from "@speech-sdk/core/providers"

const cartesia = createCartesia({
  apiKey: process.env.CARTESIA_API_KEY,
})

const result = await generateSpeech({
  model: cartesia("sonic-3"),
  text: "Hello!",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
})
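
When passing the key explicitly, it can help to fail early if the variable is unset rather than discover it on the first request. The guard below is a sketch, not an SDK feature: `requireEnv` is a hypothetical helper, and it takes the environment as a parameter so it can be exercised without touching `process.env`.

```typescript
// Hypothetical helper, not part of @speech-sdk/core.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim()
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`)
  }
  return value
}

// Hypothetical usage with the provider factory shown above:
//   const cartesia = createCartesia({
//     apiKey: requireEnv("CARTESIA_API_KEY", process.env),
//   })
```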
