ChatTTS - The Ultimate Conversational TTS Model

Why ChatTTS is a Game-Changer 🤩

Conversational TTS: Optimized for dialogue, ChatTTS enables natural, expressive speech, supporting multiple speakers for interactive conversations.
Fine-grained Control: Predict and control prosodic features like laughter, pauses, and interjections for added realism.
Better Prosody: ChatTTS excels in prosody, surpassing most open-source TTS models to deliver a lifelike experience.

FAQ

Q: How much VRAM do I need, and what's the inference speed?

A: For a 30-second audio clip, you'll need at least 4 GB of GPU memory. On an RTX 4090 GPU, ChatTTS generates audio at about 7 semantic tokens per second, with a Real-Time Factor (RTF, generation time divided by audio duration) of around 0.3.
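To make the RTF figure concrete, here is a small arithmetic sketch. It assumes the usual definition of RTF (time spent generating divided by the duration of the audio produced); the function names are illustrative and not part of ChatTTS.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Estimate wall-clock generation time for a clip at a given RTF."""
    return audio_seconds * rtf


# A 30-second clip at the quoted RTF of roughly 0.3.
estimate = estimated_generation_seconds(30.0)
print(f"Estimated generation time: {estimate:.1f} s")
```

Under that assumption, a 30-second clip takes roughly 9 seconds to generate on the quoted hardware. Actual timings will vary with the GPU, batch size, and text.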

Q: What if the model is unstable, producing multiple speakers in one clip or poor audio quality?

A: This is a common issue with autoregressive models (such as Bark and VALL-E) and is hard to eliminate entirely. In practice, generate several samples and keep the one that sounds best.
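The "try multiple samples" advice can be sketched as a best-of-N loop over random seeds. This is only an illustration: `synthesize` and `score` are hypothetical callables that you would supply yourself (a wrapper around your TTS call and some quality measure, or a manual listening pass). Neither is a ChatTTS API, and the stand-ins below just produce random numbers so the example runs on its own.

```python
import random
from typing import Callable, List, Optional, Tuple


def best_of_n(
    synthesize: Callable[[str, int], List[float]],
    score: Callable[[List[float]], float],
    text: str,
    n_samples: int = 4,
    base_seed: int = 0,
) -> Tuple[int, List[float]]:
    """Generate n_samples candidates with different seeds and keep the best one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    best_seed: Optional[int] = None
    best_audio: Optional[List[float]] = None
    best_score = float("-inf")
    for seed in range(base_seed, base_seed + n_samples):
        audio = synthesize(text, seed)
        candidate_score = score(audio)
        if best_audio is None or candidate_score > best_score:
            best_seed, best_audio, best_score = seed, audio, candidate_score
    return best_seed, best_audio


# Hypothetical stand-ins so the sketch is runnable without a model.
def fake_synthesize(text: str, seed: int) -> List[float]:
    rng = random.Random(seed)  # the seed makes each candidate reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(len(text) * 10)]


def fake_score(audio: List[float]) -> float:
    # Placeholder metric: prefer a lower peak amplitude. Not a real quality measure.
    return -max(abs(x) for x in audio)


seed, audio = best_of_n(fake_synthesize, fake_score, "Hello there", n_samples=4)
print(f"Kept the sample generated with seed {seed}")
```

Recording the winning seed lets you regenerate the same sample later, provided your synthesis call is deterministic for a given seed.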

Q: Besides laughter, can we control other emotions or elements?

A: Currently, the only token-level control units are [laugh], [uv_break], and [lbreak]. Future versions of ChatTTS may include additional emotional control capabilities, so stay tuned!
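Since only three control tokens are listed, it can help to check input text for unsupported tags before synthesis. The helper below is a plain string-processing sketch: the token names come from the answer above, while the function name and the bracket-matching pattern are assumptions made for illustration and are not part of ChatTTS.

```python
import re
from typing import List

# Token-level control units named in the FAQ answer.
SUPPORTED_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}

# Assumed tag shape: square brackets around letters, digits, or underscores.
_TAG_PATTERN = re.compile(r"\[[A-Za-z0-9_]+\]")


def unsupported_tokens(text: str) -> List[str]:
    """Return bracketed tags in text that are not in SUPPORTED_TOKENS."""
    return [tag for tag in _TAG_PATTERN.findall(text) if tag not in SUPPORTED_TOKENS]


line = "That is hilarious [laugh] give me a second [uv_break] okay, go on [lbreak]"
print(unsupported_tokens(line))
print(unsupported_tokens("Oh no [cry] that was close [laugh]"))
```

The first call finds nothing to flag, while the second reports `[cry]`, which is not among the supported tokens.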