ChatTTS - Conversational Text-to-Speech Model

Hey there! ChatTTS is a text-to-speech model built for dialogue scenarios like chatbots and virtual assistants. It supports English and Chinese, and the full model is trained on 100,000+ hours of data to deliver natural, expressive speech. The open-source version on HuggingFace is a smaller model pre-trained on 40,000 hours of data, which makes it a good fit for research and development.

Why ChatTTS is a Game-Changer 🤩


Q: How much VRAM do I need, and what's the inference speed?

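A: For a 30-second audio clip, you'll need at least 4GB of GPU memory. On an NVIDIA RTX 4090, ChatTTS generates audio at about 7 semantic tokens per second, with a Real-Time Factor (RTF) of around 0.3. RTF is processing time divided by audio duration, so an RTF of 0.3 means roughly 0.3 seconds of compute for every second of speech.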
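To make the RTF figure concrete, here is a minimal sketch of the arithmetic. The function name `generation_time_seconds` is made up for illustration and is not part of ChatTTS; the default of 0.3 is just the approximate 4090 figure quoted above, so treat the result as a rough estimate.

```python
def generation_time_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Estimate wall-clock synthesis time from a Real-Time Factor.

    RTF = processing time / audio duration, so
    processing time = audio duration * RTF.
    The 0.3 default is the approximate figure quoted for a 4090 GPU.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds * rtf


if __name__ == "__main__":
    # A 30-second clip at RTF 0.3 takes roughly 9 seconds to generate.
    print(f"{generation_time_seconds(30.0):.1f} s")
```

An RTF below 1.0 means generation runs faster than playback, which is what you want for interactive use.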

Q: What if the output is unstable, for example multiple speaker voices in one clip or poor audio quality?

A: This is a common issue with autoregressive models (like Bark and VALL-E). It's hard to avoid entirely, but you can generate several samples and keep the one that sounds best.
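The "try multiple samples" approach can be sketched as a simple best-of-N loop. Everything here is an assumption for illustration: `synthesize` and `score` are placeholder callables you would supply yourself (for example, a wrapper around your TTS call with a fixed random seed, and a quality check of your choosing). Neither is a ChatTTS API.

```python
from typing import Callable, Tuple, TypeVar

Audio = TypeVar("Audio")


def best_of_n(
    text: str,
    synthesize: Callable[[str, int], Audio],  # hypothetical: (text, seed) -> audio
    score: Callable[[Audio], float],          # hypothetical: higher is better
    n: int = 5,
) -> Tuple[int, Audio]:
    """Generate n candidates with seeds 0..n-1 and return the best (seed, audio)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best_seed = 0
    best_audio = synthesize(text, 0)
    best_score = score(best_audio)
    for seed in range(1, n):
        audio = synthesize(text, seed)
        candidate_score = score(audio)
        if candidate_score > best_score:
            best_seed, best_audio, best_score = seed, audio, candidate_score
    return best_seed, best_audio


# Stand-ins so the sketch runs without a model: the "audio" is a list of
# numbers, and the scorer arbitrarily prefers the sample made with seed 3.
def fake_synthesize(text: str, seed: int) -> list:
    return [seed] * len(text)


def fake_score(audio: list) -> float:
    return -abs(audio[0] - 3)


if __name__ == "__main__":
    seed, _ = best_of_n("hello", fake_synthesize, fake_score, n=5)
    print("kept seed", seed)
```

Returning the seed alongside the audio lets you reproduce a good sample later, provided your synthesis wrapper is deterministic for a given seed.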

Q: Besides laughter, can we control other emotions or elements?

A: Currently, the only token-level control units are [laugh], [uv_break], and [lbreak]. Future versions of ChatTTS may include additional emotional control capabilities, so stay tuned!
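If you write control tokens by hand, a quick check can catch typos before you spend GPU time. This is a minimal sketch based solely on the three tokens listed above; the helper `unsupported_tokens` and the bracket pattern are assumptions for illustration, not part of ChatTTS.

```python
import re
from typing import List

# The three token-level control units named in the answer above.
SUPPORTED_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}

# Assumption: a control token looks like [word], using letters, digits, underscores.
_TOKEN_PATTERN = re.compile(r"\[[A-Za-z0-9_]+\]")


def unsupported_tokens(text: str) -> List[str]:
    """Return bracketed tokens in text that are not in SUPPORTED_TOKENS."""
    return [tok for tok in _TOKEN_PATTERN.findall(text) if tok not in SUPPORTED_TOKENS]


if __name__ == "__main__":
    line = "That's hilarious [laugh] give me a second [uv_break] okay [sad][lbreak]"
    print(unsupported_tokens(line))
```

Here `[sad]` would be flagged, since it is not one of the three supported tokens.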