Why ChatTTS is a Game-Changer 🤩
- Conversational TTS: Optimized for dialogue, ChatTTS produces natural, expressive speech and supports multiple speakers for interactive conversations (see the sketch after this list).
- Fine-grained Control: Predict and control prosodic features like laughter, pauses, and interjections for added realism.
- Better Prosody: ChatTTS excels in prosody, surpassing most open-source TTS models to deliver a lifelike experience.
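To make the conversational use concrete, here is a minimal synthesis sketch using the ChatTTS Python package. The calls (`Chat.load`, `sample_random_speaker`, `InferCodeParams`) follow the upstream README, but exact names have shifted between releases, so treat this as a sketch rather than canonical usage:

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False)  # compile=True trades startup time for faster inference

# Sample one random speaker embedding and reuse it so every turn
# of the dialogue is spoken by the same voice.
spk_emb = chat.sample_random_speaker()
params = ChatTTS.Chat.InferCodeParams(spk_emb=spk_emb)

texts = [
    "Hi there, welcome to the demo.",
    "Thanks, happy to be here.",
]
wavs = chat.infer(texts, params_infer_code=params)

for i, wav in enumerate(wavs):
    audio = torch.from_numpy(wav)
    if audio.dim() == 1:
        audio = audio.unsqueeze(0)  # torchaudio expects (channels, frames)
    torchaudio.save(f"turn_{i}.wav", audio, 24000)  # ChatTTS outputs 24 kHz
```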
FAQ
Q: How much VRAM do I need, and what's the inference speed?
A: For a 30-second audio clip, you'll need at least 4GB of GPU memory. On a 4090 GPU, ChatTTS generates audio at about 7 semantic tokens per second, with a Real-Time Factor (RTF) of around 0.3.
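In other words, an RTF of 0.3 means a clip takes roughly a third of its duration to generate, so a 30-second clip renders in about 9 seconds. To measure RTF on your own hardware, here's a simple timing sketch (assuming a loaded `chat` instance as in the example above, and 24 kHz output):

```python
import time

start = time.time()
wavs = chat.infer(["A sentence long enough to give a stable timing measurement."])
elapsed = time.time() - start

audio_seconds = wavs[0].shape[-1] / 24000  # ChatTTS outputs 24 kHz audio
rtf = elapsed / audio_seconds              # RTF < 1 means faster than real time
print(f"{audio_seconds:.1f}s of audio in {elapsed:.1f}s (RTF {rtf:.2f})")
```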
Q: What if generation is unstable, with issues like the voice switching between speakers or poor audio quality?
A: This is a known limitation of autoregressive models (such as Bark and VALL-E), and it's hard to eliminate entirely. In practice, generating multiple samples and keeping the best one works well; see the sketch below.
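A practical workaround is to fix the speaker embedding and generate several candidates under different random seeds, then keep the cleanest take. A rough sketch, reusing the `chat` and `params` objects from the first example:

```python
import torch
import torchaudio

text = "The same line, sampled several times."
for seed in range(5):
    torch.manual_seed(seed)  # vary the sampling path per candidate
    wav = chat.infer([text], params_infer_code=params)[0]
    audio = torch.from_numpy(wav)
    if audio.dim() == 1:
        audio = audio.unsqueeze(0)  # torchaudio expects (channels, frames)
    torchaudio.save(f"candidate_{seed}.wav", audio, 24000)
# Listen through the candidates and keep the one without artifacts.
```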
Q: Besides laughter, can we control other emotions or elements?
A: Currently, the only token-level control units are [laugh], [uv_break], and [lbreak]. Future versions of ChatTTS may include additional emotional control capabilities, so stay tuned!
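These control tokens are embedded directly in the input text. A short sketch, again reusing `chat` and `params`; `skip_refine_text=True` is documented upstream as the way to keep manually placed tokens from being rewritten by the text-refinement stage:

```python
# [uv_break] inserts a pause, [laugh] a laugh, [lbreak] a sentence-final break.
text = "So [uv_break] that was unexpected [laugh] but it all worked out. [lbreak]"
wavs = chat.infer([text], skip_refine_text=True, params_infer_code=params)
```

If you prefer coarser control, the refine-text stage also accepts a prompt such as `[oral_2][laugh_0][break_6]` via `ChatTTS.Chat.RefineTextParams` to tune how many oral fillers, laughs, and breaks it inserts.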