chris_hemsworth_15s, higher better). Each score is the mean over the exact clips shown on Listen. Human votes are the preference ground truth; these objective metrics are backstops.

| Model | Size | UTMOS β | WER β |
|---|---|---|---|
| Chatterbox | 1.2B | 4.423 | 0.061 |
| Chatterbox Turbo | 744M | 4.274 | 0.074 |
| Coqui XTTS-v2 | 750M | 4.056 | 0.065 |
| Dia 1.6B-0626 | 1.6B | 2.387 | 0.317 |
| dots.tts (soar) | 2B | 3.227 | 0.059 |
| DramaBox | 3.3B | 4.304 | 0.093 |
| F5-TTS v1 | 330M | 4.081 | 0.107 |
| Higgs Audio v3 TTS | 4B | 4.372 | 0.065 |
| IndexTTS-2 | 1.5B | 4.257 | 0.086 |
| KittenTTS Nano 0.1 | <100M | 3.665 | 0.093 |
| Kokoro | 82M | 4.302 | 0.065 |
| LuxTTS | 123M | 3.540 | 0.092 |
| Magpie-TTS | 357M | 4.199 | 0.087 |
| Mars5-TTS | 1.2B | 3.540 | 0.361 |
| Maya1 | 3B | 4.487 | 0.066 |
| MeloTTS | ~52M | 3.579 | 0.072 |
| MiraTTS | 0.5B | 3.803 | 0.100 |
| NeuTTS Air | 748M | 4.003 | 0.117 |
| NeuTTS Nano | 229M | 3.572 | 0.066 |
| OmniVoice | ~1B | 4.104 | 0.040 |
| OuteTTS 1.0 1B | 1B | 4.386 | 0.070 |
| Parler-TTS Mini v1 | 878M | 3.757 | 0.149 |
| Piper | ~25MB | 4.077 | 0.066 |
| Pocket-TTS | 100M | 4.097 | 0.054 |
| Qwen3-TTS 1.7B (CUDA-graph) | 1.7B | 4.323 | 0.065 |
| Qwen3-TTS 1.7B Base | 1.7B | 4.276 | 0.096 |
| Sesame CSM-1B | 1B | 4.152 | 0.114 |
| Soprano 1.1 80M | 80M | 4.116 | 0.059 |
| Step-Audio-EditX | 3B | 4.399 | 0.044 |
| StyleTTS 2 | ~148M | 4.259 | 0.158 |
| Supertonic 3 | 99M | 4.195 | 0.065 |
| VibeVoice Realtime 0.5B | 0.5B | 4.043 | 0.148 |
| VoxCPM2 2B | 2B | 3.480 | 0.023 |
| Voxtral 4B TTS | 4B | 3.692 | 0.081 |

Metrics: UTMOS is scored with utmos22_strong (SpeechMOS), SIM with the canonical UniSpeech-SAT wavlm_large_finetune checkpoint, and WER from Whisper-large-v3 transcripts. The method follows seed-tts-eval.
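
To make the WER column concrete, here is a minimal sketch of how a per-clip word error rate and its mean over clips could be computed. It assumes the ASR transcripts (from Whisper-large-v3 in the leaderboard) are already available as strings; the function names `normalize`, `word_error_rate`, and `mean_wer` are hypothetical, and the text normalization shown is an assumption, not necessarily the one seed-tts-eval applies.

```python
import re
from statistics import mean


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace.

    This normalization is an assumption made for illustration only.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs: list[tuple[str, str]]) -> float:
    """Average the per-clip WER over (prompt text, ASR transcript) pairs."""
    return mean(word_error_rate(ref, hyp) for ref, hyp in pairs)
```

Averaging per clip matches the statement that each score is the mean over the clips; pooling all errors and words into a single ratio would weight long clips more heavily and can give a different number.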