
Scores

Objective scores over the same 5 prompts. UTMOS = predicted naturalness (higher is better); WER = ASR word-error rate against the intended text, a failure-detector rather than a fine ranking (lower is better); SIM = speaker similarity to the cloned reference (chris_hemsworth_15s, higher is better). Each score is the mean over the exact clips shown on Listen. Human votes are the preference ground truth; these objective metrics are backstops.
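As a rough illustration of the WER column, the sketch below computes word-error rate as a word-level edit distance divided by the number of reference words, then averages it over clips. It is an assumption-laden simplification: the function names (`word_errors`, `wer`, `mean_wer`) are hypothetical, the text normalization is just lowercasing and whitespace splitting, and the hypothesis strings stand in for ASR transcripts of the generated audio. The benchmark's actual pipeline may normalize and aggregate differently.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    # Simplistic normalization (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate of one clip: edit distance / reference length."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference text must contain at least one word")
    return errors / n_ref


def mean_wer(pairs: list[tuple[str, str]]) -> float:
    """Mean of per-clip WER over (intended_text, transcript) pairs."""
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    clips = [
        ("the cat sat on the mat", "the cat sat on the mat"),
        ("the cat sat on the mat", "the cat sat on mat"),
    ]
    print(f"mean WER over {len(clips)} clips: {mean_wer(clips):.3f}")
```

With only 5 prompts per model, a single misrecognized word shifts the mean noticeably, which is one reason to read WER as a failure-detector rather than a fine ranking.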

Default voice · naturalness + intelligibility

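| Model | Size | UTMOS ↑ | WER ↓ |
|---|---|---|---|
| Chatterbox | 1.2B | 4.423 | 0.061 |
| Chatterbox Turbo | 744M | 4.274 | 0.074 |
| Coqui XTTS-v2 | 750M | 4.056 | 0.065 |
| Dia 1.6B-0626 | 1.6B | 2.387 | 0.317 |
| dots.tts (soar) | 2B | 3.227 | 0.059 |
| DramaBox | 3.3B | 4.304 | 0.093 |
| F5-TTS v1 | 330M | 4.081 | 0.107 |
| Higgs Audio v3 TTS | 4B | 4.372 | 0.065 |
| IndexTTS-2 | 1.5B | 4.257 | 0.086 |
| KittenTTS Nano 0.1 | <100M | 3.665 | 0.093 |
| Kokoro | 82M | 4.302 | 0.065 |
| LuxTTS | 123M | 3.540 | 0.092 |
| Magpie-TTS | 357M | 4.199 | 0.087 |
| Mars5-TTS | 1.2B | 3.540 | 0.361 |
| Maya1 | 3B | 4.487 | 0.066 |
| MeloTTS | ~52M | 3.579 | 0.072 |
| MiraTTS | 0.5B | 3.803 | 0.100 |
| NeuTTS Air | 748M | 4.003 | 0.117 |
| NeuTTS Nano | 229M | 3.572 | 0.066 |
| OmniVoice | ~1B | 4.104 | 0.040 |
| OuteTTS 1.0 1B | 1B | 4.386 | 0.070 |
| Parler-TTS Mini v1 | 878M | 3.757 | 0.149 |
| Piper | ~25MB | 4.077 | 0.066 |
| Pocket-TTS | 100M | 4.097 | 0.054 |
| Qwen3-TTS 1.7B (CUDA-graph) | 1.7B | 4.323 | 0.065 |
| Qwen3-TTS 1.7B Base | 1.7B | 4.276 | 0.096 |
| Sesame CSM-1B | 1B | 4.152 | 0.114 |
| Soprano 1.1 80M | 80M | 4.116 | 0.059 |
| Step-Audio-EditX | 3B | 4.399 | 0.044 |
| StyleTTS 2 | ~148M | 4.259 | 0.158 |
| Supertonic 3 | 99M | 4.195 | 0.065 |
| VibeVoice Realtime 0.5B | 0.5B | 4.043 | 0.148 |
| VoxCPM2 2B | 2B | 3.480 | 0.023 |
| Voxtral 4B TTS | 4B | 3.692 | 0.081 |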
Scored over the 5 bench prompts (a thin sample, so WER is a failure-detector, not a fine ranking). Checkpoints: UTMOS utmos22_strong (SpeechMOS); SIM canonical UniSpeech-SAT wavlm_large_finetune; WER Whisper-large-v3. Method follows seed-tts-eval. Human votes are the preference ground truth; these are objective backstops.
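One way to read WER as a failure-detector rather than a ranking is to use it only as a gate and order the surviving models by UTMOS. The sketch below does that for a handful of rows copied from the table. The cutoff `WER_FAIL_THRESHOLD = 0.2` and the helper name `split_by_wer` are hypothetical choices for illustration, not values or code used by the benchmark.

```python
# (model, UTMOS, WER) rows copied from the default-voice table; a subset for brevity.
ROWS = [
    ("Chatterbox", 4.423, 0.061),
    ("Dia 1.6B-0626", 2.387, 0.317),
    ("Mars5-TTS", 3.540, 0.361),
    ("StyleTTS 2", 4.259, 0.158),
    ("VoxCPM2 2B", 3.480, 0.023),
]

# Hypothetical cutoff (an assumption): above this, treat the model as failing
# intelligibility on the bench prompts instead of comparing small WER gaps.
WER_FAIL_THRESHOLD = 0.2


def split_by_wer(rows, threshold=WER_FAIL_THRESHOLD):
    """Split rows into (passed, failed) by WER; sort passed by UTMOS, best first."""
    passed = [row for row in rows if row[2] <= threshold]
    failed = [row for row in rows if row[2] > threshold]
    passed.sort(key=lambda row: row[1], reverse=True)
    return passed, failed


if __name__ == "__main__":
    passed, failed = split_by_wer(ROWS)
    for name, utmos, wer in passed:
        print(f"ok    {name:<16} UTMOS={utmos:.3f} WER={wer:.3f}")
    for name, utmos, wer in failed:
        print(f"FLAG  {name:<16} UTMOS={utmos:.3f} WER={wer:.3f}")
```

Under this assumed cutoff, differences such as 0.061 versus 0.065 are ignored, while the two rows above 0.3 are flagged for a closer listen.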