# speechbench

Cross-model ASR comparison: every model × every dataset × 30 clips, on a single GCP T4 spot VM.

In the interactive report, tables are sortable by clicking any column header; green marks the best WER per dataset, and red marks hallucination (WER above 100%, meaning the model generated substantially more output than the reference contains). Model names link to their HuggingFace pages, and dataset titles link to their HF datasets.
| Model | Backend | n | WER | CER | RTFx mean | RTFx p50 | Lat mean (ms) | Lat p90 (ms) | GPU peak (MB) | Wall (s) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| whisper-large-v3-turbo | transformers | 300 | 3.35% | 1.43% | 22.6 | 22.8 | 651 | 796 | 272 | 212 |
| whisper-large-v3 | transformers | 300 | 3.56% | 1.58% | 5.3 | 5.3 | 2840 | 3675 | 1448 | 869 |
| fw-large-v3-turbo | faster-whisper | 300 | 3.62% | 1.53% | 28.1 | 28.0 | 527 | 570 | 328 | 175 |
| whisper-large-v2 | transformers | 300 | 3.73% | 1.53% | 4.8 | 4.9 | 3104 | 3985 | 1522 | 966 |
| whisper-medium | transformers | 300 | 4.59% | 1.57% | 7.5 | 7.6 | 2001 | 2595 | 892 | 617 |
| gemma-4-E4B-it | transformers | 30 | 4.76% | 1.59% | 0.2 | 0.2 | 72304 | 95251 | 5532 | 2192 |
| fw-large-v3 | faster-whisper | 300 | 5.25% | 3.15% | 11.7 | 11.9 | 1279 | 1677 | 1288 | 401 |
| gemma-4-E2B-it | transformers | 30 | 6.16% | 1.97% | 4.7 | 4.6 | 3310 | 4200 | 30 | 102 |
| parakeet-tdt-0.6b-v3 | nemo | 300 | 6.73% | 2.23% | 77.2 | 77.0 | 196 | 233 | 262 | 75 |
| whisper-small | transformers | 300 | 7.11% | 2.42% | 15.6 | 15.8 | 961 | 1262 | 352 | 305 |
| whisper-base | transformers | 300 | 13.21% | 4.32% | 29.9 | 30.3 | 501 | 648 | 174 | 167 |
| whisper-tiny | transformers | 300 | 19.83% | 6.34% | 39.0 | 39.1 | 385 | 509 | 92 | 133 |
| Model | Backend | n | WER | CER | RTFx mean | RTFx p50 | Lat mean (ms) | Lat p90 (ms) | GPU peak (MB) | Wall (s) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| whisper-large-v3 | transformers | 300 | 2.42% | 0.92% | 4.9 | 4.8 | 2514 | 3475 | 0 | 760 |
| fw-large-v3 | faster-whisper | 300 | 2.67% | 1.07% | 10.9 | 10.7 | 1111 | 1455 | 416 | 338 |
| whisper-large-v2 | transformers | 300 | 2.76% | 0.99% | 4.8 | 4.7 | 2568 | 3587 | 0 | 776 |
| whisper-large-v3-turbo | transformers | 300 | 2.78% | 1.08% | 20.6 | 20.2 | 586 | 755 | 0 | 180 |
| fw-large-v3-turbo | faster-whisper | 300 | 2.80% | 1.14% | 24.0 | 23.8 | 499 | 583 | 288 | 155 |
| parakeet-tdt-0.6b-v3 | nemo | 300 | 3.37% | 1.25% | 82.0 | 85.1 | 148 | 211 | 96 | 50 |
| whisper-medium | transformers | 300 | 3.50% | 1.22% | 7.5 | 7.4 | 1644 | 2318 | 0 | 498 |
| gemma-4-E4B-it | transformers | 30 | 3.87% | 1.40% | 0.2 | 0.2 | 54595 | 72283 | 5466 | 1644 |
| whisper-small | transformers | 300 | 5.04% | 1.63% | 15.9 | 15.6 | 777 | 1090 | 0 | 239 |
| gemma-4-E2B-it | transformers | 30 | 5.21% | 2.66% | 4.6 | 4.4 | 2579 | 3383 | 38 | 80 |
| whisper-base | transformers | 300 | 9.44% | 2.96% | 30.4 | 29.5 | 408 | 594 | 0 | 127 |
| whisper-tiny | transformers | 300 | 15.76% | 4.83% | 39.2 | 38.6 | 315 | 445 | 0 | 99 |
| Model | Backend | n | WER | CER | RTFx mean | RTFx p50 | Lat mean (ms) | Lat p90 (ms) | GPU peak (MB) | Wall (s) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| parakeet-tdt-0.6b-v3 | nemo | 300 | 6.37% | 4.03% | 66.1 | 70.5 | 182 | 268 | 620 | 75 |
| whisper-large-v2 | transformers | 300 | 7.49% | 5.19% | 3.9 | 3.9 | 2986 | 4850 | 40 | 910 |
| whisper-large-v3 | transformers | 300 | 7.65% | 5.33% | 4.0 | 4.0 | 2956 | 4775 | 80 | 901 |
| fw-large-v3 | faster-whisper | 300 | 7.80% | 5.42% | 9.1 | 9.1 | 1258 | 1941 | 448 | 391 |
| fw-large-v3-turbo | faster-whisper | 300 | 7.82% | 5.45% | 21.2 | 21.0 | 545 | 692 | 288 | 177 |
| whisper-medium | transformers | 300 | 8.15% | 5.44% | 6.1 | 6.0 | 1928 | 3184 | 56 | 592 |
| gemma-4-E4B-it | transformers | 30 | 10.46% | 7.37% | 0.2 | 0.2 | 64811 | 111409 | 5472 | 1959 |
| whisper-large-v3-turbo | transformers | 300 | 12.60% | 8.84% | 17.3 | 17.2 | 673 | 981 | 40 | 216 |
| whisper-small | transformers | 300 | 16.16% | 11.65% | 12.9 | 12.7 | 977 | 1584 | 90 | 306 |
| gemma-4-E2B-it | transformers | 30 | 16.70% | 13.11% | 5.8 | 3.9 | 2827 | 4899 | 40 | 97 |
| whisper-base | transformers | 300 | 19.04% | 13.07% | 24.9 | 24.6 | 501 | 814 | 8 | 220 |
| whisper-tiny | transformers | 300 | 30.47% | 18.04% | 31.9 | 32.2 | 412 | 622 | 6 | 138 |
| Model | Backend | n | WER | CER | RTFx mean | RTFx p50 | Lat mean (ms) | Lat p90 (ms) | GPU peak (MB) | Wall (s) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| parakeet-tdt-0.6b-v3 | nemo | 300 | 5.16% | 1.88% | 63.2 | 64.3 | 96 | 116 | 2 | 32 |
| fw-large-v3 | faster-whisper | 300 | 6.27% | 2.47% | 8.9 | 8.7 | 683 | 795 | 1184 | 208 |
| fw-large-v3-turbo | faster-whisper | 300 | 6.48% | 2.72% | 14.9 | 14.8 | 404 | 426 | 288 | 124 |
| whisper-medium | transformers | 300 | 9.08% | 3.53% | 8.0 | 7.5 | 779 | 1023 | 0 | 237 |
| whisper-large-v3 | transformers | 300 | 13.96% | 8.92% | 4.9 | 4.7 | 1412 | 1614 | 320 | 426 |
| gemma-4-E2B-it | transformers | 30 | 18.02% | 11.00% | 5.1 | 4.8 | 1230 | 1756 | 2 | 41 |
| gemma-4-E4B-it | transformers | 30 | 18.37% | 10.87% | 0.3 | 0.2 | 23342 | 35237 | 5406 | 705 |
| whisper-large-v3-turbo | transformers | 300 | 20.30% | 12.59% | 16.3 | 16.0 | 387 | 428 | 0 | 119 |
| whisper-small | transformers | 300 | 29.10% | 19.72% | 17.9 | 16.6 | 401 | 462 | 0 | 123 |
| whisper-large-v2 | transformers | 300 | 30.83% | 17.50% | 4.7 | 4.5 | 1540 | 1653 | 340 | 473 |
| whisper-base | transformers | 300 | 70.73% | 38.47% | 33.3 | 31.4 | 254 | 259 | 0 | 79 |
| whisper-tiny | transformers | 300 | 95.53% | 52.32% | 42.4 | 40.5 | 214 | 203 | 0 | 67 |
Generated by speechbench.
Each cell is n=30 clips from the HuggingFace dataset's test split, evaluated on a single NVIDIA T4 spot VM (n1-standard-8, 30 GB RAM) in safecare-maps.
Raw per-clip transcripts (with punctuation preserved) are in `results/` on GitHub.
WER is computed on Whisper-style normalized text (lowercase, punctuation stripped, contractions expanded) via jiwer.
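The WER metric can be reproduced in spirit with a few lines of stdlib Python. This is a simplified stand-in, not the benchmark's code: the benchmark uses jiwer, and full Whisper-style normalization also expands contractions, which is skipped here. Note that WER above 1.0 (100%) is possible when the hypothesis contains more errors than the reference has words, which is what the red hallucination flag catches.

```python
import re

def normalize(text: str) -> list[str]:
    # Simplified Whisper-style normalization: lowercase, strip punctuation.
    # Apostrophes are kept so contractions survive; the real normalizer
    # additionally expands them ("don't" -> "do not").
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level edit distance / reference word count.
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("empty reference after normalization")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("The cat, sat.", "the cat sat"))  # → 0.0
print(wer("hi", "hello there world"))       # → 3.0 (WER > 100%: hallucination)
```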
RTFx = audio seconds / wall seconds; values above 1 are faster than real-time.
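As a quick illustration of the "RTFx mean" and "RTFx p50" columns, using hypothetical per-clip timings (not values from the tables above):

```python
from statistics import mean, median

def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    # Seconds of audio transcribed per second of wall-clock time.
    return audio_seconds / wall_seconds

# Hypothetical (audio_seconds, wall_seconds) measurements for three clips:
clips = [(10.0, 0.5), (6.0, 0.25), (8.0, 0.5)]
ratios = [rtfx(a, w) for a, w in clips]  # 20.0, 24.0, 16.0

print(mean(ratios))    # → 20.0 ("RTFx mean" column)
print(median(ratios))  # → 20.0 ("RTFx p50" column)
```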
GPU peak is measured via pynvml at 100 ms intervals during inference.
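The polling described above can be sketched as a small background thread that tracks the maximum of repeated readings. The pynvml wiring shown in the docstring is an assumption about the setup, not code from the benchmark; the sampler itself works with any zero-argument byte-count reader.

```python
import threading
import time

class PeakSampler:
    """Poll `read_bytes()` on a background thread and record the peak.

    A pynvml-based reader would look roughly like this (assumption,
    not taken from the benchmark source):

        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        read = lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used
    """

    def __init__(self, read_bytes, interval_s: float = 0.1):
        self._read = read_bytes
        self._interval = interval_s
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak = max(self.peak, self._read())
            self._stop.wait(self._interval)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # One final reading so very short runs still record a sample.
        self.peak = max(self.peak, self._read())

# Usage: run inference inside the context, then read sampler.peak.
# with PeakSampler(read, interval_s=0.1) as sampler:
#     model.transcribe(clip)
# print(sampler.peak)
```

Polling at a fixed interval can miss allocation spikes shorter than the interval, so the reported peak is a lower bound on true peak usage.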