Word error rate (WER) is dominated by the audio channel, not by the model. The same Nordic Whisper fine-tune produces 2–3× different WER depending on where the audio comes from. This page sets honest expectations per deployment scenario.
Last updated: 2026-04-24
We tested the deployed pipeline on 7 real Icelandic clips from RÚV (news, interviews, podcasts) with manually verified ground truth, in both ideal (16kHz PCM) and telephony-simulated (8kHz μ-law) modes:
| Clip | Type | Ideal WER | Telephony WER |
|---|---|---|---|
| clip01_news | News read-speech | 14.5% | 17.1% |
| clip02_news | News read-speech | 17.3% | 28.8% |
| clip04_interview | Interview, conversational | 24.6% | 39.1% |
| clip05_podcast | Podcast, two speakers | 16.4% | 18.2% |
| clip06_podcast | Podcast, clear single speaker | 6.0% | 12.0% |
| clip07_challenging | Marked "challenging" — heavy accent + bg noise | 33.8% | 28.7% |
| Mean (excluding ground-truth-mismatch outlier) | | ~18.8% | ~24.0% |
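
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below is a minimal illustration and not necessarily how the deployed pipeline computes it: `wer` and `RESULTS` are names made up for this example, no text normalization (casing, punctuation) is applied, and the figures are copied from the table above. It also recomputes the mean row and the gap between the two modes in percentage points.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# (ideal WER %, telephony WER %) per clip, copied from the table above.
RESULTS = {
    "clip01_news": (14.5, 17.1),
    "clip02_news": (17.3, 28.8),
    "clip04_interview": (24.6, 39.1),
    "clip05_podcast": (16.4, 18.2),
    "clip06_podcast": (6.0, 12.0),
    "clip07_challenging": (33.8, 28.7),
}

ideal_mean = sum(ideal for ideal, _ in RESULTS.values()) / len(RESULTS)
telephony_mean = sum(tel for _, tel in RESULTS.values()) / len(RESULTS)

print(f"ideal mean:     {ideal_mean:.1f}%")       # 18.8%
print(f"telephony mean: {telephony_mean:.1f}%")   # 24.0%
print(f"gap: {telephony_mean - ideal_mean:.1f} percentage points")  # 5.2
```

One substituted word in a four-word reference gives `wer("a b c d", "a x c d") == 0.25`, i.e. 25% WER.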
The Carl-test demo at demo.muninlabs.io uses Twilio (US) or 46elks (Sweden) PSTN bridges to receive your call. Both introduce ~5–13 percentage points of additional WER vs Tier 2 (Genesys AudioHook). What you see there is the worst case for our stack.
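
The telephony-simulated column in the table above approximates this kind of narrowband path. As a rough sketch of what such a simulation can look like, assuming samples are floats in [-1, 1] at 16 kHz: halve the sample rate, then pass each sample through 8-bit μ-law companding and back. All function names here are hypothetical, the pair-averaging step is a crude stand-in for a proper anti-aliasing resampler, and the continuous μ-law curve only approximates the segmented G.711 encoding; the actual diagnostic script may do this differently.

```python
import math

MU = 255  # mu-law companding constant


def downsample_16k_to_8k(samples):
    """Halve the sample rate by averaging each pair of samples (crude low-pass)."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]


def mulaw_encode(x):
    """Compress a sample in [-1, 1] to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mulaw_decode(code):
    """Expand an 8-bit code back to a sample in [-1, 1]."""
    y = code / 255 * 2 - 1
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def simulate_telephony(samples_16k):
    """16 kHz float samples in, 8 kHz float samples with mu-law quantization noise out."""
    narrow = downsample_16k_to_8k(samples_16k)
    return [mulaw_decode(mulaw_encode(s)) for s in narrow]


# Usage: one second of a 440 Hz tone at 16 kHz becomes 8000 degraded samples.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
degraded = simulate_telephony(tone)
print(len(degraded))  # 8000
```

The two losses are separate: downsampling removes everything above 4 kHz, and companding adds quantization noise that grows with signal amplitude.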
Production deployments via Genesys AudioHook receive 16kHz wide-band audio directly, with channel-split for caller and agent. Same models, dramatically better quality.
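
To illustrate what channel-split audio makes possible, here is a minimal sketch that separates a two-channel buffer into caller and agent streams so each can be transcribed on its own. It assumes interleaved 16-bit little-endian PCM with the caller on the first channel; that layout and the name `split_channels` are assumptions made for this example, not a description of the AudioHook wire format.

```python
import struct


def split_channels(interleaved: bytes):
    """Split interleaved 2-channel 16-bit LE PCM into (first_channel, second_channel)."""
    if len(interleaved) % 4 != 0:
        raise ValueError("buffer must hold whole 2-channel, 16-bit frames")
    count = len(interleaved) // 2
    samples = struct.unpack(f"<{count}h", interleaved)
    # Even indices belong to the first channel, odd indices to the second.
    return list(samples[0::2]), list(samples[1::2])


# Usage: two frames, with the (assumed) caller channel first in each frame.
buffer = struct.pack("<4h", 100, -100, 200, -200)
caller, agent = split_channels(buffer)
print(caller, agent)  # [100, 200] [-100, -200]
```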
Diagnostic script: scripts/diagnose_full_pipeline.py (open source, in our repo).