WER expectations by environment

Word-error-rate is dominated by the audio channel, not by the model. Same Nordic Whisper-finetune produces 2-3× different WER depending on where the audio comes from. This page sets honest expectations per deployment scenario.

Last updated: 2026-04-24

The four tiers

Tier 1 — Lab / clean read-speech

~6–10% WER

FLEURS, NPSC clean read-speech, 16kHz mono. Used for academic benchmarks. Not representative of real call audio.

Tier 2 — Genesys AudioHook (production target)

~12–18% WER

Wide-band 16kHz audio direct from Genesys Cloud over WSS. PCMU codec but no PSTN-internasjonal degradation. Stereo channel-split for caller/agent. This is what your customers will see.

Tier 3 — PSTN-international demo (Twilio / 46elks)

~18–28% WER

8kHz μ-law over international PSTN. Multiple codec hops, audio compression, network jitter. This is what the public demo at demo.muninlabs.io uses. Worst-case representative.

Tier 4 — Background-noise call-center realism

~22–35% WER

Real production audio with keyboard noise, multiple speakers, hold-music bleed, accent variations. Improves with per-customer LoRA finetune (Q3 2026 premium feature).

Real benchmark on our YouTube IS-clips

We tested the deployed pipeline on 7 real Icelandic clips from RÚV (news, interviews, podcasts) with manually verified ground truth, in both ideal (16kHz PCM) and telephony-simulated (8kHz μ-law) modes:

Clip	Type	Ideal WER	Telephony WER
clip01_news	News read-speech	14.5%	17.1%
clip02_news	News read-speech	17.3%	28.8%
clip04_interview	Interview, conversational	24.6%	39.1%
clip05_podcast	Podcast, two speakers	16.4%	18.2%
clip06_podcast	Podcast, clear single speaker	6.0%	12.0%
clip07_challenging	Marked "challenging" — heavy accent + bg noise	33.8%	28.7%
Mean (excluding gt-mismatch outlier)		~18.8%	~24.0%

Why the public demo will look worse than your production deployment

The Carl-test demo at demo.muninlabs.io uses Twilio (US) or 46elks (Sweden) PSTN bridges to receive your call. Both introduce ~5–13 percentage points of additional WER vs Tier 2 (Genesys AudioHook). What you see there is the worst case for our stack.

Production deployments via Genesys AudioHook receive 16kHz wide-band audio directly, with channel-split for caller and agent. Same models, dramatically better quality.

Methodology

Diagnostic script: scripts/diagnose_full_pipeline.py (open-source, in our repo).
Models: faster-whisper-large-v3 INT8 + per-language Nordic finetune (NB, IS).
WER computation: word-level Levenshtein distance, lowercase + punctuation-insensitive.
Hardware: NVIDIA A10G (g5.xlarge eu-west-1).
Pipeline: STT → halluc-filter → capitalize → PII-redact → analyze → translate → summarize.

For full benchmark methodology and per-clip diagnostic reports, see docs/benchmarks/ground-truth-report.md in our repository. Public summary doc docs/WER_BY_ENVIRONMENT.md contains training-data provenance and per-language breakdowns.