Whisper: OpenAI's Speech Recognition Model
Whisper is OpenAI's open-source automatic speech recognition model, supporting 99 languages with robust accuracy across accents, background noise, and technical content.
Specifications
At a glance
Parameters: 39M (tiny) to 1.55B (large-v3)
Languages supported: 99
Release date: September 2022 (large-v3: November 2023)
Licence: MIT (open source)
Architecture: encoder-decoder Transformer
Pricing: free self-hosted / $0.006 per minute via OpenAI API
Overview
About Whisper
Whisper is OpenAI's automatic speech recognition (ASR) model that has become the de facto standard for AI transcription. Trained on 680,000 hours of multilingual audio data, it delivers robust transcription accuracy across a wide range of conditions including accents, background noise, and technical vocabulary. Available in multiple sizes from 39M (tiny) to 1.55B (large-v3) parameters, Whisper can be deployed on everything from edge devices to cloud servers.

The large-v3 model approaches human-level accuracy for many languages and significantly outperforms previous open-source ASR models. Whisper's MIT licence has made it the foundation of countless transcription products, podcast tools, meeting recorders, and accessibility applications. The model also supports translation, automatically converting speech in any supported language to English text.
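Getting started is straightforward. A minimal sketch, assuming the open-source `openai-whisper` package is installed (`pip install openai-whisper`); the file path and model name below are placeholders:

```python
def transcribe_file(path: str, model_name: str = "base", translate: bool = False) -> str:
    """Transcribe an audio file; translate=True outputs English text instead."""
    import whisper  # lazy import so the sketch reads without the package installed

    model = whisper.load_model(model_name)      # "tiny", "base", ..., "large-v3"
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)  # source language is auto-detected
    return result["text"]
```

For example, `transcribe_file("meeting.mp3", model_name="large-v3")` downloads the checkpoint on first use and returns the transcript as a string.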
Strengths
Capabilities
- 99-language speech recognition and transcription
- Robust to accents, background noise, and technical jargon
- Multiple model sizes from 39M to 1.55B parameters
- Automatic language detection
- Speech-to-English translation for all supported languages
- Timestamp generation for subtitle creation
- MIT licence enabling unrestricted commercial use
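The timestamped segments map directly onto subtitle formats. A short sketch converting Whisper-style segment dicts (each with `start`, `end`, and `text` keys, the shape the open-source package returns) into SRT text:

```python
def _ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {start, end, text} segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{_ts(seg['start'])} --> {_ts(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Feeding in `result["segments"]` from a transcription call yields a file ready to load alongside a video.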
Considerations
Limitations
- Real-time transcription requires capable hardware for larger models
- Accuracy drops for low-resource languages
- No speaker diarisation (identifying who said what) built in
- Can hallucinate repeated phrases on silent or unclear audio
- No streaming support in the base model architecture
Best For
Ideal use cases
- Meeting transcription and note-taking applications
- Podcast and video subtitle generation
- Multilingual customer support transcription
- Accessibility tools for hearing-impaired users
- Voice-to-text input for applications and workflows
Pricing
Free under MIT licence for self-hosting. OpenAI API: $0.006/minute. Various cloud providers offer hosted Whisper at competitive rates.
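For the hosted option, a hedged sketch using the official `openai` Python SDK (`pip install openai`), which reads the `OPENAI_API_KEY` environment variable:

```python
def transcribe_via_api(path: str) -> str:
    """Send an audio file to OpenAI's hosted Whisper endpoint, return the text."""
    from openai import OpenAI  # lazy import; requires OPENAI_API_KEY to be set

    client = OpenAI()
    with open(path, "rb") as audio:
        response = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return response.text
```

At $0.006/minute, a one-hour meeting costs about $0.36 to transcribe.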
FAQ
Frequently asked questions
How accurate is Whisper?
Whisper large-v3 achieves word error rates under 5% for English, approaching human-level accuracy. Performance varies by language, accent, and audio quality. For well-recorded English speech, accuracy is typically 95-98%.
Can Whisper transcribe in real time?
The smaller models (tiny, base) can transcribe in near-real-time on modern hardware, while the large model is slower than real-time on most consumer GPUs. Community projects like faster-whisper use optimisations to achieve real-time performance with larger models.
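A sketch using faster-whisper (`pip install faster-whisper`); the model name and `int8` quantisation setting are example choices, not requirements:

```python
def transcribe_fast(path: str, model_name: str = "large-v3"):
    """Transcribe with faster-whisper; returns (segments, detected language)."""
    from faster_whisper import WhisperModel  # lazy import of the community package

    model = WhisperModel(model_name, compute_type="int8")  # quantised for speed
    segments, info = model.transcribe(path)                # segments is a generator
    return [(s.start, s.end, s.text) for s in segments], info.language
```

The `int8` compute type trades a small amount of accuracy for substantially lower memory use and faster inference.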
Does Whisper identify who is speaking?
No. Whisper does not include speaker diarisation. For multi-speaker transcription, combine Whisper with a separate diarisation model like pyannote-audio.
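The merge step of such a pipeline can be sketched in plain Python: label each Whisper segment with the diarisation turn it overlaps most. The segment and turn shapes below are illustrative (pyannote-audio's actual output objects differ):

```python
def assign_speakers(segments, turns):
    """Label each {start, end, text} segment with the speaker whose
    (start, end, speaker) turn overlaps it the most."""
    labelled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for start, end, speaker in turns:
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append({**seg, "speaker": best})
    return labelled
```

Maximum-overlap assignment is a simple heuristic; segments that straddle a speaker change get the dominant speaker's label.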
Which model size should I use?
For English transcription, medium offers the best accuracy/speed balance. For multilingual work, use large-v3. For edge or mobile deployment, use tiny or base. For maximum accuracy regardless of speed, use large-v3.
How does Whisper compare to Google Speech-to-Text?
The two are competitive. Whisper offers free self-hosting, 99-language support, and no per-minute costs. Google offers streaming, speaker diarisation, and better real-time performance out of the box.
Need help with Whisper?
Our team can help you evaluate and implement the right AI tools. Book a free strategy call.