xAI released its Grok STT API on April 17, 2026 — one day ago. We immediately ran it against the two dominant speech-to-text APIs to see if it's worth switching. Here's what we found.
We compared:
api.x.ai/v1. Uses xAI's proprietary architecture, not a Whisper fork.
We ran each model on 6 audio clips covering different conditions:
We measured Word Error Rate (WER) — lower is better. We also timed median response latency for a 60-second clip.
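For readers who want to reproduce the metric, here is a minimal sketch of WER as a word-level edit distance divided by the number of reference words. The normalization (lowercasing and whitespace splitting only) is an assumption for illustration, not necessarily the scoring pipeline behind our numbers, and the `wer` function name is ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {score:.1%}")
```

Punctuation and number formatting can move WER by several points, so any comparison should apply the same normalization to every model's output.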
verbose_json response format for Grok and Whisper to get segment data.
| Test clip | Grok STT | Whisper large-v3 | Deepgram Nova-2 |
|---|---|---|---|
| Studio podcast | 2.1% | 2.8% | 3.4% |
| Remote meeting | 5.3% | 6.1% | 5.8% |
| Phone call (8kHz) | 8.2% | 9.4% | 7.1% |
| Accented English | 7.8% | 9.2% | 11.3% |
| Multilingual (ES→EN) | 4.4% | 5.1% | 14.2%* |
| Noisy background | 11.6% | 10.9% | 13.7% |
| Average WER | 6.6% | 7.3% | 9.3% |
* Deepgram Nova-2 does not natively handle mid-clip language switching. The Spanish segment was transcribed as mangled English phonetic approximations.
Grok STT leads overall, especially on accented speech and multilingual content. Whisper large-v3 comes close and actually edges ahead on noisy environments. Deepgram Nova-2 wins on phone-quality audio but falls behind on anything multilingual.
| Model | Median latency (60s clip) | Streaming? |
|---|---|---|
| Grok STT | 3.8s | No (batch) |
| Whisper large-v3 (API) | 5.2s | No |
| Deepgram Nova-2 | 0.9s | Yes |
Deepgram is in a different league for latency because it is designed for real-time streaming use cases. If you need live captions or sub-second response, it is the only one of the three that fits.
Grok STT is batch-only (as of April 2026) but returned in under 4 seconds for a 60-second clip in our tests, which is fast enough for most asynchronous workflows.
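If you want to take a similar measurement yourself, a rough sketch of a median-latency harness follows. The `median_latency` helper and the `fake_transcribe` stand-in are hypothetical names for illustration; swap the stand-in for whatever function uploads your clip and waits for the transcript.

```python
import statistics
import time
from typing import Callable


def median_latency(transcribe: Callable[[], object], runs: int = 5) -> float:
    """Return the median wall-clock time in seconds over `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe()  # blocks until the (batch) response is back
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_transcribe() -> str:
    """Stand-in for a real API call (assumption: replace with your client code)."""
    time.sleep(0.01)
    return "transcript"


print(f"median latency: {median_latency(fake_transcribe):.3f}s")
```

The median is less sensitive than the mean to a single slow request, which matters when network conditions vary between runs.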
| Model | Price | Minimum | Free tier |
|---|---|---|---|
| Grok STT (xAI) | ~$0.013/transcription | None | Via ScribeForge (2/day) |
| Whisper (OpenAI API) | $0.006/minute | None | No |
| Deepgram Nova-2 | $0.0043/minute | None | $200 credit |
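Whisper and Deepgram are priced per minute of audio, so a 30-second clip costs half as much as a 60-second one. Grok STT pricing is per request, so a short clip costs the same as a long one. At the list prices above, per-minute pricing is cheaper for short clips: a request needs to carry roughly 2 minutes 10 seconds of audio before Grok matches Whisper ($0.013 ÷ $0.006/min ≈ 2.2 min), and about 3 minutes before it matches Deepgram ($0.013 ÷ $0.0043/min ≈ 3.0 min). For longer files, Grok becomes competitive and then cheaper.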
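To see where flat per-request pricing overtakes per-minute pricing, here is a small sketch using the list prices from the table. It assumes billing is linear in audio length with no rounding to a billing increment, and it treats the approximate Grok price (~$0.013) as exact; the function names are ours.

```python
# Prices taken from the pricing table above.
GROK_PER_REQUEST = 0.013       # approximate flat price per transcription
WHISPER_PER_MINUTE = 0.006
DEEPGRAM_PER_MINUTE = 0.0043


def per_minute_cost(seconds: float, rate_per_minute: float) -> float:
    """Cost of a clip under per-minute pricing (assumes no rounding of duration)."""
    return seconds / 60 * rate_per_minute


def break_even_seconds(flat_price: float, rate_per_minute: float) -> float:
    """Clip length at which per-minute pricing costs the same as the flat price."""
    return flat_price / rate_per_minute * 60


for seconds in (30, 60, 180, 600):
    print(
        f"{seconds:>4}s  "
        f"Grok ${GROK_PER_REQUEST:.4f}  "
        f"Whisper ${per_minute_cost(seconds, WHISPER_PER_MINUTE):.4f}  "
        f"Deepgram ${per_minute_cost(seconds, DEEPGRAM_PER_MINUTE):.4f}"
    )

print(f"Grok vs Whisper break-even:  {break_even_seconds(GROK_PER_REQUEST, WHISPER_PER_MINUTE):.0f}s")
print(f"Grok vs Deepgram break-even: {break_even_seconds(GROK_PER_REQUEST, DEEPGRAM_PER_MINUTE):.0f}s")
```

If your clips cluster around a typical length, plugging that length into `per_minute_cost` gives a quick read on which pricing model suits your workload.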
Try Grok STT right now — no account, no API key, no credit card. 2 free transcriptions per day.