Quick answer · Updated April 28, 2026

Can Grok (xAI) Transcribe Audio?

A direct answer with supported formats, languages, accuracy, and a free in-browser tool — no code, no account.

Yes. xAI launched the Grok Speech-to-Text (STT) API on April 18, 2026. It transcribes audio in 25+ languages with word-level timestamps, speaker diarization and Inverse Text Normalization. It accepts 12 audio formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV, plus PCM/µ-law/A-law) up to 500 MB per request on the xAI API.

On xAI's official phone-call entity recognition benchmark, Grok STT reports a 5.0% error rate — versus ElevenLabs at 12.0%, Deepgram at 13.5% and AssemblyAI at 21.3%. Pricing is $0.10/hour batch, $0.20/hour streaming.

Two ways to use it: (1) call POST https://api.x.ai/v1/stt directly with your xAI API key, or (2) drop your file into ScribeForge — a no-code browser wrapper around the same API (25 MB upload cap, 200-character preview free, no signup).

Yes — but how does Grok transcribe audio?

Grok is xAI's family of AI models. On April 18, 2026, xAI launched the Grok Speech-to-Text API at POST https://api.x.ai/v1/stt (batch) and wss://api.x.ai/v1/stt (real-time streaming). It is built on the same audio stack that powers Grok Voice, Tesla in-car voice and Starlink customer support. You send audio via multipart form upload and get back a JSON object with the transcript, word-level timestamps, detected language, and per-speaker diarization.

This is a separate API from the chat-style Grok used in the xAI chat interface — and it is not reachable from a Grok chat prompt directly. You either call the STT API yourself, or use a service that wraps the API. ScribeForge is one such service: it handles authentication, multipart upload, response parsing, and presents the transcript in your browser. Read the technical breakdown of Grok AI audio transcription for how the model works internally.

Which audio formats does Grok (xAI) support for transcription?

The Grok STT API officially accepts 12 audio formats: 9 container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) plus 3 raw formats (PCM Linear16 8–48 kHz, G.711 µ-law, G.711 A-law) for telephony. Here is the practical reference for the formats users actually encounter:

FormatExtensionBest for
MP3.mp3Podcasts, voice notes, general purpose. Use 128 kbps+.
WAV.wavLegal, medical, interviews — best accuracy (lossless).
M4A.m4aiPhone Voice Memos. Native support, no conversion needed.
FLAC.flacArchive-quality recordings (lossless compressed).
OGG / Opus.ogg, .opusWhatsApp voice notes, Android recorders, web apps.
MKV.mkvMatroska container — Grok extracts the audio track.
AAC.aacApple ecosystem, streaming.
MP4.mp4Video — Grok extracts the audio track. Useful for Zoom/Teams.
PCM / µ-law / A-lawrawTelephony pipelines (8 kHz call recording, G.711).

For the longest-form deep dive on each format and accuracy tips per codec, read the full Grok STT supported audio formats guide.

How many languages does Grok speech-to-text support?

Grok STT covers 25+ languages with automatic detection — you don't need to specify the language, though you can pass a hint via the language parameter. Tested-good languages include English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Russian, Arabic, Hindi, Japanese, Korean, Mandarin, Vietnamese, Indonesian, Swahili, Hebrew, Thai, and more.

Performance is highest on the languages with the most training data (English, Spanish, French, German, Mandarin, Japanese, Portuguese, Italian) but remains usable across the long tail. The API also handles seamless mid-stream language switching for multilingual audio.

How accurate is Grok STT?

On xAI's official benchmark — phone-call entity recognition (names, account numbers, dates) — Grok STT reports a 5.0% error rate, the lowest among published competitors:

ProviderPhone-call entity error rate
Grok STT (xAI)5.0%
ElevenLabs12.0%
Deepgram13.5%
AssemblyAI21.3%

For real-world expectations on Word Error Rate across audio quality buckets:

Full side-by-side benchmarks vs Whisper large-v3 and Deepgram Nova-2 are in Grok STT vs Whisper vs Deepgram in 2026.

File size, duration and pricing

How to transcribe audio with Grok right now (no code)

  1. Open the ScribeForge homepage in any browser, mobile or desktop.
  2. Drop your audio file onto the upload area, or tap to browse from your phone.
  3. Click Transcribe. ScribeForge sends the file to xAI's Grok STT and returns the transcript in 5–15 seconds.
  4. Copy the transcript or download it as a .txt file. Free users see a 200-character preview; paid plans get the full transcript with timestamps and speaker labels.

No account, no credit card, no email needed for the preview. The audio file is deleted from our servers immediately after the transcription completes.

Try Grok audio transcription in your browser — instant preview.

Transcribe audio free →

No account  ·  No credit card  ·  2 free uses/day per IP

Using the xAI Grok STT API directly (developers)

If you'd rather hit the xAI API yourself, the endpoint is POST https://api.x.ai/v1/stt with multipart form upload. Minimal Python:

import requests

with open("audio.mp3", "rb") as f:
    r = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {XAI_API_KEY}"},
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"model": "grok-stt", "format": "json", "language": "en"},
        timeout=120,
    )

print(r.json()["text"])

For working production code with timestamps, error handling, retry logic, and cost estimation, see the xAI Grok STT & TTS API developer guide.

Common variations of this question

Does Grok have STT (speech-to-text)?

Yes. Grok STT is the grok-stt model, accessible at POST /v1/stt on the xAI API (and a streaming variant at wss://api.x.ai/v1/stt). It is a dedicated speech recognition API — not the conversational Grok chat model.

Does Grok AI support .m4a file uploads for audio transcription?

Yes. M4A is the default format for iPhone Voice Memos and is one of the 12 audio formats Grok STT accepts natively — no conversion to MP3 required. Drop the file directly.

Does Grok AI support audio transcription and translation?

Grok STT transcribes the source language faithfully and detects the language automatically. For translation to a different language, run the transcript through a chat model (Grok, Claude, GPT) afterward — the STT endpoint itself returns same-language text.

Does Grok AI support audio transcription in 2026?

Yes — the Grok Speech-to-Text API launched on April 18, 2026. As of this article, it is generally available via the xAI API (batch + WebSocket streaming) and via ScribeForge in the browser.

How to transcribe audio in Grok / xAI?

Two paths: (1) call POST /v1/stt with your xAI API key (see developer guide), or (2) use ScribeForge to do it in the browser without writing code. Both go through the same grok-stt model.

What is the best Grok STT format for accuracy?

WAV at 16 kHz mono for highest accuracy on important content (legal, medical, interviews). MP3 at 128 kbps for everything else — it's 10× smaller with negligible accuracy loss for speech.

How much does Grok audio transcription cost?

The xAI Grok STT API costs $0.10 per hour of audio for batch processing and $0.20 per hour for real-time streaming. ScribeForge's $9 / 50-credit plan works out to ~$0.18 per transcription regardless of length, with no per-minute metering up to the 25 MB browser cap.

Related reading

Stop wondering — try Grok STT in 10 seconds.

Transcribe audio free →

Powered by the xAI Grok Speech-to-Text API  ·  No account  ·  No credit card