What audio formats does Grok support?

Grok STT supports WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV, plus PCM, µ-law, and A-law telephony formats.

What is the file size limit for Grok transcription?

The xAI API supports files up to 500 MB per batch request. ScribeForge currently accepts uploads up to 100 MB in the browser.

Quick answer · Updated April 28, 2026

Can Grok (xAI) Transcribe Audio?

Q: Does Grok have STT (speech-to-text)?

Yes. Grok STT is xAI's speech-to-text API, available via POST /v1/stt for batch transcription and a WebSocket endpoint for streaming.

Q: How much does Grok audio transcription cost?

The xAI API charges $0.10 per hour for batch transcription and $0.20 per hour for streaming. ScribeForge offers a 200-character preview plus paid plans starting at $9 for 50 transcripts.

A direct answer with supported formats, languages, limits, and a free in-browser tool — no code, no account.

Yes. xAI launched the Grok Speech-to-Text (STT) API on April 18, 2026. It transcribes audio in 25+ languages with word-level timestamps and accepts 12 audio formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV, plus PCM/µ-law/A-law) up to 500 MB per request on the xAI API.

ScribeForge runs Grok STT in the browser with a 100 MB upload limit, no code, and no account. Speaker labels may appear on recordings with clear separation, but they are not reliable on every file. Pricing on the xAI API itself is $0.10 per hour for batch and $0.20 per hour for streaming.

Two ways to use it: (1) call POST https://api.x.ai/v1/stt directly with your xAI API key, or (2) drop your file into ScribeForge — a no-code browser wrapper around the same API (100 MB upload cap, 200-character preview free, no signup).

Yes — but how does Grok transcribe audio?

Grok is xAI's family of AI models. On April 18, 2026, xAI launched the Grok Speech-to-Text API at POST https://api.x.ai/v1/stt (batch) and wss://api.x.ai/v1/stt (real-time streaming). It is built on the same audio stack that powers Grok Voice, Tesla in-car voice and Starlink customer support. You send audio via multipart form upload and get back a JSON object with the transcript, word-level timestamps, detected language, and sometimes speaker labels depending on the recording.

This is a separate API from the chat-style Grok used in the xAI chat interface — and it is not reachable from a Grok chat prompt directly. You either call the STT API yourself, or use a service that wraps the API. ScribeForge is one such service: it handles authentication, multipart upload, response parsing, and presents the transcript in your browser. Read the technical breakdown of Grok AI audio transcription for how the model works internally.
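To make the response shape concrete, here is a minimal sketch of turning word-level timestamps into readable lines. The field names `words`, `word`, `start`, `end`, and `speaker` are assumptions made for illustration (only `text` appears in this article's own snippet), and the sample dictionary is hand-written, not real API output. Check the xAI API reference for the actual schema.

```python
def format_words(response: dict) -> list[str]:
    """Render one line per word as '[start-end] (speaker) word'."""
    lines = []
    # "words" is a hypothetical field name for the word-level timestamps.
    for w in response.get("words", []):
        prefix = f"[{w['start']:.2f}-{w['end']:.2f}]"
        # Speaker labels are optional, so only add them when present.
        speaker = w.get("speaker")
        if speaker is not None:
            prefix += f" ({speaker})"
        lines.append(f"{prefix} {w['word']}")
    return lines


# Hand-written sample in the assumed shape, not a real API response.
sample = {
    "text": "hello world",
    "language": "en",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.42, "speaker": "A"},
        {"word": "world", "start": 0.5, "end": 0.9},
    ],
}

for line in format_words(sample):
    print(line)
```

Reading the speaker label with `.get()` reflects the point above that speaker labels only appear on some recordings.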

Which audio formats does Grok (xAI) support for transcription?

The Grok STT API officially accepts 12 audio formats: 9 container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) plus 3 raw formats (PCM Linear16 8–48 kHz, G.711 µ-law, G.711 A-law) for telephony. Here is the practical reference for the formats users actually encounter:

Format	Extension	Best for
MP3	.mp3	Podcasts, voice notes, general purpose. Use 128 kbps+.
WAV	.wav	Legal, medical, interviews — best accuracy (lossless).
M4A	.m4a	iPhone Voice Memos. Native support, no conversion needed.
FLAC	.flac	Archive-quality recordings (lossless compressed).
OGG / Opus	.ogg, .opus	WhatsApp voice notes, Android recorders, web apps.
MKV	.mkv	Matroska container — Grok extracts the audio track.
AAC	.aac	Apple ecosystem, streaming.
MP4	.mp4	Video — Grok extracts the audio track. Useful for Zoom/Teams.
PCM / µ-law / A-law	raw	Telephony pipelines (8 kHz call recording, G.711).

For a deeper dive into each format, with accuracy tips per codec, read the full Grok STT supported audio formats guide.
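If you want to check a file before uploading, a small extension test covers the container formats in the table. This is only a sketch: the set below is copied from the table's Extension column, the helper name `is_supported` is made up for illustration, and a matching extension does not guarantee the file's contents are valid. The raw telephony formats have no fixed extension, so they are left out.

```python
from pathlib import Path

# Extensions taken from the format table above (raw PCM/G.711 excluded).
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".opus", ".mkv", ".aac", ".mp4",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("memo.M4A"))   # case-insensitive match
print(is_supported("notes.wma"))  # not in the table
```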

How many languages does Grok speech-to-text support?

Grok STT covers 25+ languages with automatic detection — you don't need to specify the language, though you can pass a hint via the language parameter. Tested-good languages include English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Russian, Arabic, Hindi, Japanese, Korean, Mandarin, Vietnamese, Indonesian, Swahili, Hebrew, Thai, and more.

Performance is highest on the languages with the most training data (English, Spanish, French, German, Mandarin, Japanese, Portuguese, Italian) but remains usable across the long tail. The API also handles seamless mid-stream language switching for multilingual audio.
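As a rough illustration of the optional language hint, the helper below builds the form fields for a request and only includes `language` when you pass one. The `model`, `format`, and `language` field names come from the Python snippet later in this article; the helper name `build_form_fields` is hypothetical.

```python
from typing import Optional


def build_form_fields(language: Optional[str] = None) -> dict[str, str]:
    """Build the form fields, adding a language hint only if given."""
    fields = {"model": "grok-stt", "format": "json"}
    if language:
        # Omit the field entirely to rely on automatic detection.
        fields["language"] = language
    return fields


print(build_form_fields())       # auto-detect: no "language" key
print(build_form_fields("es"))   # explicit hint for Spanish
```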

How accurate is Grok STT?

xAI has published a 5.0% error rate on its phone-call entity recognition benchmark. That is useful context, but it should not be read as a guarantee for meetings, interviews, voice notes, or noisy real-world recordings.

Provider	Phone-call entity error rate
Grok STT (xAI)	5.0%
ElevenLabs	12.0%
Deepgram	13.5%
AssemblyAI	21.3%

In practice, transcript quality depends on microphone quality, overlap, background noise, accents, and compression. Treat the benchmark as directional, then validate on your own files.

For rough real-world expectations across audio quality buckets:

Clean studio audio: ~5% WER (excellent — matches human transcription)
Conversational meeting audio: ~9% WER (good — minor cleanup needed)
Heavy accent / dialect: ~12% WER (acceptable for notes, less so for legal)
Phone-quality compressed audio: ~15% WER (workable for keyword searching)
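To validate on your own files, you can compute WER yourself. WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference, divided by the number of words in the reference. The sketch below uses a standard word-level edit distance; it does no text normalization (casing, punctuation), which a real evaluation would usually add first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))
```

Transcribe a few minutes of representative audio, correct the output by hand to get a reference, and compare the two with this function.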

Full side-by-side benchmarks vs Whisper large-v3 and Deepgram Nova-2 are in Grok STT vs Whisper vs Deepgram in 2026.

File size, duration and pricing

Max file size on the xAI API: 500 MB per request (batch endpoint).
Max file size on ScribeForge (free in-browser tool): 100 MB per upload — split larger recordings with ffmpeg.
Pricing on the xAI API: $0.10 per hour for batch processing, $0.20 per hour for real-time streaming.
Pricing on ScribeForge: 200-character preview free; $9 one-time for 50 full transcripts (credits never expire); $19/month for unlimited transcripts, subject to a cap of 200 per day.
Sample rate: 8 kHz to 48 kHz supported. 16 kHz mono is optimal for speech and reduces upload size significantly.
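The numbers above are easy to turn into a pre-upload check. The sketch below makes three assumptions that are not confirmed by the source: cost scales linearly with duration (no rounding or minimum charge), "MB" means 1,000,000 bytes (the more conservative reading), and ffmpeg is installed if you run the generated command. The `-ac` and `-ar` options are standard ffmpeg flags for channel count and sample rate.

```python
XAI_BATCH_USD_PER_HOUR = 0.10
XAI_STREAMING_USD_PER_HOUR = 0.20
XAI_MAX_BYTES = 500 * 1_000_000          # assumes 1 MB = 10^6 bytes
SCRIBEFORGE_MAX_BYTES = 100 * 1_000_000  # same assumption


def estimate_cost_usd(duration_seconds: float, streaming: bool = False) -> float:
    """Estimate xAI API cost, assuming simple per-hour proration."""
    rate = XAI_STREAMING_USD_PER_HOUR if streaming else XAI_BATCH_USD_PER_HOUR
    return duration_seconds / 3600 * rate


def fits_limit(size_bytes: int, limit_bytes: int) -> bool:
    """Return True if a file of this size is within the given limit."""
    return size_bytes <= limit_bytes


def ffmpeg_16k_mono_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that downmixes to mono at 16 kHz."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", dst]


print(round(estimate_cost_usd(90 * 60), 2))               # 90-minute batch job
print(fits_limit(250 * 1_000_000, SCRIBEFORGE_MAX_BYTES)) # 250 MB in browser
print(" ".join(ffmpeg_16k_mono_args("in.mp4", "out.wav")))
```

To actually run the conversion, pass the argument list to `subprocess.run(..., check=True)`.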

How to transcribe audio with Grok right now (no code)

Open the ScribeForge homepage in any browser, mobile or desktop.
Drop your audio file onto the upload area, or tap to browse from your phone.
Click Transcribe. ScribeForge sends the file to xAI's Grok STT and returns the transcript in 5–15 seconds.
Copy the transcript or download it as a .txt file. Free users see a 200-character preview; paid plans get the full transcript with timestamps, and speaker labels when the recording allows clear separation.

No account, no credit card, no email needed for the preview. The audio file is deleted from our servers immediately after the transcription completes.

Use Grok STT in your browser right now — no API key, instant preview, timestamps included.

Using the xAI Grok STT API directly (developers)

If you'd rather hit the xAI API yourself, the endpoint is POST https://api.x.ai/v1/stt with multipart form upload. Minimal Python:

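import os

import requests

# Read the key from the environment rather than hard-coding it.
XAI_API_KEY = os.environ["XAI_API_KEY"]

# Send the file as a multipart form upload to the batch endpoint.
with open("audio.mp3", "rb") as f:
    r = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {XAI_API_KEY}"},
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"model": "grok-stt", "format": "json", "language": "en"},
        timeout=120,
    )

r.raise_for_status()  # Surface HTTP errors before parsing the body.
print(r.json()["text"])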

For working production code with timestamps, error handling, retry logic, and cost estimation, see the xAI Grok STT API developer guide.
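For a sense of what retry logic can look like, here is a generic sketch using exponential backoff. It is not taken from the xAI developer guide: the helper name `with_retries`, the retry count, and the delays are all illustrative choices, and which errors are actually worth retrying for this API is an assumption you should verify. The request is passed in as a callable so the helper does not depend on any HTTP library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    send: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            # In real use, catch only transient errors (for example
            # requests.RequestException) instead of every exception.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

You would wrap the upload in a small function that opens the file, posts it, calls `raise_for_status()`, and returns the parsed JSON, then pass that function as `send`. Reopening the file inside the function matters, because a retried request needs to read it from the start.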

Common variations of this question

Does Grok have STT (speech-to-text)?

Yes. Grok STT is the grok-stt model, accessible at POST /v1/stt on the xAI API (and a streaming variant at wss://api.x.ai/v1/stt). It is a dedicated speech recognition API — not the conversational Grok chat model.

Does Grok AI support .m4a file uploads for audio transcription?

Yes. M4A is the default format for iPhone Voice Memos and is one of the 12 audio formats Grok STT accepts natively — no conversion to MP3 required. Drop the file directly.

Does Grok AI support audio transcription and translation?

Grok STT transcribes the source language faithfully and detects the language automatically. For translation to a different language, run the transcript through a chat model (Grok, Claude, GPT) afterward — the STT endpoint itself returns same-language text.

Does Grok AI support audio transcription in 2026?

Yes. The Grok Speech-to-Text API launched on April 18, 2026. As of this writing, it is generally available via the xAI API (batch and WebSocket streaming) and via ScribeForge in the browser.

How to transcribe audio in Grok / xAI?

Two paths: (1) call POST /v1/stt with your xAI API key (see developer guide), or (2) use ScribeForge to do it in the browser without writing code. Both go through the same grok-stt model.

What is the best Grok STT format for accuracy?

WAV at 16 kHz mono for highest accuracy on important content (legal, medical, interviews). MP3 at 128 kbps for everything else: it is roughly 10× smaller than CD-quality WAV (44.1 kHz stereo, 16-bit) and about half the size of 16 kHz mono WAV, with negligible accuracy loss for speech.
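The size difference follows from simple bitrate arithmetic: uncompressed PCM bitrate is sample rate × channels × bits per sample. The sketch below assumes 16-bit samples and ignores the small WAV header, so treat the ratios as approximate.

```python
MP3_KBPS = 128


def pcm_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int = 16) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * channels * bits_per_sample / 1000


speech_wav = pcm_kbps(16_000, 1)  # 16 kHz mono, 16-bit
cd_wav = pcm_kbps(44_100, 2)      # 44.1 kHz stereo, 16-bit

print(f"16 kHz mono WAV:  {speech_wav:.0f} kbps, {speech_wav / MP3_KBPS:.1f}x the MP3")
print(f"CD-quality WAV:   {cd_wav:.0f} kbps, {cd_wav / MP3_KBPS:.1f}x the MP3")
```

So the order-of-magnitude saving applies when the source is CD-quality WAV; against a 16 kHz mono WAV, a 128 kbps MP3 is about half the size.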

How much does Grok audio transcription cost?

The xAI Grok STT API costs $0.10 per hour of audio for batch processing and $0.20 per hour for real-time streaming. ScribeForge's $9 / 50-credit plan works out to ~$0.18 per transcription regardless of length, with no per-minute metering up to the 100 MB browser cap.
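One way to compare the two pricing models is to work out the audio length at which a flat per-transcript price equals the metered API price. The sketch below uses only the figures quoted above and compares raw usage cost; it ignores development time and assumes the API bills in simple proportion to duration.

```python
def break_even_hours(price_per_transcript: float, api_usd_per_hour: float) -> float:
    """Audio length at which a flat price equals the metered API cost."""
    return price_per_transcript / api_usd_per_hour


per_transcript = 9 / 50  # $9 for 50 transcripts

print(f"Per transcript: ${per_transcript:.2f}")
print(f"Break-even vs batch:     {break_even_hours(per_transcript, 0.10):.1f} hours")
print(f"Break-even vs streaming: {break_even_hours(per_transcript, 0.20):.1f} hours")
```

Under these assumptions, recordings shorter than the break-even length cost less per file on the metered API, and longer ones cost less on the flat plan, provided they fit under the 100 MB browser cap.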