Reference Guide · April 2026
Grok xAI Supported Audio Formats for Transcription: Complete Guide
By ScribeForge · April 25, 2026 · 6 min read
ScribeForge uses the xAI Grok STT API for browser-based transcription. Upload any supported audio file and get a transcript in seconds — no API key required.
Try it free →
Grok, the AI model developed by xAI, accepts a range of audio formats for transcription. Knowing which file types are accepted, and their practical limits, helps you avoid upload errors and get more predictable results. If you'd rather skip the API setup, you can upload supported formats directly at scribeforge.tech (100 MB per file, 2 free transcripts per day, no account).
As of 2026, Grok's transcription API accepts the following audio file formats:
| Format | Type | Common use case |
| MP3 | Lossy | Podcasts, music, most recorded audio |
| WAV | Uncompressed | Studio recordings, high-quality speech capture |
| FLAC | Lossless | Archival audio, maximum accuracy |
| M4A / AAC | Lossy | iOS voice memos, Apple devices |
| OGG | Lossy | Web and game audio pipelines |
| WEBM | Lossy | Browser MediaRecorder output |
Note: Format support may evolve as xAI updates Grok. Always verify against the official xAI documentation before production deployment.
Want to test these formats without writing code? ScribeForge accepts all six (MP3, WAV, FLAC, M4A, OGG, WEBM) for free in your browser — drag the file and get text in 10 seconds.
- MP3: Most common lossy format; widely compatible, with small files at typical speech bitrates.
- WAV: Uncompressed audio; usually the safest choice for studio or high-quality recordings.
- FLAC: Lossless compression; a good balance between file size and quality for archival audio.
- M4A / AAC: Lossy AAC audio, usually in an M4A container; common output from iOS devices and voice memos.
- OGG: Open lossy format (typically Vorbis or Opus audio); frequently used in web and game audio pipelines.
- WEBM: Browser-native container (typically Opus audio); direct output from the browser MediaRecorder API.
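A quick pre-upload check can catch unsupported files before they hit an upload error. The sketch below is one way to do it, assuming the six extensions listed above are the accepted set; `is_supported_format` is a hypothetical helper name, and it only inspects the file extension, not the actual codec inside the file.

```bash
#!/usr/bin/env bash
# Return 0 if the file extension is one of the six formats listed in this guide.
# Assumption: extension matching is good enough; the codec is not inspected.
is_supported_format() {
  local file="$1"
  local ext="${file##*.}"
  # A name without a dot has no extension to check.
  if [ "$ext" = "$file" ]; then
    return 1
  fi
  # Lowercase the extension so TALK.MP3 and talk.mp3 are treated alike.
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    mp3|wav|flac|m4a|ogg|webm) return 0 ;;
    *) return 1 ;;
  esac
}

for f in talk.MP3 notes.wma; do
  if is_supported_format "$f"; then
    echo "$f: ok"
  else
    echo "$f: convert first"
  fi
done
```

Raw `.aac` files are left out on purpose because the guide names M4A as the supported container; add the extension to the `case` pattern if your testing shows it is accepted.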
File Size and Duration Limits
Beyond format, Grok enforces practical upload constraints:
- Max file size on the xAI Grok STT API: 500 MB per request.
- Max file size on ScribeForge (free in-browser tool): 100 MB per upload — split larger recordings with ffmpeg before uploading.
- Max duration: roughly 90-100 minutes per file is practical with MP3-class compression (at 128 kbps, 100 MB holds about 104 minutes of audio); longer files should still be split into segments.
- Sample rate: 16 kHz mono is the recommended minimum for reliable speech transcription; xAI accepts 8 kHz to 48 kHz, including 8 kHz G.711 µ-law/A-law audio from telephony pipelines.
- Bit depth: 16-bit PCM recommended for WAV/FLAC files.
File too large? Split with FFmpeg: ffmpeg -i long_audio.mp3 -f segment -segment_time 3600 -c copy part%03d.mp3 — this splits into 60-minute segments without re-encoding.
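A 60-minute segment fits comfortably under 100 MB at 128 kbps, but the same duration at 320 kbps would be about 144 MB. The sketch below derives a `-segment_time` value from a size cap and a bitrate. It assumes constant-bitrate audio and 1 MB = 1,000,000 bytes; `segment_seconds` is a hypothetical helper (not part of FFmpeg), and the 10% safety margin is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Longest segment, in whole seconds, that stays under a size cap for
# constant-bitrate audio, keeping a 10% margin for container overhead.
#   $1 = size cap in MB (1 MB = 1,000,000 bytes)
#   $2 = audio bitrate in kbps
segment_seconds() {
  local cap_mb="$1" kbps="$2"
  # 1 MB is 8000 kilobits; the 90/100 factor applies the safety margin.
  echo $(( cap_mb * 8000 * 90 / (kbps * 100) ))
}

secs="$(segment_seconds 100 128)"
# Print the command instead of running it, so the value can be reviewed first.
echo "ffmpeg -i long_audio.mp3 -f segment -segment_time ${secs} -c copy part%03d.mp3"
```

For variable-bitrate files the result is only an estimate, so it is worth checking the size of the first segment before uploading the rest.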
Which Format Gives the Best Transcription Accuracy?
For production transcription pipelines, format choice impacts quality in this rough order:
- WAV (PCM uncompressed): Zero quality loss; best results on noisy or low-volume audio.
- FLAC: Lossless compression; virtually identical accuracy to WAV with ~40-60% smaller files.
- MP3 at 128 kbps+: Minimal accuracy loss for clear speech; fine for podcasts, interviews, meetings.
- M4A / AAC at 128 kbps+: Comparable to MP3; good choice for mobile-recorded audio.
- OGG / WEBM: Acceptable for web pipelines; minor quality trade-off vs lossless.
For most use cases (meetings, interviews, podcasts), MP3 at 128 kbps or higher is sufficient. The quality difference vs WAV is negligible for clear speech in a quiet environment. Reserve WAV/FLAC for recordings with background noise or low speaking volume where every bit of quality matters.
How to Convert Audio to a Supported Format
If your file is in an unsupported format (e.g. AMR, WMA, AIFF), convert it before uploading:
FFmpeg (CLI — recommended)
```bash
# Convert any format to 16 kHz mono WAV (optimal for STT)
ffmpeg -i input.wma -ar 16000 -ac 1 output.wav

# Convert to MP3 at 128 kbps
ffmpeg -i input.aiff -b:a 128k output.mp3

# Extract audio from video
ffmpeg -i video.mp4 -vn -ar 16000 -ac 1 audio.wav
```
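Converting to WAV removes compression, so the output can be much larger than the input. The sketch below estimates the size of uncompressed PCM audio from its duration, sample rate, channel count, and bit depth; `pcm_bytes` is a hypothetical helper, and the small WAV header is ignored.

```bash
#!/usr/bin/env bash
# Approximate size in bytes of uncompressed PCM audio (WAV header ignored).
#   $1 = duration in seconds
#   $2 = sample rate in Hz
#   $3 = number of channels
#   $4 = bit depth
pcm_bytes() {
  echo $(( $1 * $2 * $3 * $4 / 8 ))
}

# One hour of 16 kHz mono 16-bit audio, as produced by the commands above.
pcm_bytes 3600 16000 1 16
# One hour of 44.1 kHz stereo 16-bit audio, for comparison.
pcm_bytes 3600 44100 2 16
```

By this estimate, an hour of 16 kHz mono WAV is about 115 MB, which is already over a 100 MB upload cap, and an hour of 44.1 kHz stereo WAV is about 635 MB. For long recordings, FLAC or MP3 output, or splitting into segments, keeps files within the limits described earlier.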
Audacity (GUI)
Open the file → set the project sample rate to 16000 Hz (the Project Rate box at the bottom-left in older versions; under Audio Setup → Audio Settings in recent ones) → File → Export → choose WAV or MP3.
Online tools
CloudConvert and Zamzar support most conversions without software installation — useful for one-off files where installing FFmpeg isn't practical.
Grok Audio Transcription vs Other AI Tools
Grok's format support is comparable to OpenAI Whisper (MP3, MP4, WAV, WEBM, M4A, OGG, FLAC) and Google Speech-to-Text. The key differentiators are inference speed and context window size for post-processing.
| Tool | Supported formats | Max size |
| Grok (xAI) | MP3, WAV, FLAC, M4A, OGG, WEBM | 500 MB API / 100 MB on ScribeForge |
| OpenAI Whisper | MP3, MP4, WAV, WEBM, M4A, OGG, FLAC | 25 MB (hosted API) |
| Google STT | FLAC, WAV, MP3, OGG, WEBM, AMR | Varies by tier |
Tools like ScribeForge abstract format handling entirely — upload any supported file and the platform routes it to the Grok STT engine automatically, with no API key, no account, and no conversion step required on your end for standard formats.
Frequently Asked Questions
- Does Grok support MP4 video files for audio transcription?
- Grok's transcription API is primarily designed for audio containers. MP4 video files may be accepted if they contain an audio stream, but it is recommended to extract the audio track first (e.g. with FFmpeg: ffmpeg -i video.mp4 -vn -ar 16000 audio.wav) to ensure compatibility.
- What is the maximum audio file size Grok accepts?
- The xAI Grok Speech-to-Text API accepts files up to 500 MB per request on the batch endpoint. ScribeForge, the no-code browser interface to Grok STT, currently caps uploads at 100 MB per file with 2 free transcripts per day; for larger recordings, split with ffmpeg before uploading.
- Is FLAC better than MP3 for Grok transcription?
- FLAC preserves the original audio quality without loss, which can improve transcription accuracy on low-volume or noisy recordings. For clear speech at 128 kbps+, the practical difference between FLAC and MP3 in transcription output is minimal.
- Can I use Grok transcription with browser-recorded WEBM files?
- Yes. WEBM (with Opus audio codec) is supported and is the native output of the browser MediaRecorder API, making it convenient for web-based voice recording applications without any conversion step. If you've recorded WEBM in a browser, you can transcribe it directly at scribeforge.tech — no conversion required.
- Does sample rate affect Grok transcription quality?
- Yes. A minimum sample rate of 16 kHz mono is recommended. Audio recorded at 8 kHz (telephony) may show reduced accuracy. Recording above 16 kHz generally does not further improve results and increases file size.
- Where can I transcribe these audio formats for free?
- ScribeForge (scribeforge.tech) accepts MP3, WAV, FLAC, M4A, OGG, and WEBM up to 100 MB in your browser, with 2 free transcripts per day and no account required. Paid credits start at $9 for 50 transcriptions if you need more volume.
Use Grok STT without an API key — upload up to 100 MB, get timestamps in your browser, and unlock the full transcript only if you need it.