When a recording has more than one person talking, a plain transcript is often not enough. You need to know who said what — and at what point in the conversation. That's where speaker diarization comes in. This guide explains how the technology works, why timestamps matter, and how to use ScribeForge to get a fully labelled transcript in seconds.
Speaker diarization is the process of automatically partitioning an audio stream into segments according to who is speaking. The term comes from "diary": the system creates a timeline of which speaker is active at each point in the recording. The output assigns each segment a speaker identifier (Speaker 1, Speaker 2, and so on) along with a start and end timestamp.
Without diarization, a transcript of a two-person interview looks like a wall of text. With diarization, you get a structured conversation that reads almost like a script — each speaker's words are clearly separated, making it far easier to quote, summarize, or review specific parts of the discussion.
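As a rough illustration of the structure described above, a diarized transcript can be modelled as a list of segments, each carrying a speaker identifier, a start time, an end time, and the spoken text. The sketch below is a minimal example under that assumption. The `Segment` class, its field names, and the sample text are hypothetical and are not taken from ScribeForge's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Field names are illustrative assumptions, not a documented schema.
    speaker: str   # e.g. "Speaker 1"
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_script(segments: list[Segment]) -> str:
    """Turn diarized segments into a script-like transcript, one line per segment."""
    return "\n".join(
        f"[{format_timestamp(s.start)} - {format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    demo = [
        Segment("Speaker 1", 0.0, 4.2, "Thanks for joining today."),
        Segment("Speaker 2", 4.2, 7.9, "Happy to be here."),
    ]
    print(render_script(demo))
```

Keeping the timing fields numeric and formatting them only at display time makes it easy to reuse the same segments for other outputs later.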
Here is an example of what labelled output looks like:
Compare that to a flat transcript where all four lines run together — finding that 20% figure later requires reading the whole thing. With speaker labels and timestamps, you can jump straight to the moment you care about.
Timestamps solve several practical problems that labels alone do not. First, they let you navigate the original recording. If you need to hear the exact tone of a quote or verify context, a timestamp tells you exactly where to scrub to in the media player. Second, they serve as a reference if you need to correlate the transcript with a video recording of the same meeting, because you can align the two by time code.
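Both uses can be sketched in a few lines: looking up the moment a phrase was spoken, and shifting an audio timestamp onto a video timeline. This is a minimal sketch that assumes segments are plain dictionaries with `speaker`, `start`, `end`, and `text` keys (hypothetical names), and that the video and audio differ only by a constant, known offset.

```python
def find_mentions(segments, keyword):
    """Return every segment whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg["text"].lower()]


def to_video_time(audio_seconds, video_offset_seconds):
    """Map an audio timestamp to the video timeline, assuming a constant offset.

    video_offset_seconds is how far into the video the audio recording begins.
    """
    return audio_seconds + video_offset_seconds


def mmss(seconds):
    """Format seconds as MM:SS for scrubbing in a media player."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    segments = [
        {"speaker": "Speaker 1", "start": 75.0, "end": 80.0, "text": "The budget is final."},
        {"speaker": "Speaker 2", "start": 80.0, "end": 83.0, "text": "Understood."},
    ]
    for hit in find_mentions(segments, "budget"):
        print(hit["speaker"], "at", mmss(hit["start"]),
              "in the audio,", mmss(to_video_time(hit["start"], 12.5)), "in the video")
```

A constant offset is the simplest case. Recordings that were paused or edited separately would need more than one alignment point.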
Third, timestamps are essential for generating subtitles or closed captions. Subtitle formats like SRT and VTT are built around time ranges: each caption block has a start time and an end time. A transcript with phrase-level timestamps is the foundation for producing those files. Fourth, in research and journalism, timestamps function as citations. When you quote someone, you can point to the exact second in the source recording, which adds accountability and allows editors or fact-checkers to verify the quote independently.
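To make the subtitle case concrete, here is a minimal sketch of turning timed segments into SRT text. An SRT file is a sequence of blocks, each with a running index, a `start --> end` time range in `HH:MM:SS,mmm` form, and the caption text. The input shape used here, a list of `(start, end, speaker, text)` tuples, is an assumption made for the example and not a ScribeForge export format.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start, end, speaker, text) tuples.

    Each caption block is: index, time range, caption text, blank line.
    """
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{speaker}: {text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    print(to_srt([
        (0.0, 2.5, "Speaker 1", "Welcome back."),
        (2.5, 5.0, "Speaker 2", "Good to be here."),
    ]))
```

VTT follows the same idea but starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.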
ScribeForge outputs segment-level timestamps: a start and end time for each labelled block. If you use the xAI Grok API directly, you can also get word-level timestamps down to the individual token. For most practical purposes, segment-level timestamps are sufficient and much easier to read.
ScribeForge is powered by the xAI Grok Speech-to-Text API (model: scribe_v2). When you upload an audio file, the service sends it to Grok STT, which processes the audio and returns a structured response that includes the full transcript text, a list of timed segments, and speaker identifiers for each segment. ScribeForge then groups those segments into labelled blocks and displays them in a readable, copy-ready format.
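The grouping step can be pictured as merging consecutive segments from the same speaker into one block. The sketch below shows one plausible way to do that. It is an assumption about the general technique, not ScribeForge's actual implementation, and the dictionary keys (`speaker`, `start`, `end`, `text`) are hypothetical rather than the documented response schema.

```python
def group_by_speaker(segments):
    """Merge runs of consecutive segments that share a speaker into single blocks.

    Each block keeps the start of its first segment, the end of its last
    segment, and the concatenated text.
    """
    blocks = []
    for seg in segments:
        if blocks and blocks[-1]["speaker"] == seg["speaker"]:
            last = blocks[-1]
            last["end"] = seg["end"]
            last["text"] += " " + seg["text"]
        else:
            blocks.append(dict(seg))  # copy so the input is not mutated
    return blocks


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    raw = [
        {"speaker": "Speaker 1", "start": 0.0, "end": 2.0, "text": "Hi"},
        {"speaker": "Speaker 1", "start": 2.0, "end": 4.0, "text": "there."},
        {"speaker": "Speaker 2", "start": 4.0, "end": 6.0, "text": "Hello."},
    ]
    for block in group_by_speaker(raw):
        print(block)
```

Merging only adjacent segments preserves the order of the conversation, so a speaker who talks again later gets a new block instead of being folded into an earlier one.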
The process takes between 10 and 30 seconds for most files, depending on length. A 30-minute interview typically completes in under a minute.
1. Drag and drop or click "Upload audio" on the ScribeForge homepage. Files up to 25 MB are accepted, which is roughly 25 minutes of audio at standard quality.
2. The file is sent to xAI's scribe_v2 model. It transcribes the speech and identifies speaker changes, returning timestamped segments with speaker labels.
3. The output is displayed with speaker labels and timestamps. Copy the full text to clipboard or download it as a plain text file. No account or watermarks required.
ScribeForge accepts 10 audio and video container formats:
This covers virtually every recording scenario: Zoom and Google Meet exports (MP4, M4A), podcast production files (WAV, FLAC), iPhone voice memos (M4A), Android recordings (OGG, OPUS), and browser-based recordings (WEBM). The maximum file size is 25 MB. If your recording is longer than roughly 25 minutes, consider trimming the file in Audacity or exporting it at a lower bitrate before uploading.
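The relationship between file size, bitrate, and duration is simple arithmetic, which helps when choosing an export bitrate for a longer recording. The sketch below assumes 1 MB means 1,000,000 bytes and ignores container overhead, so treat the results as estimates. The function names are illustrative.

```python
MAX_BYTES = 25 * 1000 * 1000  # 25 MB limit, assuming 1 MB = 10^6 bytes


def max_minutes(bitrate_kbps: float, max_bytes: int = MAX_BYTES) -> float:
    """Approximate minutes of audio that fit in max_bytes at a given bitrate."""
    total_bits = max_bytes * 8
    seconds = total_bits / (bitrate_kbps * 1000)
    return seconds / 60


def max_bitrate_kbps(duration_minutes: float, max_bytes: int = MAX_BYTES) -> float:
    """Approximate highest bitrate (kbps) that keeps a recording under max_bytes."""
    total_bits = max_bytes * 8
    return total_bits / (duration_minutes * 60) / 1000


if __name__ == "__main__":
    print(f"128 kbps fits about {max_minutes(128):.1f} minutes")
    print(f"A 60-minute file needs about {max_bitrate_kbps(60):.1f} kbps or lower")
```

At 128 kbps the limit works out to about 26 minutes, which matches the rough 25-minute figure. A 60-minute recording would need to be exported at around 55 kbps or lower to fit.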
Grok STT automatically detects the language of the audio — you do not need to specify it before uploading. ScribeForge supports 25+ languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Swedish, Japanese, Mandarin, Korean, Arabic, Hindi, Turkish, and more. The detected language is shown alongside the transcript output.
Auto-detection handles recordings that switch languages mid-conversation. If you have a multilingual interview where the participants shift between English and Spanish, Grok STT handles the transition without requiring any manual language tagging. If you want to nudge the model toward a specific language on ambiguous or low-quality audio, you can type the language in the optional language field before uploading.
Speaker diarization accuracy depends heavily on audio quality and recording conditions. A few practical steps that significantly improve output quality:
On a well-recorded two-person interview with minimal background noise, diarization accuracy is consistently high. For panel recordings with five or more speakers, expect occasional mis-attribution when voices are acoustically similar.
ScribeForge offers 2 free transcriptions per day per IP address — no account required. The free tier includes full speaker labels and timestamps, not a reduced version. This is enough for occasional use: transcribing a single interview, converting a recorded meeting, or testing the accuracy on your specific audio conditions before committing to a plan.
For regular use, there are two paid options. The $9 credit pack gives you 50 transcriptions with no expiry — a good fit for journalists, researchers, or podcasters who transcribe a moderate volume of recordings. The $19/month plan is unlimited and makes sense for teams or anyone processing audio daily.
Ready to transcribe? Upload your file and get speaker labels and timestamps in under 30 seconds.