Transcription · April 2026

How to Transcribe Audio with Speaker Labels

By ScribeForge · April 30, 2026 · 6 min read

When a recording has more than one person talking, a plain transcript is often not enough. You need to know who said what — and at what point in the conversation. That's where speaker diarization comes in. This guide explains how the technology works, why timestamps matter, and how to use ScribeForge to get a fully labelled transcript in seconds.

Upload your audio file and get a transcript with speaker labels — free, no account needed.


What is speaker diarization?

Speaker diarization is the process of automatically partitioning an audio stream into segments according to who is speaking. The term comes from "diary": the system creates a timeline of which speaker is active at each point in the recording. The output assigns each segment a speaker identifier (Speaker 1, Speaker 2, and so on) along with start and end timestamps.

Without diarization, a transcript of a two-person interview looks like a wall of text. With diarization, you get a structured conversation that reads almost like a script — each speaker's words are clearly separated, making it far easier to quote, summarize, or review specific parts of the discussion.

Here is an example of what labelled output looks like:

Speaker 1 0:00 Thanks for joining us today. Can you walk us through what your company does?

Speaker 2 0:07 Sure. We help mid-sized logistics firms reduce delivery delays using real-time route optimization.

Speaker 1 0:16 And what's the typical outcome for a new customer in the first three months?

Speaker 2 0:23 Most see a 15 to 20 percent reduction in late deliveries within the first quarter.

Compare that to a flat transcript where all four lines run together: finding the "15 to 20 percent" figure later means rereading the whole thing. With speaker labels and timestamps, you can jump straight to the moment you care about.
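Because each labelled line follows a predictable shape (speaker, start time, text), a transcript like the one above is also easy to search with a few lines of code. The Python sketch below assumes the plain-text layout shown in the example; the Turn type and the to_seconds, parse_turn, and find_turns helpers are illustrative names, not part of ScribeForge.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Turn(NamedTuple):
    speaker: str
    start_seconds: int
    text: str


# Assumed line layout: "Speaker 2 0:07 Sure. We help ..." (M:SS or H:MM:SS).
LINE_RE = re.compile(r"^(Speaker \d+)\s+((?:\d+:)?\d{1,2}:\d{2})\s+(.*)$")


def to_seconds(stamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_turn(line: str) -> Optional[Turn]:
    """Parse one labelled line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    speaker, stamp, text = match.groups()
    return Turn(speaker, to_seconds(stamp), text)


def find_turns(lines: Iterable[str], keyword: str) -> List[Turn]:
    """Return every parsed turn whose text contains the keyword."""
    parsed = (parse_turn(line) for line in lines)
    needle = keyword.lower()
    return [turn for turn in parsed if turn is not None and needle in turn.text.lower()]
```

Calling find_turns on the four example lines with the keyword "percent" would return only the last Speaker 2 turn, along with its start time of 23 seconds.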

Why timestamps are as important as the labels

Timestamps solve several practical problems that labels alone do not. First, they let you navigate the original recording. If you need to hear the exact tone of a quote or verify context, a timestamp tells you exactly where to scrub the media player. Second, they serve as a reference if you need to correlate the transcript with a video recording of the same meeting — you can align the two by time code.

Third, timestamps are essential for generating subtitles or closed captions. Subtitle formats like SRT and VTT are built around time ranges: each caption block has a start time and an end time. A transcript with phrase-level timestamps is the foundation for producing those files. Fourth, in research and journalism, timestamps function as citations. When you quote someone, you can point to the exact second in the source recording, which adds accountability and allows editors or fact-checkers to verify the quote independently.
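To make the subtitle point concrete, here is a minimal Python sketch that turns timed segments into SRT caption blocks. The segment dictionaries (start, end, speaker, text) are an assumed shape chosen for illustration, not a documented ScribeForge or Grok STT format, and the function names are hypothetical.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Build SRT text: a counter, a time range, the caption, then a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments mirroring the interview example.
segments = [
    {"start": 0.0, "end": 7.0, "speaker": "Speaker 1",
     "text": "Thanks for joining us today."},
    {"start": 7.0, "end": 16.0, "speaker": "Speaker 2",
     "text": "Sure. We help mid-sized logistics firms."},
]
srt_text = to_srt(segments)
```

Prefixing each caption with the speaker label is a design choice; drop it if your captions should show the spoken text only.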

ScribeForge outputs segment-level timestamps (a start and end time for each labelled block). If you use the xAI Grok API directly, you can also get word-level timestamps for each individual word. For most practical purposes, segment-level timestamps are sufficient and much easier to read.

How ScribeForge produces speaker-labelled transcripts

ScribeForge is powered by the xAI Grok Speech-to-Text API (model: scribe_v2). When you upload an audio file, the service sends it to Grok STT, which processes the audio and returns a structured response that includes the full transcript text, a list of timed segments, and speaker identifiers for each segment. ScribeForge then groups those segments into labelled blocks and displays them in a readable, copy-ready format.
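The grouping step can be sketched in a few lines of Python. This is only an illustration of the idea, not ScribeForge's actual implementation: the segment fields (speaker, start, end, text) are an assumed shape rather than the documented Grok STT response schema, and both function names are hypothetical.

```python
from itertools import groupby
from typing import Dict, List


def group_by_speaker(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments from the same speaker into one labelled block."""
    blocks = []
    for speaker, run in groupby(segments, key=lambda seg: seg["speaker"]):
        run = list(run)
        blocks.append({
            "speaker": speaker,
            "start": run[0]["start"],   # block starts with its first segment
            "end": run[-1]["end"],      # and ends with its last one
            "text": " ".join(seg["text"].strip() for seg in run),
        })
    return blocks


def format_block(block: Dict) -> str:
    """Render a block as 'Speaker N M:SS text', like the example output."""
    minutes, seconds = divmod(int(block["start"]), 60)
    return f"{block['speaker']} {minutes}:{seconds:02d} {block['text']}"
```

Because groupby only merges adjacent items, a speaker who talks, is interrupted, and then talks again produces two separate blocks, which keeps the conversation in its original order.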

The process takes between 10 and 30 seconds for most files, depending on length. A 30-minute interview typically completes in under a minute.

Upload your file

Drag and drop or click "Upload audio" on the ScribeForge homepage. Files up to 100 MB are accepted, which is roughly 100 minutes of audio at a bitrate of about 128 kbps.

Grok STT processes the audio

The file is sent to xAI's scribe_v2 model. It transcribes the speech and identifies speaker changes, returning timestamped segments with speaker labels.

Read, copy, or download your transcript

The output is displayed with speaker labels and timestamps. Copy the full text to clipboard or download it as a plain text file — no account or watermarks required.

Supported file formats

ScribeForge accepts 10 audio and video container formats:

MP3
WAV
M4A
FLAC
OGG
OPUS
WEBM
AAC
AIFF
MP4

This covers virtually every recording scenario: Zoom and Google Meet exports (MP4, M4A), podcast production files (WAV, FLAC), iPhone voice memos (M4A), Android recordings (OGG, OPUS), and browser-based recordings (WEBM). The maximum file size is 100 MB. If your recording is larger than that, consider trimming the file in Audacity or exporting it at a lower bitrate before uploading.
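If you are unsure which bitrate will fit, the arithmetic is simple enough to script. The Python sketch below assumes that 100 MB means 100,000,000 bytes and ignores container overhead, so treat the results as estimates; the function names are illustrative.

```python
MAX_UPLOAD_BYTES = 100 * 1_000_000  # assumption: 100 MB = 100,000,000 bytes


def estimated_minutes(bitrate_kbps: float, size_bytes: int = MAX_UPLOAD_BYTES) -> float:
    """Approximate minutes of audio that fit in size_bytes at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return size_bytes / bytes_per_second / 60


def max_bitrate_kbps(duration_minutes: float, size_bytes: int = MAX_UPLOAD_BYTES) -> float:
    """Highest constant bitrate (kbps) that keeps a recording under size_bytes."""
    return size_bytes * 8 / (duration_minutes * 60) / 1000
```

Under these assumptions, a 128 kbps file reaches the limit at roughly 104 minutes, and a three-hour recording would need to be exported at about 74 kbps or lower.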

Language support and auto-detection

Grok STT automatically detects the language of the audio — you do not need to specify it before uploading. ScribeForge supports 25+ languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Swedish, Japanese, Mandarin, Korean, Arabic, Hindi, Turkish, and more. The detected language is shown alongside the transcript output.

Auto-detection handles recordings that switch languages mid-conversation. If you have a multilingual interview where the participants shift between English and Spanish, Grok STT handles the transition without requiring any manual language tagging. If you want to nudge the model toward a specific language on ambiguous or low-quality audio, you can type the language in the optional language field before uploading.

Getting the best diarization results

Speaker diarization accuracy depends heavily on audio quality and recording conditions. A few practical steps significantly improve output quality:

Use a dedicated microphone per speaker when possible. Multi-mic setups or lapel microphones give each speaker a distinct audio channel, making speaker separation much easier for the model.
Avoid overlapping speech. When two people talk at the same time, the model has to make a judgment call about attribution. Brief overlaps are handled reasonably well, but long crosstalk reduces accuracy.
Minimize background noise. Air conditioning, street noise, and keyboard clicks all reduce transcription accuracy. Record in a quiet room or apply noise reduction before uploading.
Keep the recording at a sample rate of 16 kHz or higher. Most voice recorders and conference apps output at 16–48 kHz by default, so this is rarely an issue in practice.
Trim silence at the start and end of the file. Very long silent gaps can occasionally confuse segment boundaries. A clean recording that starts promptly produces cleaner output.

On a well-recorded two-person interview with minimal background noise, diarization accuracy is consistently high. For panel recordings with five or more speakers, expect occasional mis-attribution when voices are acoustically similar.
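If you want to confirm the sample rate of a WAV file before uploading, Python's standard wave module can read it from the file header. This is a minimal sketch: it only handles uncompressed WAV files, the 16 kHz threshold is taken from the tip above, and the function names are illustrative.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # threshold from the recording tips above


def wav_info(path: str) -> dict:
    """Read sample rate, channel count, and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": frames / rate,
        }


def meets_sample_rate(path: str) -> bool:
    """True if the file's sample rate is at least MIN_SAMPLE_RATE_HZ."""
    return wav_info(path)["sample_rate_hz"] >= MIN_SAMPLE_RATE_HZ
```

For compressed formats such as MP3 or M4A, the wave module will not work; an audio editor like Audacity shows the same information.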

Pricing: free tier and paid plans

ScribeForge offers 2 free transcriptions per day per IP address — no account required. The free tier includes full speaker labels and timestamps, not a reduced version. This is enough for occasional use: transcribing a single interview, converting a recorded meeting, or testing the accuracy on your specific audio conditions before committing to a plan.

For regular use, there are two paid options. The $9 credit pack gives you 50 transcriptions with no expiry — a good fit for journalists, researchers, or podcasters who transcribe a moderate volume of recordings. The $19/month plan is unlimited and makes sense for teams or anyone processing audio daily.


Common use cases for speaker-labelled transcripts

Journalistic interviews — attribute quotes precisely without having to re-listen to the recording before publication
User research sessions — separate moderator questions from participant responses for faster analysis and tagging
Podcast production — generate a show-notes transcript that clearly shows which co-host said what
Legal and compliance recordings — meeting or call recordings where speaker attribution is a documentation requirement
Academic focus groups — multi-participant discussions where individual responses need to be coded separately
Sales call analysis — review recorded discovery or demo calls with clear separation between rep and prospect

Ready to transcribe? Upload your file and get speaker labels and timestamps, typically in under 30 seconds.
