xAI released its Grok STT API on April 17, 2026 — the first speech-to-text endpoint in the xAI ecosystem. It's designed to be OpenAI-compatible, which means you can plug it into any existing Whisper integration by changing two lines: the api_key and the base_url. This guide covers both STT and TTS end-to-end, with working Python code for every example.
Install the OpenAI Python client — xAI uses the same interface:
```bash
pip install openai
```
Create the client pointing at xAI's endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="xai-your-api-key",  # from console.x.ai
    base_url="https://api.x.ai/v1",
)
```
Get your API key from console.x.ai. Store it as an environment variable — never hardcode it.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url=os.getenv("XAI_BASE_URL", "https://api.x.ai/v1"),
)
```
The STT endpoint follows the OpenAI audio.transcriptions.create interface exactly. Supported formats: MP3, WAV, FLAC, M4A, OGG, OPUS, WEBM, AAC, MP4 (audio extracted). Maximum file size: 25MB.
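Since oversized or unsupported files fail only after the upload, it can pay to check them locally first. A minimal sketch based on the limits above (the helper name and error messages are our own, not part of the API):

```python
from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024  # 25MB limit
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus", ".webm", ".aac", ".mp4"}

def validate_audio(path: str) -> None:
    """Raise ValueError if the file has an unsupported extension or exceeds the size cap."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"file too large: {p.stat().st_size} bytes (max {MAX_BYTES})")
```

Calling this before `client.audio.transcriptions.create` turns a slow server-side rejection into an instant local one.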
```python
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="json",  # or "verbose_json" for timestamps
    )

print(result.text)
```
Use verbose_json to get segment-level timestamps:
```python
with open("meeting.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(result.text)  # full transcript
for segment in result.segments:
    print(f"{segment.start:.1f}s → {segment.end:.1f}s: {segment.text}")
```
Grok STT auto-detects the language, but you can speed up processing by specifying it:
```python
result = client.audio.transcriptions.create(
    model="grok-stt",
    file=audio_file,
    language="it",  # ISO 639-1 language code
)
```
Fields returned with response_format="json":

| Field | Type | Description |
|---|---|---|
| text | str | Full transcript text |
Fields returned with response_format="verbose_json":

| Field | Type | Description |
|---|---|---|
| text | str | Full transcript |
| language | str | Detected language code |
| duration | float | Audio duration in seconds |
| segments | list | List of segment objects |
| segments[].start | float | Segment start time (seconds) |
| segments[].end | float | Segment end time (seconds) |
| segments[].text | str | Segment transcript text |
| segments[].speaker | str \| None | Speaker label (if diarization available) |
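When speaker labels are present, consecutive segments from the same speaker can be folded into conversational turns. A sketch over plain dicts shaped like the table above (remember diarization is not guaranteed, so a missing label falls back to "unknown"):

```python
def group_by_speaker(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments with the same speaker label into one turn."""
    turns: list[dict] = []
    for seg in segments:
        speaker = seg.get("speaker") or "unknown"
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker is still talking: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # New speaker: start a fresh turn
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return turns
```

Feed it `[{"start": s.start, "end": s.end, "text": s.text, "speaker": s.speaker} for s in result.segments]` from a verbose_json response.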
The TTS endpoint follows audio.speech.create. Supported voices: eve, ara, rex, sal, leo. Output format: MP3.
```python
response = client.audio.speech.create(
    model="grok-tts",
    voice="eve",
    input="Hello! This is ScribeForge demonstrating Grok TTS.",
    speed=1.0,  # 0.25 to 4.0
)

# Save to file
with open("output.mp3", "wb") as f:
    f.write(response.read())
```
For long texts, stream the response to avoid loading the entire MP3 into memory:
```python
response = client.audio.speech.create(
    model="grok-tts",
    voice="leo",
    input=long_article_text,
    speed=1.0,
)

with open("article.mp3", "wb") as f:
    for chunk in response.iter_bytes(chunk_size=4096):
        f.write(chunk)
```
Serving generated speech from your own API is a thin wrapper, shown here as a FastAPI endpoint:

```python
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

@app.post("/speak")
async def speak(text: str, voice: str = "eve"):
    response = client.audio.speech.create(
        model="grok-tts",
        voice=voice,
        input=text,
    )
    audio_bytes = response.read()
    return Response(content=audio_bytes, media_type="audio/mpeg")
```
Grok STT is new — some verbose_json features may not be fully supported yet. Implement a graceful fallback:
```python
import io

from openai import APIError

def transcribe_with_fallback(audio_file_path: str) -> dict:
    with open(audio_file_path, "rb") as f:
        audio_bytes = f.read()

    # Try verbose_json first (timestamps + speakers)
    try:
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=("audio.mp3", io.BytesIO(audio_bytes), "audio/mpeg"),
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
        segments = [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in (result.segments or [])
        ]
        return {"text": result.text, "segments": segments}
    except APIError:
        # Fall back to plain json (text only)
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=("audio.mp3", io.BytesIO(audio_bytes), "audio/mpeg"),
            response_format="json",
        )
        return {"text": result.text, "segments": []}
```
| API | Price (approx.) | Notes |
|---|---|---|
| Grok STT | ~$0.013/request | Per transcription call, not per minute |
| Grok TTS | ~$0.015/1,000 chars | Per character of input text |
At ~$0.013 per STT call, processing 1,000 recordings costs ~$13. A SaaS charging $9 for 50 transcriptions pays roughly $0.65 in API costs per customer, a ~14× margin that leaves comfortable room for a bootstrapped product.
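The arithmetic above generalizes into a back-of-the-envelope estimator. The rates below are the approximate figures quoted in this guide, not an official price sheet:

```python
STT_PER_CALL = 0.013   # ~$ per transcription request (not per minute)
TTS_PER_KCHAR = 0.015  # ~$ per 1,000 characters of TTS input

def estimate_cost(stt_calls: int = 0, tts_chars: int = 0) -> float:
    """Rough API cost in USD for a given usage mix."""
    return stt_calls * STT_PER_CALL + tts_chars / 1000 * TTS_PER_KCHAR

# 1,000 recordings plus 200k characters of synthesized speech
print(round(estimate_cost(stt_calls=1000, tts_chars=200_000), 2))  # → 16.0
```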
main.py is a FastAPI app wrapping the xAI audio APIs with Stripe payments and a free tier. The code patterns in this guide are lifted directly from it.
Try the xAI Grok STT and TTS APIs in your browser — no API key needed. Free tier, no account.
Try ScribeForge →