Developer Guide · April 2026

xAI Grok STT and TTS API: A Practical Developer Guide (2026)

By ScribeForge · April 18, 2026 · 10 min read

xAI released its Grok STT API on April 17, 2026 — the first speech-to-text endpoint in the xAI ecosystem. It's designed to be OpenAI-compatible, which means you can plug it into any existing Whisper integration by changing two lines: the api_key and the base_url. This guide covers both STT and TTS end-to-end, with working Python code for every example.

Contents

  1. Setup and authentication
  2. Grok STT: transcribing audio
  3. Parsing the STT response
  4. Grok TTS: generating speech
  5. Error handling and fallbacks
  6. Pricing and cost estimation
  7. Known limitations

1. Setup and authentication

Install the OpenAI Python client — xAI uses the same interface:

bash
pip install openai

Create the client pointing at xAI's endpoint:

python
from openai import OpenAI

client = OpenAI(
    api_key="xai-your-api-key",   # from console.x.ai
    base_url="https://api.x.ai/v1",
)

Get your API key from console.x.ai. Store it as an environment variable — never hardcode it.

python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url=os.getenv("XAI_BASE_URL", "https://api.x.ai/v1"),
)
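
If XAI_API_KEY is missing, the snippet above fails with a bare KeyError. A friendlier startup check might look like the sketch below. The name load_xai_settings is a hypothetical helper for illustration, not part of the openai package, and it reads only the two environment variables already shown.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.x.ai/v1"


def load_xai_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for OpenAI(...), failing early if the key is missing.

    Hypothetical helper for illustration; not part of the openai package.
    """
    env = os.environ if env is None else env
    api_key = env.get("XAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("XAI_API_KEY is not set; create a key at console.x.ai")
    return {
        "api_key": api_key,
        "base_url": env.get("XAI_BASE_URL", DEFAULT_BASE_URL),
    }
```

With this in place, the client can be built as OpenAI(**load_xai_settings()).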

2. Grok STT: transcribing audio

The STT endpoint follows the OpenAI audio.transcriptions.create interface. Supported formats: MP3, WAV, FLAC, M4A, OGG, OPUS, WEBM, AAC, and MP4 (audio track extracted). Maximum file size: 25 MB.

python
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="json",   # or "verbose_json" for timestamps
    )

print(result.text)
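
The format list and size limit above can be checked locally before spending a request. This is a minimal sketch using only the standard library. check_audio_file is a hypothetical helper, the extension set mirrors the formats listed in this guide, and reading "25MB" as 25 MiB is an assumption.

```python
from pathlib import Path

# Limits as stated in this guide; confirm against the current xAI docs.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus", ".webm", ".aac", ".mp4",
}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25MB" means 25 MiB


def check_audio_file(path: str) -> Path:
    """Raise ValueError if the file is missing, empty, too large, or an unlisted format."""
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"not a file: {p}")
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {p.suffix or '(none)'}")
    size = p.stat().st_size
    if size == 0:
        raise ValueError("file is empty")
    if size > MAX_FILE_BYTES:
        raise ValueError(f"file is {size} bytes; limit is {MAX_FILE_BYTES}")
    return p
```

Calling check_audio_file("interview.mp3") before opening the file turns an avoidable API error into a local ValueError. The extension check is only a heuristic, since it does not inspect the actual audio container.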

Getting timestamps with verbose_json

Use verbose_json to get segment-level timestamps:

python
with open("meeting.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(result.text)  # full transcript

for segment in result.segments:
    print(f"{segment.start:.1f}s → {segment.end:.1f}s: {segment.text}")
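
Segment timestamps are commonly turned into subtitles. The sketch below converts a list of segments into SubRip (SRT) text. It assumes only that each segment exposes start, end and text as in the loop above. srt_timestamp and segments_to_srt are illustrative names, not part of any SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects exposing .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

Writing segments_to_srt(result.segments) to a .srt file gives a subtitle track that most video players can load.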

Language hint

Grok STT auto-detects the language, but you can speed up processing by specifying it:

python
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        language="it",   # ISO 639-1 language code
    )

3. Parsing the STT response

Simple JSON response fields

Field | Type | Description
------|------|------------
text  | str  | Full transcript text

verbose_json additional fields

Field              | Type        | Description
-------------------|-------------|------------
text               | str         | Full transcript
language           | str         | Detected language code
duration           | float       | Audio duration in seconds
segments           | list        | List of segment objects
segments[].start   | float       | Segment start time (seconds)
segments[].end     | float       | Segment end time (seconds)
segments[].text    | str         | Segment transcript text
segments[].speaker | str or None | Speaker label (if diarization available)
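
Because the speaker field is optional, code that reads it should tolerate its absence. The sketch below merges consecutive segments with the same speaker label into turns and falls back to None when no label is present. merge_speaker_turns is a hypothetical helper, and whether the API returns speaker labels at all should be checked against a real response.

```python
def merge_speaker_turns(segments) -> list:
    """Merge consecutive segments that share a speaker label into single turns.

    Segments without a speaker attribute (or with None) are grouped under None.
    """
    turns = []
    for seg in segments:
        speaker = getattr(seg, "speaker", None)  # field may be absent
        text = seg.text.strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1]["end"] = seg.end
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": speaker, "start": seg.start, "end": seg.end, "text": text}
            )
    return turns
```

When no segment carries a label, the result is a single turn covering the whole transcript, so callers can use the same code path either way.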

4. Grok TTS: generating speech

The TTS endpoint follows audio.speech.create. Supported voices: eve, ara, rex, sal, leo. Output format: MP3.

python
response = client.audio.speech.create(
    model="grok-tts",
    voice="eve",
    input="Hello! This is ScribeForge demonstrating Grok TTS.",
    speed=1.0,   # 0.25 to 4.0
)

# Save to file
with open("output.mp3", "wb") as f:
    f.write(response.read())
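
The voice list and speed range above can be validated before a request is sent, which is useful when the values come from user input. A minimal sketch follows. validate_tts_request is a hypothetical helper, and the allowed values are copied from this guide rather than from an API schema.

```python
# Values as listed in this guide; confirm against the current xAI docs.
VOICES = ("eve", "ara", "rex", "sal", "leo")
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def validate_tts_request(text: str, voice: str = "eve", speed: float = 1.0) -> dict:
    """Return keyword arguments for audio.speech.create, or raise ValueError."""
    if not text or not text.strip():
        raise ValueError("input text is empty")
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {', '.join(VOICES)}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed {speed} is outside {MIN_SPEED} to {MAX_SPEED}")
    return {"model": "grok-tts", "voice": voice, "input": text, "speed": speed}
```

The returned dict can be passed straight through, as in client.audio.speech.create(**validate_tts_request(text, voice="rex")).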

Streaming large TTS (chunked)

For long texts, stream the response to avoid loading the entire MP3 into memory:

python
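# with_streaming_response defers reading the body; a plain create() call
# would download the whole MP3 before returning.
with client.audio.speech.with_streaming_response.create(
    model="grok-tts",
    voice="leo",
    input=long_article_text,
    speed=1.0,
) as response:
    with open("article.mp3", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)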

Returning audio from a FastAPI endpoint

python
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

@app.post("/speak")
def speak(text: str, voice: str = "eve"):
    # Plain def, not async def: FastAPI runs it in a worker thread, so the
    # blocking client call does not stall the event loop.
    response = client.audio.speech.create(
        model="grok-tts",
        voice=voice,
        input=text,
    )
    audio_bytes = response.read()
    return Response(content=audio_bytes, media_type="audio/mpeg")

5. Error handling and fallbacks

Grok STT is new — some verbose_json features may not be fully supported yet. Implement a graceful fallback:

python
import io
import mimetypes
import os

from openai import APIError

def transcribe_with_fallback(audio_file_path: str) -> dict:
    with open(audio_file_path, "rb") as f:
        audio_bytes = f.read()

    # Send the real filename and MIME type rather than labelling every upload as MP3
    filename = os.path.basename(audio_file_path)
    mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"

    def upload():
        # Fresh buffer per request: the first attempt may consume the stream
        return (filename, io.BytesIO(audio_bytes), mime_type)

    # Try verbose_json first (segment timestamps)
    try:
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=upload(),
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
        segments = [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in (result.segments or [])
        ]
        return {"text": result.text, "segments": segments}

    except APIError:
        # Fall back to plain json (text only)
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=upload(),
            response_format="json",
        )
        return {"text": result.text, "segments": []}

Note: Always delete temporary files after processing. Never log or persist user audio content. If you build a public-facing tool, implement rate limiting to protect your xAI API budget.
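
The rate limiting mentioned in the note can be sketched as a small in-memory sliding-window limiter. This is an illustration under stated assumptions, not a production design. SlidingWindowLimiter is a hypothetical class, state lives in a single process, and the client identifier (for example an IP address or account id) is whatever your app chooses.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_calls per client within the last `window` seconds (in-memory)."""

    def __init__(self, max_calls: int, window: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self._calls = defaultdict(deque)  # client id -> timestamps of recent calls

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        calls = self._calls[client_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

A request handler would call limiter.allow(client_id) before touching the xAI API and return HTTP 429 when it is False. With several worker processes, each holds its own counters, so a shared store would be needed for a strict global limit.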

6. Pricing and cost estimation

API      | Price (approx.)     | Notes
---------|---------------------|------
Grok STT | ~$0.013/request     | Per transcription call, not per minute
Grok TTS | ~$0.015/1,000 chars | Per character of input text

At ~$0.013 per STT call, processing 1,000 recordings costs about $13. A SaaS plan charging $9 for 50 uses spends about $0.65 on transcription, so revenue is roughly 14× the API cost, which is comfortable for a bootstrapped product.
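
The arithmetic above is easy to keep in code so that estimates update when prices change. The sketch below uses the approximate prices quoted in this guide as constants. The function names are illustrative, and the figures should be treated as placeholders until checked against xAI's current price list.

```python
# Approximate prices quoted in this guide; treat them as placeholders.
STT_PRICE_PER_REQUEST = 0.013
TTS_PRICE_PER_1K_CHARS = 0.015


def stt_cost(requests: int) -> float:
    """Estimated cost in USD for a number of transcription calls."""
    return requests * STT_PRICE_PER_REQUEST


def tts_cost(characters: int) -> float:
    """Estimated cost in USD for a number of input characters."""
    return characters / 1000 * TTS_PRICE_PER_1K_CHARS


def revenue_to_cost_ratio(price: float, uses: int) -> float:
    """Plan price divided by the STT cost of serving `uses` transcriptions."""
    return price / stt_cost(uses)
```

For example, stt_cost(1000) reproduces the $13 figure, and revenue_to_cost_ratio(9, 50) gives the ratio for a $9 plan with 50 transcriptions. The ratio covers API spend only, not hosting or payment fees.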

7. Known limitations (as of April 2026)

Want to see a working production implementation? ScribeForge is open to inspect: the entire main.py is a FastAPI app wrapping the xAI audio APIs with Stripe payments and a free tier. The code patterns in this guide are lifted directly from it.
