xAI released its Grok STT API on April 17, 2026 — the first speech-to-text endpoint in the xAI ecosystem. It's designed to be OpenAI-compatible, which means you can plug it into any existing Whisper integration by changing two lines: the api_key and the base_url. This guide covers both STT and TTS end-to-end, with working Python code for every example.
Install the OpenAI Python client — xAI uses the same interface:
```bash
pip install openai
```
Create the client pointing at xAI's endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="xai-your-api-key",  # from console.x.ai
    base_url="https://api.x.ai/v1",
)
```
Get your API key from console.x.ai. Store it as an environment variable — never hardcode it.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url=os.getenv("XAI_BASE_URL", "https://api.x.ai/v1"),
)
```
The STT endpoint follows the OpenAI audio.transcriptions.create interface exactly. Supported formats: MP3, WAV, FLAC, M4A, OGG, OPUS, WEBM, AAC, MP4 (audio extracted). Maximum file size: 25MB.
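Since oversized or unsupported files fail only after the upload, it can pay to check them locally first. A minimal sketch based on the limits above (the helper name and error messages are our own, not part of the API):

```python
from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024  # 25MB limit
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus", ".webm", ".aac", ".mp4"}

def validate_audio(path: str) -> None:
    """Raise ValueError if the file has an unsupported extension or exceeds the size cap."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"file too large: {p.stat().st_size} bytes (max {MAX_BYTES})")
```

Calling this before `client.audio.transcriptions.create` turns a slow server-side rejection into an instant local one.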
```python
with open("interview.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="json",  # or "verbose_json" for timestamps
    )

print(result.text)
```
Use verbose_json to get segment-level timestamps:
```python
with open("meeting.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="grok-stt",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(result.text)  # full transcript
for segment in result.segments:
    print(f"{segment.start:.1f}s → {segment.end:.1f}s: {segment.text}")
```
Grok STT auto-detects the language, but you can speed up processing by specifying it:
```python
result = client.audio.transcriptions.create(
    model="grok-stt",
    file=audio_file,
    language="it",  # ISO 639-1 language code
)
```
Fields returned with response_format="json":

| Field | Type | Description |
|---|---|---|
| text | str | Full transcript text |
Fields returned with response_format="verbose_json":

| Field | Type | Description |
|---|---|---|
| text | str | Full transcript |
| language | str | Detected language code |
| duration | float | Audio duration in seconds |
| segments | list | List of segment objects |
| segments[].start | float | Segment start time (seconds) |
| segments[].end | float | Segment end time (seconds) |
| segments[].text | str | Segment transcript text |
| segments[].speaker | str \| None | Speaker label (if diarization available) |
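When speaker labels are present, consecutive segments from the same speaker can be folded into conversational turns. A sketch over plain dicts shaped like the table above (remember diarization is not guaranteed, so a missing label falls back to "unknown"):

```python
def group_by_speaker(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments with the same speaker label into one turn."""
    turns: list[dict] = []
    for seg in segments:
        speaker = seg.get("speaker") or "unknown"
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker is still talking: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # New speaker: start a fresh turn
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return turns
```

Feed it `[{"start": s.start, "end": s.end, "text": s.text, "speaker": s.speaker} for s in result.segments]` from a verbose_json response.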
The TTS endpoint follows audio.speech.create. Supported voices: eve, ara, rex, sal, leo. Output format: MP3.
```python
response = client.audio.speech.create(
    model="grok-tts",
    voice="eve",
    input="Hello! This is ScribeForge demonstrating Grok TTS.",
    speed=1.0,  # 0.25 to 4.0
)

# Save to file
with open("output.mp3", "wb") as f:
    f.write(response.read())
```
For long texts, stream the response to avoid loading the entire MP3 into memory:
```python
response = client.audio.speech.create(
    model="grok-tts",
    voice="leo",
    input=long_article_text,
    speed=1.0,
)

with open("article.mp3", "wb") as f:
    for chunk in response.iter_bytes(chunk_size=4096):
        f.write(chunk)
```
Serving generated speech from your own API is a thin wrapper, shown here as a FastAPI endpoint:

```python
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

@app.post("/speak")
async def speak(text: str, voice: str = "eve"):
    response = client.audio.speech.create(
        model="grok-tts",
        voice=voice,
        input=text,
    )
    audio_bytes = response.read()
    return Response(content=audio_bytes, media_type="audio/mpeg")
```
Grok STT is new — some verbose_json features may not be fully supported yet. Implement a graceful fallback:
```python
import io

from openai import APIError

def transcribe_with_fallback(audio_file_path: str) -> dict:
    with open(audio_file_path, "rb") as f:
        audio_bytes = f.read()

    # Try verbose_json first (timestamps + speakers)
    try:
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=("audio.mp3", io.BytesIO(audio_bytes), "audio/mpeg"),
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
        segments = [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in (result.segments or [])
        ]
        return {"text": result.text, "segments": segments}
    except APIError:
        # Fall back to plain json (text only)
        result = client.audio.transcriptions.create(
            model="grok-stt",
            file=("audio.mp3", io.BytesIO(audio_bytes), "audio/mpeg"),
            response_format="json",
        )
        return {"text": result.text, "segments": []}
```
| API | Price (approx.) | Notes |
|---|---|---|
| Grok STT | ~$0.013/request | Per transcription call, not per minute |
| Grok TTS | ~$0.015/1,000 chars | Per character of input text |
At ~$0.013 per STT call, processing 1,000 recordings costs ~$13. A SaaS charging $9 for 50 transcriptions pays roughly $0.65 in API costs per customer, a ~14× margin that leaves comfortable room for a bootstrapped product.
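The arithmetic above generalizes into a back-of-the-envelope estimator. The rates below are the approximate figures quoted in this guide, not an official price sheet:

```python
STT_PER_CALL = 0.013   # ~$ per transcription request (not per minute)
TTS_PER_KCHAR = 0.015  # ~$ per 1,000 characters of TTS input

def estimate_cost(stt_calls: int = 0, tts_chars: int = 0) -> float:
    """Rough API cost in USD for a given usage mix."""
    return stt_calls * STT_PER_CALL + tts_chars / 1000 * TTS_PER_KCHAR

# 1,000 recordings plus 200k characters of synthesized speech
print(round(estimate_cost(stt_calls=1000, tts_chars=200_000), 2))  # → 16.0
```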
main.py is a FastAPI app wrapping the xAI audio APIs with Stripe payments and a free tier. The code patterns in this guide are lifted directly from it.
Try the xAI Grok STT and TTS APIs in your browser — no API key needed. Free tier, no account.
Try ScribeForge →