OpenAI Realtime API

Voice agent infrastructure · by OpenAI

OpenAI Realtime API Review

Voice agent infrastructure is a crowded space, but OpenAI’s Realtime API is one of the few options that bundles speech-to-text, large language model reasoning, and text-to-speech into a single streaming endpoint. As of 2026, it remains a strong candidate for teams that want to ship a capable voice bot without stitching together multiple vendors—provided they can stomach the pricing and accept some loss of control.

What it actually does

The Realtime API accepts a persistent WebSocket connection. You send raw audio (or use server‑side voice activity detection to capture chunks), and the API returns streaming audio that contains the model’s spoken response. Internally, it runs an end‑to‑end pipeline: an ASR model (OpenAI’s Whisper variant) transcribes speech, a GPT‑4o‑class model generates a reply, and a neural TTS engine (likely a successor to their “alloy” voice) synthesizes the output. The entire round‑trip is designed to stay under 300ms of latency for simple queries. The API handles interruptions, barge‑in, and turn‑taking automatically, though the exact behavior is configurable via session parameters.
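To make the session-parameter idea concrete, here is a minimal sketch of the JSON messages a client might send over the WebSocket. The event types (`session.update`, `input_audio_buffer.append`) and the `turn_detection` field names are assumptions based on the publicly documented event schema and may differ in the API version you target; the helper names and default values are illustrative only. No network call is made.

```python
import base64
import json


def session_update_event(vad_threshold: float = 0.5,
                         silence_ms: int = 500,
                         prefix_padding_ms: int = 300) -> str:
    """Build a JSON message that configures server-side voice activity detection.

    Field names are assumed from the documented schema; verify before use.
    """
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": vad_threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_ms,
            },
        },
    }
    return json.dumps(event)


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a chunk of raw audio bytes as a base64-encoded append message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


if __name__ == "__main__":
    # Example: wait longer before deciding the user has stopped speaking.
    print(session_update_event(silence_ms=700))
    print(audio_append_event(b"\x00\x01" * 4))
```

Keeping message construction separate from the socket code makes it easy to unit-test the payloads and to adjust field names if the schema changes.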

Who it's for

This tool is aimed at teams building production voice agents—customer support bots, receptionist assistants, or voice‑enabled copilots—where the total cost of integration is more important than per‑call cost. It’s a good fit when you want to send audio in, get audio out, and not think about component tuning or infrastructure scaling. If you are a solo developer or a small startup that needs a demo in a week and can accept vendor lock‑in, the Realtime API is a fast path to a working prototype.

What works

Latency for simple turns – For a single user utterance that requires no external API calls, the round‑trip is consistently below 350ms (often around 200ms with the GPT‑4o‑mini variant). This is competitive with custom‑built pipelines using Whisper + small LLM + Piper TTS, but without the operational overhead.
Barge‑in handling – The API correctly stops synthesis when the user speaks again, and it can be configured to either flush the current context or continue. This works more reliably than many open‑source implementations.
Language coverage – Supports dozens of languages out of the box; ASR accuracy in major European and East Asian languages is comparable to dedicated speech providers. The TTS voice quality has improved significantly since 2024, with reduced robotic artifacts and better prosody.
Session management – You can maintain long‑lived conversations (up to 30 minutes per session) with a single connection. Reconnections are handled at the application layer, but the API provides a consistent session ID.
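Because reconnection is left to the application, one way to handle the session cap is to track session age and rotate before the limit is reached. The sketch below assumes the 30-minute cap quoted above; the `SessionClock` class, the 60-second safety margin, and the injectable clock are hypothetical choices for illustration, not part of any OpenAI SDK.

```python
import time
from typing import Callable

SESSION_LIMIT_S = 30 * 60  # per-session cap quoted in this review


class SessionClock:
    """Tracks the age of one session and flags when the app should rotate it."""

    def __init__(self, margin_s: float = 60.0,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._now = now            # injectable so tests can use a fake clock
        self._margin_s = margin_s  # assumed safety margin before the cap
        self._started = now()

    def elapsed(self) -> float:
        """Seconds since the session (or last restart) began."""
        return self._now() - self._started

    def should_rotate(self) -> bool:
        """True once the session is within margin_s of the cap."""
        return self.elapsed() >= SESSION_LIMIT_S - self._margin_s

    def restart(self) -> None:
        """Call after the application has opened a fresh connection."""
        self._started = self._now()
```

The application would check `should_rotate()` between turns, open a new connection, replay whatever context it needs, and then call `restart()`.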

What breaks

High latency for long or complex responses – If the model has to generate a multi‑paragraph reply or integrate retrieved information, end‑to‑end latency can spike to 2–3 seconds. The API does not support streaming TTS before the full response is ready, so the user hears silence.
Voice activity detection (VAD) inconsistencies – The built‑in VAD has variable sensitivity across noise conditions. In a quiet office it works well; in a car or crowded cafe, it often cuts off the beginning of sentences or fails to detect the end of speech, leading to false timeouts.
Cost accumulation at scale – See pricing below. For a bot that handles 10,000 calls a month (~15 minutes average), monthly bills push into low five figures. This is not a commodity pricing model.
Limited customization – You cannot replace the ASR or TTS components. If you need a custom voice, domain‑specific acoustic model, or a language not well‑covered by Whisper, you’re locked out. OpenAI also controls the exact versions of each component; model deprecation without compatibility guarantees has happened.
No offline or edge deployment – This is strictly a cloud API. Any network disruption kills the session. For latency‑sensitive telephony or embedded use cases, the 50–100ms round‑trip overhead of the internet connection adds a floor to your response time.
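Latency figures like the ones in this review are easiest to check against your own traffic. The following sketch computes the gap between end of user speech and the first audio chunk from a timestamped event log. The two event names are assumptions about what the server emits and are passed as parameters so they can be swapped; the log format (a list of `(timestamp_seconds, event_type)` pairs) is a hypothetical convention for this example.

```python
from typing import Iterable, Optional, Tuple

# Assumed event names; substitute whatever your API version actually emits.
SPEECH_END = "input_audio_buffer.speech_stopped"
FIRST_AUDIO = "response.audio.delta"


def time_to_first_audio(events: Iterable[Tuple[float, str]],
                        speech_end: str = SPEECH_END,
                        first_audio: str = FIRST_AUDIO) -> Optional[float]:
    """Seconds from the most recent end-of-speech event to the first audio
    chunk that follows it, or None if no such pair appears in the log."""
    ended_at: Optional[float] = None
    for ts, kind in events:
        if kind == speech_end:
            ended_at = ts
        elif kind == first_audio and ended_at is not None:
            return ts - ended_at
    return None


if __name__ == "__main__":
    log = [
        (0.00, "input_audio_buffer.speech_started"),
        (1.20, SPEECH_END),
        (1.45, FIRST_AUDIO),
        (1.50, FIRST_AUDIO),
    ]
    print(time_to_first_audio(log))
```

Timestamps should be taken on the client with a monotonic clock so that the measurement includes the network round trip.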

Pricing reality

As of mid‑2026, OpenAI bills the Realtime API on a per‑audio‑minute basis for both input and output. The published rates are:

Audio input: $0.06 per minute (USD)
Audio output: $0.24 per minute (USD)
Text tokens (the LLM inference behind the audio) are included in the audio minute price for the default tier, but a higher‑quality tier charges additional per‑token fees.

A typical 10‑minute call with 2 minutes of user speech and 3 minutes of assistant speech incurs roughly:

Input: 2 min × $0.06 = $0.12
Output: 3 min × $0.24 = $0.72
Total per call: $0.84

At 10,000 calls per month, that’s $8,400. Volume discounts start above 1M minutes per month, but they are negotiated individually and are not public. There is no free tier; only a $5 credit for new accounts.
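The arithmetic above can be wrapped in a small calculator for trying other call shapes. This is a sketch that uses only the two per-minute rates quoted in this section; it ignores the per-token fees of the higher-quality tier and any negotiated discounts, and the function names are illustrative.

```python
INPUT_RATE = 0.06   # USD per minute of audio input, as quoted above
OUTPUT_RATE = 0.24  # USD per minute of audio output, as quoted above


def call_cost(input_min: float, output_min: float) -> float:
    """Cost in USD of one call, given minutes of user and assistant speech."""
    return input_min * INPUT_RATE + output_min * OUTPUT_RATE


def monthly_cost(calls_per_month: int, input_min: float, output_min: float) -> float:
    """Cost in USD of a month of identical calls."""
    return calls_per_month * call_cost(input_min, output_min)


if __name__ == "__main__":
    # The worked example from this section: 2 min in, 3 min out, 10,000 calls.
    print(f"per call:  ${call_cost(2, 3):.2f}")
    print(f"per month: ${monthly_cost(10_000, 2, 3):,.2f}")
```

Only spoken minutes are billed in this model, so silence and hold time within a call do not change the result.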

The honest comparison

vs. Deepgram Aura (voice agent stack) – Deepgram now offers a comparable “Aura” streaming agent. Their ASR is cheaper ($0.0043/min input) and their TTS is $0.015/min. However, you still need to bring your own LLM (or use their LLM integration) and manage state. Total per‑call cost can be 3–5x lower than OpenAI, but you trade off integration effort and the reliability of a single‑vendor pipeline. For high‑volume flows, Deepgram’s per‑minute pricing wins.
vs. Google Agent with Vertex AI – Google’s stack combines Chirp ASR, Gemini, and Chirp TTS. Latency is comparable (200–400ms), and pricing is slightly cheaper for output audio ($0.18/min) but requires managing tooling for session state. Google offers regional and HIPAA compliance, which OpenAI does not. The trade‑off: lower latency variance but more pieces to manage.
vs. Custom (Whisper + GPT + Piper) – The DIY approach can cut costs to $0.05–0.10 per call if you self‑host small models (Whisper‑medium, Llama‑3.1‑8B, Piper), but latency degrades unless you have GPU capacity close to your users. For a startup with <1000 calls/day, the engineering time to build and tune a custom stack often exceeds the API cost savings.
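The build-versus-buy point in the last comparison comes down to a break-even calculation. The sketch below shows the shape of that calculation; the per-call figures in the example come from this review ($0.84 for the API example call, $0.10 for the upper DIY estimate), while the one-off build cost is a purely hypothetical input you would replace with your own estimate.

```python
import math


def months_to_break_even(calls_per_month: int,
                         api_cost_per_call: float,
                         diy_cost_per_call: float,
                         build_cost: float) -> float:
    """Months until per-call savings repay a one-off build cost.

    Returns math.inf when the DIY stack is not cheaper per call.
    """
    monthly_saving = calls_per_month * (api_cost_per_call - diy_cost_per_call)
    if monthly_saving <= 0:
        return math.inf
    return build_cost / monthly_saving


if __name__ == "__main__":
    # build_cost=50_000 is a made-up placeholder, not a figure from the review.
    months = months_to_break_even(10_000, 0.84, 0.10, build_cost=50_000.0)
    print(f"break-even after {months:.1f} months")
```

The DIY per-call figure should include hosting and maintenance, otherwise the break-even point will look earlier than it really is.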

When to use it

Use the OpenAI Realtime API when you need a working voice agent in days, your call volume is moderate (<50,000 minutes/month), and you are willing to pay a premium to avoid stitching together speech, language, and synthesis.

Last verified: 2026-06-08 by kernel.