OpenAI Realtime API

Voice agent infrastructure · by OpenAI

OpenAI Realtime API Review

Voice agent infrastructure is a crowded space, but OpenAI’s Realtime API is one of the few options that bundles speech‑to‑text, large language model reasoning, and text‑to‑speech into a single streaming endpoint. As of 2026, it remains a strong candidate for teams that want to ship a capable voice bot without stitching together multiple vendors, provided they can accept the pricing and some loss of control.

What it actually does

The Realtime API works over a persistent WebSocket connection. You stream raw audio to it, optionally letting server‑side voice activity detection segment the stream into utterances, and the API streams back audio containing the model’s spoken response. Internally, it runs an end‑to‑end pipeline: an ASR model (OpenAI’s Whisper variant) transcribes speech, a GPT‑4o‑class model generates a reply, and a neural TTS engine (likely a successor to their “alloy” voice) synthesizes the output. The round trip is designed to stay under 300 ms for simple queries. The API handles interruptions (barge‑in) and turn‑taking automatically, and the exact behavior is configurable via session parameters.
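To make the message flow above concrete, here is a minimal sketch of the two JSON messages a client would typically send over the WebSocket: one that configures the session (voice and server‑side VAD) and one that carries a chunk of raw audio. The event names (`session.update`, `input_audio_buffer.append`) and field names (`turn_detection`, `silence_duration_ms`, `voice`) are assumptions modeled on OpenAI's published event schema and may differ between API versions, so check them against the current reference. The helper function names are hypothetical, and no network call is made.

```python
import base64
import json


def session_update_event(voice: str = "alloy", silence_ms: int = 500) -> str:
    """Build a session-configuration message that enables server-side VAD.

    Field names are assumptions; verify them against the current API reference.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,
            "turn_detection": {
                "type": "server_vad",
                # Silence (in ms) after which the server treats the turn as finished.
                "silence_duration_ms": silence_ms,
            },
        },
    })


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap one chunk of raw PCM audio as a base64-encoded JSON message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```

In a real client these strings would be sent over the open WebSocket, with the configuration message first and then one audio message per captured chunk. Building the payloads separately from the transport keeps them easy to unit test.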

Who it's for

This tool is aimed at teams building production voice agents—customer support bots, receptionist assistants, or voice‑enabled copilots—where the total cost of integration is more important than per‑call cost. It’s a good fit when you want to send audio in, get audio out, and not think about component tuning or infrastructure scaling. If you are a solo developer or a small startup that needs a demo in a week and can accept vendor lock‑in, the Realtime API is a fast path to a working prototype.

What works

What breaks

Pricing reality

As of mid‑2026, OpenAI bills the Realtime API on a per‑audio‑minute basis for both input and output. The published rates are:

A typical 10‑minute call with 2 minutes of user speech and 3 minutes of assistant speech incurs roughly $0.84. At 10,000 calls per month, that’s $8,400. Volume discounts start above 1M minutes per month, but they are negotiated individually and are not public. There is no free tier, only a $5 credit for new accounts.
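The arithmetic behind the monthly figure can be sketched as follows. The $8,400 total across 10,000 calls implies about $0.84 per call. The per‑minute rates in the snippet are hypothetical placeholders, chosen only so the formula reproduces that implied figure. They are not taken from this review or from OpenAI's price list, so substitute the published rates before relying on the output.

```python
def per_call_cost(user_min: float, assistant_min: float,
                  input_rate: float, output_rate: float) -> float:
    """Cost of one call: billed input minutes plus billed output minutes."""
    return user_min * input_rate + assistant_min * output_rate


def monthly_cost(cost_per_call: float, calls_per_month: int) -> float:
    """Scale a per-call cost to a monthly bill."""
    return cost_per_call * calls_per_month


# Per-call cost implied by the review's monthly total.
implied_per_call = 8400 / 10_000

# HYPOTHETICAL per-minute rates (placeholders, not published prices).
INPUT_RATE = 0.06
OUTPUT_RATE = 0.24

# 2 minutes of user speech, 3 minutes of assistant speech.
example_call = per_call_cost(2, 3, INPUT_RATE, OUTPUT_RATE)
example_month = monthly_cost(example_call, 10_000)
```

Only spoken minutes are billed in this model, so the 5 silent minutes of the 10‑minute call do not appear in the formula.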

The honest comparison

vs. Deepgram Aura (voice agent stack) – Deepgram now offers a comparable “Aura” streaming agent. Its ASR is cheaper ($0.0043/min input) and its TTS costs $0.015/min. However, you still need to bring your own LLM (or use Deepgram’s LLM integration) and manage state. Total per‑call cost can be 3–5x lower than OpenAI’s, but you take on the integration effort and give up the reliability of a single‑vendor pipeline. For high‑volume flows, Deepgram’s per‑minute pricing wins.

vs. Google Agent with Vertex AI – Google’s stack combines Chirp ASR, Gemini, and Chirp TTS. Latency is comparable (200–400ms), and output audio is slightly cheaper ($0.18/min), but you have to manage your own tooling for session state. Google offers regional and HIPAA compliance, which OpenAI does not. The trade‑off: lower latency variance but more pieces to manage.

vs. Custom (Whisper + GPT + Piper) – The DIY approach can cut costs to $0.05–0.10 per call if you self‑host small models (Whisper‑medium, Llama‑3.1‑8B, Piper), but latency degrades unless you have GPU capacity close to your users. For a startup with <1000 calls/day, the engineering time to build and tune a custom stack often exceeds the API cost savings.
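The build‑versus‑buy argument for the custom stack comes down to a break‑even calculation: how many months of per‑call savings it takes to repay the engineering effort. The sketch below assumes the $0.84 per‑call API cost implied by the pricing section and the upper end of the $0.05–0.10 DIY range. The build cost and the daily call volume are hypothetical inputs, not figures from this review.

```python
from typing import Optional


def breakeven_months(api_cost_per_call: float, diy_cost_per_call: float,
                     calls_per_day: int, build_cost: float,
                     days_per_month: int = 30) -> Optional[float]:
    """Months of per-call savings needed to repay a one-off build cost.

    Returns None when the DIY stack is not cheaper per call.
    """
    monthly_saving = (api_cost_per_call - diy_cost_per_call) * calls_per_day * days_per_month
    if monthly_saving <= 0:
        return None
    return build_cost / monthly_saving


# HYPOTHETICAL inputs: a $60,000 build effort at 100 calls per day.
months = breakeven_months(
    api_cost_per_call=0.84,   # implied by the pricing section
    diy_cost_per_call=0.10,   # upper end of the DIY range quoted above
    calls_per_day=100,
    build_cost=60_000,
)
```

The result is very sensitive to volume, since the monthly saving scales linearly with calls per day, and it ignores the ongoing maintenance and GPU hosting costs of the custom stack.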

When to use it

Use the OpenAI Realtime API when you need a working voice agent in days, your call volume is moderate (<50,000 minutes/month), and you are willing to pay a premium to avoid stitching together speech, language, and synthesis.

Last verified: 2026-06-08 by kernel.