OpenAI Launches New Realtime Voice Models for Reasoning, Translation, and Transcription in the API

OpenAI has released three new realtime audio models in the API - GPT‑Realtime‑2 with GPT‑5‑class reasoning, GPT‑Realtime‑Translate for live multilingual translation across 70+ languages, and GPT‑Realtime‑Whisper for streaming speech-to-text - enabling developers to build voice experiences that can reason, translate, transcribe, and take action during live conversations.

OpenAI, May 7, 2026

OpenAI is introducing three new audio models in the API that let developers build voice experiences that feel more natural, respond more intelligently, and take action in real time:

  • GPT‑Realtime‑2 - OpenAI's first voice model with GPT‑5‑class reasoning, capable of handling more difficult requests and carrying conversations forward naturally.
  • GPT‑Realtime‑Translate - a live translation model that converts speech from over 70 input languages into 13 output languages while keeping up with the speaker.
  • GPT‑Realtime‑Whisper - a streaming speech-to-text model that transcribes speech in real time as the speaker talks.

Voice as an Interface Between People and Products

Voice is rapidly becoming one of the most natural ways for people to interact with software - whether asking for help while driving, adjusting travel plans at an airport, getting support in a preferred language, or moving through a task hands-free.

However, building truly useful voice products requires more than fast turn-taking or natural-sounding output. A voice agent must understand intent, maintain context, recover when requests change, use tools mid-conversation, and respond appropriately to the moment.

Developers are building around three emerging patterns in voice AI:

  • Voice-to-action - Users describe what they need and the system reasons through the request, invokes tools, and completes the task. For instance, Zillow is building an assistant that listens, reasons, and acts on requests like finding homes within a budget, avoiding busy streets, and scheduling a tour.
  • Systems-to-voice - Software turns contextual information into live spoken guidance. A travel app, for example, could proactively inform a traveler about a flight delay, suggest a new gate, map a route through the terminal, and confirm baggage transfer.
  • Voice-to-voice - AI facilitates live conversations across languages, tasks, or changing context. Deutsche Telekom is building voice support where customers speak in their most comfortable language while the model translates the conversation in real time.

These patterns can also combine. Priceline is working toward a future where travelers manage entire trips by voice - searching for flights and hotels conversationally, handling changes like adjusting reservations after delays, getting real-time updates on TSA wait times, and translating conversations on the ground.

Realtime Voice: Helping Voice Models Reason and Take Action

GPT‑Realtime‑2 is designed for live voice interactions where the model keeps the conversation moving while reasoning through requests, calling tools, handling corrections or interruptions, and responding contextually. Key capabilities include:

  • Preambles: Developers can enable short transitional phrases like "let me check that" or "one moment" so users know the agent is actively working.
  • Parallel tool calls and tool transparency: The model can invoke multiple tools simultaneously and narrate those actions with phrases like "checking your calendar," keeping agents responsive during task completion.
  • Stronger recovery behavior: When something goes wrong, the model says so with phrases like "I'm having trouble with that right now" instead of failing silently.
  • Longer context for agentic workflows: The context window has been expanded from 32K to 128K tokens to support longer, more coherent sessions and complex task flows.
  • Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other domain-specific vocabulary important in production settings.
  • More controllable tone and delivery: The model adjusts its tone more effectively - speaking calmly during issue resolution, empathetically when a user is frustrated, or upbeat when confirming a successful action.
  • Adjustable reasoning effort: Developers can choose among minimal, low, medium, high, and xhigh reasoning levels, with low as the default, trading lower latency on simple interactions for deeper reasoning on complex requests.
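
As a rough illustration of two items in the list above (reasoning effort levels and parallel tool calls), here is a minimal Python sketch. It does not call the Realtime API and does not show its request format. The function names (pick_effort, check_calendar, find_listings, run_tools_in_parallel) and the tool outputs are hypothetical. Only the five effort levels and the "low" default come from the announcement.

```python
import asyncio

# Effort levels named in the announcement; "low" is the stated default.
REASONING_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"


def pick_effort(requested=None):
    """Return a valid effort level, falling back to the default."""
    if requested is None:
        return DEFAULT_EFFORT
    if requested not in REASONING_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {requested!r}")
    return requested


# Hypothetical tools; real ones would call a calendar or listings backend.
async def check_calendar(day):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"{day}: free after 2pm"


async def find_listings(max_price):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"3 homes under ${max_price:,}"


async def run_tools_in_parallel(calls):
    """Run every (tool, kwargs) pair concurrently; results keep call order."""
    return await asyncio.gather(*(tool(**kwargs) for tool, kwargs in calls))


results = asyncio.run(run_tools_in_parallel([
    (check_calendar, {"day": "Saturday"}),
    (find_listings, {"max_price": 500000}),
]))
print(pick_effort(), results)
```

Running the tool calls concurrently means the total wait is roughly that of the slowest tool rather than the sum of all of them, which is the responsiveness benefit the list describes.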

On benchmarks closely aligned with production voice agents, GPT‑Realtime‑2 (high) scores 15.2% higher on Big Bench Audio for audio intelligence compared to GPT‑Realtime‑1.5. GPT‑Realtime‑2 (xhigh) scores 13.8% higher on Audio MultiChallenge for instruction following, demonstrating stronger reasoning, context management, and control in live conversations.

Early testing partners have reported strong results. Zillow noted a 26-point lift in call success rate on adversarial benchmarks after prompt optimization (95% vs. 69%) and materially more robust Fair Housing compliance.

Realtime Translation: Live Multilingual Voice Experiences

GPT‑Realtime‑Translate enables developers to build live multilingual voice experiences where each participant can speak in their preferred language and hear the conversation translated in real time, along with real-time transcriptions. It supports more than 70 input languages and 13 output languages, making it suitable for customer support, cross-border sales, education, events, media, and creator platforms with global audiences.

The model is designed to preserve meaning while maintaining pace with the speaker, even when people speak naturally, switch context, or use regional pronunciation and domain-specific language. Deutsche Telekom is testing the model for multilingual voice interactions, and Vimeo has demonstrated how it can translate product education videos live so global customers can hear updates in their preferred language.

BolnaAI reported that in evaluations across Hindi, Tamil, and Telugu, GPT‑Realtime‑Translate delivered a 12.5% lower word error rate (WER) than any other model tested, along with lower fallback rates, higher task completion, and latency that sustained natural conversation.
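
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below implements that standard definition in Python. It is not BolnaAI's evaluation code, and the announcement does not say how text was normalized or tokenized, so the whitespace split here is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the output contains many inserted words.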

Realtime Transcription: Low-Latency Streaming Speech-to-Text

GPT‑Realtime‑Whisper is a streaming transcription model built for low-latency speech-to-text. It transcribes audio as people speak, enabling live products to feel faster, more responsive, and more natural - from captions appearing in the moment to meeting notes keeping up with the conversation.

The model makes live speech usable inside business workflows as it happens. Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need continuous user understanding; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions.

Safety

The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. Active classifiers monitor Realtime API sessions, so conversations detected as violating OpenAI's harmful content guidelines can be halted. Developers can also add their own safety guardrails using the Agents SDK.

OpenAI's usage policies prohibit repurposing or distributing outputs for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they are interacting with AI, unless it is already obvious from the context.

The Realtime API fully supports EU Data Residency for EU-based applications and is covered by OpenAI's enterprise privacy commitments.

Pricing and Availability

GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper are available in the Realtime API at the following prices:

  • GPT‑Realtime‑2: $32 per 1M audio input tokens ($0.40 for cached input tokens) and $64 per 1M audio output tokens.
  • GPT‑Realtime‑Translate: $0.034 per minute.
  • GPT‑Realtime‑Whisper: $0.017 per minute.
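
To make the prices above concrete, here is a small Python cost estimator. The rates are the ones listed. The function names, the example token counts, and the assumption that cached input tokens are billed separately from (not in addition to) regular audio input tokens are illustrative only. Text-token pricing is not stated in the announcement and is left out.

```python
from decimal import Decimal

# Rates from the announcement (USD).
AUDIO_INPUT_PER_M = Decimal("32")      # per 1M audio input tokens
CACHED_INPUT_PER_M = Decimal("0.40")   # per 1M cached input tokens
AUDIO_OUTPUT_PER_M = Decimal("64")     # per 1M audio output tokens
TRANSLATE_PER_MIN = Decimal("0.034")   # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = Decimal("0.017")     # GPT-Realtime-Whisper, per minute
MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate GPT-Realtime-2 audio cost.

    Assumption: input_tokens counts only uncached audio input tokens, and
    cached_input_tokens are billed at the cached rate instead.
    """
    return (
        AUDIO_INPUT_PER_M * input_tokens
        + CACHED_INPUT_PER_M * cached_input_tokens
        + AUDIO_OUTPUT_PER_M * output_tokens
    ) / MILLION


def per_minute_cost(minutes, rate_per_minute):
    """Estimate cost for the per-minute models (Translate or Whisper)."""
    return rate_per_minute * Decimal(minutes)


# Hypothetical session: 100K uncached input, 20K cached input, 50K output tokens.
print(realtime2_cost(100_000, 50_000, cached_input_tokens=20_000))
print(per_minute_cost(30, TRANSLATE_PER_MIN))
print(per_minute_cost(30, WHISPER_PER_MIN))
```

Decimal is used instead of float so that small per-minute rates add up without binary rounding drift.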

Developers can test the new realtime voice models in the Playground and start building using Codex.