OpenAI Realtime API with WebRTC


OpenAI provides a powerful set of APIs for realtime interaction with its models. While working on a client project, I needed to implement realtime capabilities as a feature in a web application. OpenAI supports two transport methods for the Realtime API: WebSocket and WebRTC. I have already written a detailed article on how to implement a WebSocket connection with OpenAI Realtime, along with a minimal demo example.

WebSocket provides a lower-level interface for audio input and output, and OpenAI exposes more events for granular audio handling over WebSocket. For browser clients, however, the recommended transport is WebRTC, because it is more reliable and consistent than WebSocket when running directly in the browser.

In this article we will learn how to use WebRTC in the browser, stream voice from the microphone to OpenAI, and receive an audio response along with transcripts from both sides.

Creating a Session

OpenAI supports two approaches for connecting to the Realtime API: the unified interface and ephemeral API keys. The unified interface is simpler but requires a server to establish the connection. Ephemeral keys are short-lived tokens that the client can use to establish the connection directly. We will cover both methods in this article.

Setting up the Peer Connection

const peerConnection = new RTCPeerConnection();
const audioElement = new Audio();
audioElement.autoplay = true;
peerConnection.ontrack = (event) => {
    audioElement.srcObject = event.streams[0];
};

const userAudioStream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: false,
});

const audioTrack = userAudioStream.getAudioTracks()[0];
if (!audioTrack) {
    console.error("No audio tracks found in the user media stream.");
    peerConnection.close();
    return;
}
peerConnection.addTrack(audioTrack);

const dataChannel = peerConnection.createDataChannel("dc-events");

const offer = await peerConnection.createOffer();
if (!offer.sdp) {
    console.error("Failed to create SDP offer.");
    peerConnection.close();
    return;
}
await peerConnection.setLocalDescription(offer);

First, we create a peer connection and attach an ontrack handler that routes incoming audio tracks from OpenAI into the audio element. We use the MediaDevices API to access the microphone and add the audio track to the peer connection so that OpenAI can process the audio directly. We also create a data channel on this connection to send and receive events. Finally, we generate a local Session Description Protocol (SDP) offer, which will be used to initiate the session with OpenAI.
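While developing, it also helps to watch the peer connection state so that failed handshakes surface early. A minimal sketch (the helper names isTerminalState and watchConnectionState are our own, not part of any API):

```typescript
// Log connection-state changes and flag terminal states.
// `isTerminalState` and `watchConnectionState` are our own helper names.
const isTerminalState = (state: string): boolean =>
    state === "failed" || state === "closed";

const watchConnectionState = (pc: {
    connectionState: string;
    onconnectionstatechange: (() => void) | null;
}) => {
    pc.onconnectionstatechange = () => {
        console.log("Peer connection state:", pc.connectionState);
        if (isTerminalState(pc.connectionState)) {
            // surface an error to the user and run your cleanup here
        }
    };
};

// watchConnectionState(peerConnection);
```

Call it right after creating the peer connection; states progress through connecting and connected, and land in failed or closed when something goes wrong.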

Via Unified Interface

Let’s create a session using the unified interface. In this approach, the server proxies the SDP exchange so that the API key is never exposed to the client. You can use any backend technology for this part. I also have an example in PHP.

const openAIApiKey = import.meta.env.OPENAI_API_KEY;
const sessionConfig = {
    type: "realtime",
    model: "gpt-realtime-mini", // use a model that supports realtime
};
const fd = new FormData();
fd.set("sdp", offer.sdp); // SDP offer created in the section above
fd.set("session", JSON.stringify(sessionConfig));

const response = await fetch("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    headers: {
        Authorization: `Bearer ${openAIApiKey}`, // make sure OpenAI key is available
    },
    body: fd,
});
if (!response.ok) {
    const errorDetails = await response.text();
    console.error(`Failed to create WebRTC session. Status: ${response.status}, Details: ${errorDetails}`);
    return;
}
const sdp = await response.text();
return { sdp };

The code above is straightforward. We send the local SDP offer to the OpenAI Realtime API, and it returns the remote SDP answer needed to complete the peer connection.

Via Ephemeral API key

The other way to create a session is with an ephemeral key. Generating the ephemeral key still requires a server-side step, but after that the client handles the SDP exchange directly with OpenAI. You can use any backend technology to generate the token. In this article I am using Astro.js with Astro Actions for server-side rendering. The snippet below uses the official OpenAI Node.js package and reads the secret key from an environment variable. The ephemeral token is returned so the client can use it directly.

import { defineAction } from "astro:actions";
import { z } from "astro/zod";
import OpenAI from "openai";

const openAIApiKey = import.meta.env.OPENAI_API_KEY;

export const server = {
    getOpenAIToken: defineAction({
        input: z
            .object({
                expireAfter: z.number().optional(),
            })
            .optional(),
        handler: async (input) => {
            const client = new OpenAI({
                apiKey: openAIApiKey,
            });
            const body: { expires_after?: { seconds: number; anchor: "created_at" } } = {};
            if (input?.expireAfter) {
                body.expires_after = { seconds: input.expireAfter, anchor: "created_at" };
            }

            const clientSecret = await client.realtime.clientSecrets.create(body);
            return { ephemeralKey: clientSecret.value };
        },
    }),
};

The action above returns an ephemeral key and accepts an optional parameter to control how long the token remains valid.

Once you have the ephemeral key on the client, use it to post the SDP offer directly to OpenAI:

const sessionConfig = {
    type: "realtime",
    model: "gpt-realtime-mini", // use a model that supports realtime
};
const fd = new FormData();
fd.set("sdp", offer.sdp); // SDP offer created in the section above
fd.set("session", JSON.stringify(sessionConfig));

const response = await fetch("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    headers: {
        Authorization: `Bearer ${ephemeralKey}`, // ephemeral key received from server above
    },
    body: fd,
});
if (!response.ok) {
    const errorDetails = await response.text();
    console.error(`Failed to create WebRTC session. Status: ${response.status}, Details: ${errorDetails}`);
    return;
}
const sdp = await response.text();
return { sdp };

This is the same request we made in the Unified Interface section, except now it runs entirely on the client using the short-lived ephemeral key instead of the main API key.

Completing the Connection Setup

Once you have received the remote SDP, regardless of which approach you used, set it on the peer connection to complete the WebRTC handshake.

await peerConnection.setRemoteDescription({ type: "answer", sdp: sdp }); // remote SDP

Setting Up Session Initialization

Now that the peer connection is established, we need to send a session.update client event to configure the session with the parameters our application requires.

const initializeRealtime = (dataChannel: RTCDataChannel) => {
    const initEvent = {
        type: "session.update",
        session: {
            type: "realtime",
            instructions: "You are a helpful assistant that provides concise answers to user queries.",
            audio: {
                output: {
                    format: {
                        rate: 24000,
                        type: "audio/pcm",
                    },
                    voice: "cedar",
                },
                input: {
                    turn_detection: null,
                    transcription: {
                        model: "gpt-4o-mini-transcribe",
                        language: "en",
                        prompt: "expect words from a user who is not a native English speaker.",
                    },
                },
            },
        },
    };
    dataChannel.send(JSON.stringify(initEvent));
};

The configuration above sets the system instructions and audio parameters for both input and output. For a full list of available options, refer to the session.update event documentation.

The default modality is audio, which means the model will return both an audio response and an audio transcript. Output audio uses PCM format at 24kHz with the cedar voice. Input audio is sent in the same format directly from the microphone. We have also enabled input transcription using the gpt-4o-mini-transcribe model, which you can remove if transcription is not needed. Voice activity detection (VAD) is disabled so that we can manually control when to trigger a response.
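If you would rather let OpenAI detect turns automatically instead of committing the buffer manually, turn_detection can be set to server-side VAD. A sketch of the object you would drop in place of turn_detection: null above (the threshold and timing values are illustrative, not recommendations):

```typescript
// Illustrative server VAD settings; tune the values for your app.
const turnDetection = {
    type: "server_vad",
    threshold: 0.5,         // speech probability needed to open a turn
    prefix_padding_ms: 300, // audio kept from just before speech starts
    silence_duration_ms: 500, // silence required before the turn closes
};
```

With VAD enabled, OpenAI commits the buffer and triggers responses on its own, so the manual commit events covered later become unnecessary.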

The best place to call initializeRealtime is inside the data channel’s onopen event handler so it runs as soon as the channel is ready.

dataChannel.onopen = () => {
    initializeRealtime(dataChannel);
};

Handling Server Events

We need to handle incoming server events to process the AI audio transcript and the user audio transcript. OpenAI sends different event types depending on the session modality.

Handling the AI Audio Transcript

When the session modality is set to [audio] (the default), OpenAI sends response.output_audio_transcript.delta events containing transcript fragments as the response streams in. When the modality is set to [text], it sends response.output_text.delta instead.

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    // when modality is set to [audio]
    if (data.type === "response.output_audio_transcript.delta") {
        const aiTranscriptText = data.delta;
        // use this text as per your app requirement (you can stream it)
    }

    // when modality set to [text]
    if (data.type === "response.output_text.delta") {
        const aiTranscriptText = data.delta;
        // use this text as per your app requirement (you can stream it)
    }

};

To receive the complete finished transcript rather than fragments, listen for response.output_audio_transcript.done or response.output_text.done:

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    // when modality is set to [audio]
    if (data.type === "response.output_audio_transcript.done") {
        const aiTranscriptText = data.transcript;
    }

    // when modality set to [text]
    if (data.type === "response.output_text.done") {
        const aiTranscriptText = data.text;
    }
};

Handling the User Audio Transcript

To receive the user’s spoken transcript (if transcription was enabled during session configuration), listen for conversation.item.input_audio_transcription.delta:

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "conversation.item.input_audio_transcription.delta") {
        const userTranscriptText = data.delta;
    }
};

To receive the complete transcript instead of fragments, use conversation.item.input_audio_transcription.completed:

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "conversation.item.input_audio_transcription.completed") {
        const userTranscriptText = data.transcript;
    }
};

When a response is interrupted or cancelled, you can detect this using response.done with a status of cancelled:

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "response.done" && data.response?.status === "cancelled") {
        // the response was cancelled (e.g. interrupted by the user)
    }
};

You can also listen for response.output_audio.done if you need to take action after OpenAI has finished sending the audio response:

dataChannel.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "response.output_audio.done") {
    // fired when OpenAI Realtime API has finished sending audio chunks for a response
    }
};

OpenAI has detailed documentation on server events covering all available event types. Note that not all events are available over a WebRTC connection. Some are WebSocket specific.
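One practical note: each snippet above assigns dataChannel.onmessage, and every assignment replaces the previous handler. In a real application you want a single handler that branches on the event type. A sketch of how the transcript events could be consolidated (parseTranscriptEvent and attachTranscriptHandler are our own names):

```typescript
// A single dispatcher for transcript events; assigning onmessage once
// avoids the isolated snippets above overwriting one another.
type TranscriptUpdate = {
    role: "assistant" | "user";
    text: string;
    final: boolean;
};

const parseTranscriptEvent = (data: {
    type: string;
    delta?: string;
    transcript?: string;
}): TranscriptUpdate | null => {
    switch (data.type) {
        case "response.output_audio_transcript.delta":
            return { role: "assistant", text: data.delta ?? "", final: false };
        case "response.output_audio_transcript.done":
            return { role: "assistant", text: data.transcript ?? "", final: true };
        case "conversation.item.input_audio_transcription.delta":
            return { role: "user", text: data.delta ?? "", final: false };
        case "conversation.item.input_audio_transcription.completed":
            return { role: "user", text: data.transcript ?? "", final: true };
        default:
            return null; // other server events: handle elsewhere or ignore
    }
};

const attachTranscriptHandler = (channel: {
    onmessage: ((event: { data: string }) => void) | null;
}) => {
    channel.onmessage = (event) => {
        const update = parseTranscriptEvent(JSON.parse(event.data));
        if (update) {
            // route update.text to the assistant or user panel via update.role
        }
    };
};
```

The structural parameter type keeps the helper easy to unit test; in the app you would simply pass the real RTCDataChannel.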

Handling Client Events

Since we are manually controlling the audio flow, we need to use specific client events to manage the conversation.

Send these two events once the user has finished speaking to commit the audio buffer and trigger a response from OpenAI:

dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
dataChannel.send(JSON.stringify({ type: "response.create" }));

To clear the audio buffer before starting a fresh turn:

dataChannel.send(JSON.stringify({ type: "input_audio_buffer.clear" }));

To cancel or interrupt an ongoing response:

dataChannel.send(JSON.stringify({ type: "output_audio_buffer.clear" }));
dataChannel.send(JSON.stringify({ type: "response.cancel" }));

OpenAI provides several other client events depending on your application requirements. Again, not all events are available over a WebRTC connection. Some are WebSocket specific.

More Granular Control Over User Audio

In the peer connection setup above, we used the MediaDevices API to capture audio directly from the microphone. While this works, I recommend setting it up separately with more explicit control over muting and unmuting, so you can manage the conversation flow precisely.

if (!navigator.mediaDevices?.getUserMedia) {
    console.error("MediaDevices API or getUserMedia is not supported in this browser.");
    return null;
}
const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
        sampleRate: 24000,
        channelCount: 1, // mono audio for OpenAI Realtime API
        echoCancellation: true,
        noiseSuppression: true,
    },
    video: false,
});

const muteAudioStream = (stream: MediaStream | null) => {
    if (!stream) return;
    stream.getAudioTracks().forEach((track) => {
        track.enabled = false;
    });
    console.log("Audio stream muted.");
};

const unmuteAudioStream = (stream: MediaStream | null) => {
    if (!stream) return;
    stream.getAudioTracks().forEach((track) => {
        track.enabled = true;
    });
    console.log("Audio stream unmuted.");
};

The code above creates a microphone stream with the correct audio constraints for the OpenAI Realtime API (24kHz, mono, with echo cancellation and noise suppression). The muteAudioStream and unmuteAudioStream helpers let you control when audio is actually sent, which is essential for a manually controlled conversation flow.
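Putting this together with the client events from earlier, the manual turn flow can be sketched as two small functions (startTurn and stopTurn are our own names, and the structural types stand in for RTCDataChannel and MediaStream):

```typescript
// Sketch of manual turn control combining muting with client events.
// `startTurn`/`stopTurn` are our own names, not part of any API.
type EventChannel = { send: (message: string) => void };
type MutableStream = { getAudioTracks: () => { enabled: boolean }[] };

const startTurn = (stream: MutableStream, channel: EventChannel) => {
    channel.send(JSON.stringify({ type: "input_audio_buffer.clear" })); // fresh buffer
    stream.getAudioTracks().forEach((track) => (track.enabled = true)); // unmute
};

const stopTurn = (stream: MutableStream, channel: EventChannel) => {
    stream.getAudioTracks().forEach((track) => (track.enabled = false)); // mute
    channel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
    channel.send(JSON.stringify({ type: "response.create" }));
};
```

Wire startTurn to the Start Speaking button and stopTurn to Stop Speaking, passing the real microphone stream and data channel.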

One of the biggest advantages of WebRTC over WebSocket is that you do not need to set up a custom audio processor to format and send microphone audio, you do not need to handle base64 encoded audio chunks, and you do not need an audio worklet to manage the processing pipeline. WebRTC handles all of that transparently, letting OpenAI receive the audio directly through the peer connection and play back the response through the audio element we created.

Putting It All Together

You can view and download the full source code from the GitHub repository: OpenAI Realtime WebRTC.

The repository contains a minimal working demo built with Astro.js where a user can connect their microphone, speak, and receive a response from the OpenAI Realtime API. When the user clicks the Start Speaking button, the application initializes the WebRTC connection, either by proxying the SDP exchange through the server (Unified Interface) or by requesting an ephemeral token, and then updates the session configuration via the data channel.

When the user clicks the Stop Speaking button, the app commits the audio buffer and requests OpenAI to start generating a response. The demo captures transcripts from both the user and the AI and displays them in separate panels, while also playing back the audio response seamlessly through an HTMLAudioElement.

The full source code also includes error handling for WebRTC error events returned by OpenAI, as well as a cleanup routine that stops the microphone stream tracks and closes the peer connection gracefully. Both are important to include in any production application.
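A cleanup routine along those lines can be as small as the following sketch (the function name is our own, and the structural types stand in for MediaStream, RTCDataChannel, and RTCPeerConnection):

```typescript
// Sketch of a teardown: stop mic tracks, then close the channel and connection.
// `cleanupRealtime` is our own name, not part of any API.
const cleanupRealtime = (
    stream: { getTracks: () => { stop: () => void }[] } | null,
    channel: { close: () => void } | null,
    pc: { close: () => void } | null,
) => {
    stream?.getTracks().forEach((track) => track.stop()); // releases the microphone
    channel?.close();
    pc?.close();
};
```

Calling it in a finally block (and on page unload) ensures the microphone indicator turns off and the peer connection does not linger after an error.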

If you find any bugs or have questions, feel free to reach out on LinkedIn.