OpenAI Realtime API with WebSocket in the Browser


While working on a client project, we needed to explore the possibility of adding realtime conversation with an AI agent. Since the application is browser based, we had the option of using WebSocket or WebRTC. We tried both and settled on the WebRTC version. However, the WebSocket version is equally functional and worked really well. If interested, you can read my article on using WebRTC with OpenAI realtime API. This article is all about using WebSocket with the OpenAI Realtime API.

We will learn how to send live audio using a microphone and then receive an audio response, audio transcription, and user transcription in realtime from OpenAI. We will also play back the audio response so that you can hear it.

The recommended way to connect with the OpenAI Realtime API in a client side environment like the browser is WebRTC, but in this article we will show how to use WebSocket in the browser with an ephemeral token. Note that using a short lived ephemeral token mitigates security risk, but it does not completely eliminate it. It is recommended to avoid using any kind of token in a client side environment, but if you must, use an ephemeral token instead of your main API key.

Creating an Ephemeral Token

To connect the browser with the OpenAI Realtime API using WebSocket, we need an ephemeral token. To create one, we need server side (backend) code. You are free to use any backend technology to generate the token (Node, Python, Java, PHP, Go, etc). In this article, I am using Astro.js server side rendering with Astro Actions. In the code snippet below, we are using the official OpenAI Node.js package and reading the secret key from an environment variable. We return the ephemeral token so that it can be used on the client.

import { defineAction } from "astro:actions";
import { z } from "astro/zod";
import OpenAI from "openai";

const openAIApiKey = import.meta.env.OPENAI_API_KEY;

export const server = {
    getOpenAIToken: defineAction({
        input: z
            .object({
                expireAfter: z.number().optional(),
            })
            .optional(),
        handler: async (input) => {
            const client = new OpenAI({
                apiKey: openAIApiKey,
            });
            const body: { expires_after?: { seconds: number; anchor: "created_at" } } = {};
            if (input?.expireAfter) {
                body.expires_after = { seconds: input.expireAfter, anchor: "created_at" };
            }

            const clientSecret = await client.realtime.clientSecrets.create(body);
            return { ephemeralKey: clientSecret.value };
        },
    }),
};

The important point is that the backend must return the ephemeral token to the client, regardless of which technology you use.
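For reference, here is a framework-free sketch of the same step using the REST endpoint that the OpenAI SDK wraps. The endpoint path and response shape are assumed from the official API reference, so verify them against the current docs; the function runs on any backend with a global fetch, such as Node 18+.

```typescript
// Hypothetical framework-free version of the token endpoint: POSTs to the
// Realtime client_secrets REST endpoint and returns the ephemeral key.
const mintEphemeralKey = async (apiKey: string, expireAfter?: number): Promise<string> => {
    const body: Record<string, unknown> = {};
    if (expireAfter) {
        body.expires_after = { seconds: expireAfter, anchor: "created_at" };
    }
    const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
        method: "POST",
        headers: {
            Authorization: `Bearer ${apiKey}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`client_secrets request failed: ${res.status}`);
    const json = await res.json();
    return json.value as string; // the ephemeral token returned to the client
};
```

The shape mirrors the Astro action above: the optional `expires_after` object is forwarded, and only the `value` field is exposed to the client.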

Create a WebSocket Connection

Using the ephemeral token, let’s create a WebSocket connection with the OpenAI Realtime API. In this example, we set the token expiry to 5 minutes, meaning it can no longer be used to create a session after that.

const { data, error } = await actions.getOpenAIToken({ expireAfter: 60 * 5 });
if (error) {
    console.error("Error fetching OpenAI token:", error);
    return;
}

const socketConnection = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime-mini",
    ["realtime", "openai-insecure-api-key." + data.ephemeralKey],
);

socketConnection.onopen = async () => {
    console.log("WebSocket connection established with OpenAI Realtime API.");
};

socketConnection.onclose = (event: CloseEvent) => {
    console.log("WebSocket closed:", event.code, event.reason);
};

socketConnection.onerror = (event: Event) => {
    console.error("WebSocket error:", event);
};

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    // handle incoming server events here (covered in the sections below)
};

The code above is a basic example that creates a session for conversation with OpenAI and listens for incoming events. You can fetch the ephemeral token from your backend of choice with any HTTP client. In this example we are using the gpt-realtime-mini model.

Updating Session Configuration

Now that the session is created, we need to send a client event to update the session with the required parameters for our application.

const initializeRealtime = async () => {
    const initEvent = {
        type: "session.update",
        session: {
            type: "realtime",
            instructions: "You are a helpful assistant that provides concise answers to user queries.",
            audio: {
                output: {
                    format: {
                        rate: 24000,
                        type: "audio/pcm",
                    },
                    voice: "cedar",
                },
                input: {
                    turn_detection: null,
                    transcription: {
                        model: "gpt-4o-mini-transcribe",
                        language: "en",
                        prompt: "expect words from a user who is not a native English speaker.",
                    },
                },
            },
        },
    };
    socketConnection.send(JSON.stringify(initEvent));
};

In the code above, we configure the instructions along with various input and output audio parameters. To view all available options, visit the session.update event documentation. The default modality is audio, which lets us receive both the audio response (as a base64 encoded string) and an audio transcript of the model response alongside it. The output audio uses PCM format with a 24kHz sample rate. We are using the cedar voice, but you can choose any other supported voice.

For input audio, we will send audio in the same format (which is the default) directly from the microphone. We have also enabled transcription of the input audio using the gpt-4o-mini-transcribe model (you can remove this if you do not need it). In this example, we have disabled voice activity detection (VAD) so that we can manually trigger the response.
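If you would rather let the model detect end of speech automatically instead of triggering the response manually, you can enable server-side VAD in the same session.update event. Below is a sketch of that variant; the turn_detection field names are assumed from the session.update documentation, so verify them against the current docs.

```typescript
// Variant of the session.update event above with server VAD enabled.
// With this setting, OpenAI commits the input buffer and creates a
// response automatically when it detects the user has stopped speaking.
const vadSessionEvent = {
    type: "session.update",
    session: {
        type: "realtime",
        audio: {
            input: {
                turn_detection: {
                    type: "server_vad",
                    threshold: 0.5,           // speech detection sensitivity
                    prefix_padding_ms: 300,   // audio kept before detected speech
                    silence_duration_ms: 500, // silence length that ends a turn
                },
            },
        },
    },
};
// socketConnection.send(JSON.stringify(vadSessionEvent));
```

With VAD enabled, the manual input_audio_buffer.commit and response.create events shown later are no longer needed.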

Setting Up the User Audio Flow

Now that the WebSocket connection is up and running, our next goal is to create a media stream to capture microphone audio.

Connecting the Microphone

if (!navigator.mediaDevices?.getUserMedia) {
    console.error("MediaDevices API or getUserMedia is not supported in this browser.");
    return null;
}
const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
        sampleRate: 24000,
        channelCount: 1, // mono audio for OpenAI Realtime API
        echoCancellation: true,
        noiseSuppression: true,
    },
    video: false,
});

We are using the built in MediaDevices API to capture audio. Video is not needed in this example.

Browser Compatibility: Both getUserMedia and AudioWorklet require a secure context to function. This means your page must be served over HTTPS or from localhost. On plain HTTP, navigator.mediaDevices is undefined and the worklet module will fail to load, which can be a frustrating debugging experience if you are not aware of it.
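Permissions are another common failure point: getUserMedia rejects with a DOMException whose name indicates the reason. A small hypothetical helper (not part of the demo) that maps the standard error names to user-facing messages:

```typescript
// Maps getUserMedia rejection reasons to messages we can show the user.
// The case names are the standard DOMException names; adjust the wording
// to your application.
const describeMicError = (err: unknown): string => {
    const name = (err as { name?: string } | null)?.name ?? "";
    switch (name) {
        case "NotAllowedError":
            return "Microphone permission was denied.";
        case "NotFoundError":
            return "No microphone was found on this device.";
        case "NotReadableError":
            return "The microphone is already in use by another application.";
        default:
            return "Could not access the microphone.";
    }
};
```

Wrap the getUserMedia call above in try/catch and pass the caught error to this helper before showing it to the user.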

Creating an Audio Processor

Our next step is to set up a pipeline to capture raw audio from the microphone and convert it into a format that the OpenAI API understands. Instead of blocking the main thread, we will use a custom AudioWorklet to process our audio in the background.

if (!stream) {
    console.error("No media stream provided for audio processing.");
    return;
}
const audioContext = new AudioContext({ sampleRate: 24000 });
await audioContext.audioWorklet.addModule("audioProcessor.js");
const source = audioContext.createMediaStreamSource(stream);
const processor = new AudioWorkletNode(audioContext, "audio-processor");
source.connect(processor);

processor.port.onmessage = (event) => {
    const float32Array = event.data.audio;
    if (float32Array.length > 0) {
        const base64Audio = base64EncodeAudio(float32Array);
        // send base64Audio to OpenAI via input_audio_buffer.append (shown below)
    }
};

Here, we are using the AudioContext API to process the microphone audio. We set the sample rate to 24000 Hz because that is what the OpenAI API expects. By using AudioWorklet, we run custom audio processing code in a dedicated background thread, keeping the main thread free. We then connect the microphone stream as the source. The AudioWorkletNode API bridges the main thread and the worklet script (audioProcessor.js) running in the background. Connecting the source to the processor routes microphone audio through our custom processing logic. The worklet processes audio in small chunks and sends each chunk back to the main thread via port message events. Finally, we convert each chunk to base64 encoded PCM16 format, ready to be sent to OpenAI.

Custom Background Script

Let’s create the worklet script that processes audio received from the microphone.

// file audioProcessor.js
class AudioProcessor extends AudioWorkletProcessor {
    constructor() {
        super();
        this.bufferSize = 4800; // 200ms (at 24kHz sample rate)
        this.buffer = new Float32Array(this.bufferSize);
        this.bufferIndex = 0;
    }

    process(inputs, outputs, parameters) {
        const input = inputs[0];
        if (input.length > 0) {
            const channelData = input[0];
            for (let i = 0; i < channelData.length; i++) {
                this.buffer[this.bufferIndex++] = channelData[i];
                if (this.bufferIndex >= this.bufferSize) {
                    this.port.postMessage({
                        audio: this.buffer.slice(0),
                    });
                    this.bufferIndex = 0;
                }
            }
        }
        return true;
    }
}

registerProcessor("audio-processor", AudioProcessor);

This class gets built in access to the audio pipeline by extending AudioWorkletProcessor, which allows the browser to feed raw microphone data automatically. Instead of sending thousands of tiny audio chunks every millisecond, we accumulate samples into a buffer first.

  • bufferSize: 4800 means we accumulate 200ms worth of audio at 24kHz (24000 samples/sec x 0.2s = 4800). You can increase or decrease this size depending on your requirements.
  • buffer is the storage array that holds the raw audio samples.
  • bufferIndex tracks how many samples have been written to the buffer.
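The buffer-size arithmetic in isolation is samples = sampleRate x seconds. A tiny helper like this (hypothetical, for illustration) makes the latency trade-off explicit:

```typescript
// Number of samples needed to hold `ms` milliseconds of audio at `rate` Hz.
const samplesFor = (ms: number, rate = 24000): number => (rate * ms) / 1000;

samplesFor(200); // 4800 samples, the bufferSize used above (~200ms of latency)
samplesFor(100); // 2400 samples, lower latency but twice the messages per second
```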

The process method is called every few milliseconds by the browser with a fresh batch of audio samples, as long as the microphone is active. The inputs parameter is a nested array containing multiple sources, each with multiple channels. Since we are recording mono audio from a single microphone, we only use inputs[0][0]. We copy each sample into the buffer one by one. Once the buffer reaches 4800 samples (200ms), we send it to the main thread via a post message event. Then we reset bufferIndex to 0 and start filling again. Returning true tells the browser to keep calling this method. Finally, we register the processor with the same name used when creating the AudioWorkletNode above.

Converting Float32Array to Base64 Encoded PCM16

In the code example above, base64EncodeAudio is a custom function used to convert Float32Array data to base64 encoded PCM16 data compatible with the OpenAI input audio format. Below is the implementation.

// converts a Float32Array to a PCM16 ArrayBuffer, then to a base64-encoded string
const base64EncodeAudio = (float32Array: Float32Array): string => {
    const buffer = new ArrayBuffer(float32Array.length * 2);
    const view = new DataView(buffer);
    let offset = 0;
    for (let i = 0; i < float32Array.length; i++, offset += 2) {
        let s = Math.max(-1, Math.min(1, float32Array[i]));
        view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }

    let binary = "";
    const bytes = new Uint8Array(buffer);
    const chunkSize = 0x8000; // 32KB chunk size
    for (let i = 0; i < bytes.length; i += chunkSize) {
        let chunk = bytes.subarray(i, i + chunkSize);
        binary += String.fromCharCode.apply(null, Array.from(chunk));
    }
    return btoa(binary);
};

Sending Audio to the OpenAI Realtime API

To send the base64 encoded audio data to OpenAI, we use its realtime client events. Although not strictly required, it is good practice to clear any audio left over from a previous stream before starting the microphone.

socketConnection.send(JSON.stringify({ type: "input_audio_buffer.clear" }));

Then we append the audio chunks received from the background worklet to the OpenAI Realtime API using the input_audio_buffer.append event type.

const audioEvent = {
    type: "input_audio_buffer.append",
    audio: base64Audio,
};
socketConnection.send(JSON.stringify(audioEvent));

base64Audio is received from the audio processor described in the examples above. Once the user has finished speaking, we commit the buffer and ask OpenAI to generate a response.

socketConnection.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
socketConnection.send(JSON.stringify({ type: "response.create" }));

We first commit the buffered audio and then request the response. If you want to cancel a response mid generation, send the response.cancel event type.

socketConnection.send(JSON.stringify({ type: "response.cancel" }));

OpenAI provides several other client events that you can use depending on your application requirements.

Handling the Realtime Response

We need to handle the incoming audio response, its transcript, and the user audio transcript. To do this we use OpenAI realtime server events.

Handling Audio Response Transcript

When the session modality is set to [audio] (which is the default), OpenAI sends the response.output_audio_transcript.delta event type. When the modality is set to [text], it sends response.output_text.delta instead.

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    // when modality is set to [audio]
    if (data.type === "response.output_audio_transcript.delta") {
        const aiTranscriptText = data.delta;
        // use this text as per your app requirement (you can stream it)
    }

    // when modality set to [text]
    if (data.type === "response.output_text.delta") {
        const aiTranscriptText = data.delta;
        // use this text as per your app requirement (you can stream it)
    }
};

To receive the complete transcript, use either response.output_audio_transcript.done or response.output_text.done based on your modality setting.

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    // when modality is set to [audio]
    if (data.type === "response.output_audio_transcript.done") {
        const aiTranscriptText = data.transcript;
    }

    // when modality set to [text]
    if (data.type === "response.output_text.done") {
        const aiTranscriptText = data.text;
    }
};

Handling User Audio Transcript

To handle the user transcript (if enabled during session configuration), OpenAI sends the conversation.item.input_audio_transcription.delta event type.

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "conversation.item.input_audio_transcription.delta") {
        const userTranscriptText = data.delta;
    }
};

To receive the complete transcript instead of fragments, use the conversation.item.input_audio_transcription.completed event type.

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "conversation.item.input_audio_transcription.completed") {
        const userTranscriptText = data.transcript;
    }
};

Handling the Audio Response

In the default [audio] mode, OpenAI sends response.output_audio.delta events with audio chunks in base64 encoded format.

socketConnection.onmessage = (event: MessageEvent) => {
    const data = JSON.parse(event.data);
    if (data.type === "response.output_audio.delta") {
        const audioChunkBase64 = data.delta;
        playAudioChunk(audioChunkBase64); // we can play this audio directly using the Web Audio API
    }

    if (data.type === "response.output_audio.done") {
        // audio output done
    }
};

OpenAI has detailed documentation on server events that cover all available event types for various use cases.
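Note that the snippets above each assign socketConnection.onmessage separately for clarity; in a real application only the last assignment would survive, so all event types should be handled in a single handler. A minimal sketch of such a dispatcher (transcript accumulation only; the audio delta case uses the playback function covered in the next section):

```typescript
// A single onmessage handler that dispatches on the server event type,
// accumulating streamed transcript fragments into plain strings.
type ServerEvent = { type: string; delta?: string; transcript?: string };

let aiTranscript = "";
let userTranscript = "";

const handleServerEvent = (data: ServerEvent) => {
    switch (data.type) {
        case "response.output_audio_transcript.delta":
            aiTranscript += data.delta ?? ""; // streamed AI transcript fragment
            break;
        case "conversation.item.input_audio_transcription.completed":
            userTranscript = data.transcript ?? ""; // full user transcript
            break;
        case "response.output_audio.delta":
            // playAudioChunk(data.delta); // audio playback, covered next
            break;
    }
};

// Wire it up once:
// socketConnection.onmessage = (event: MessageEvent) =>
//     handleServerEvent(JSON.parse(event.data));
```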

Playing the Audio Stream from OpenAI

As shown above, we are receiving a base64 encoded audio stream from OpenAI. We will now show how you can stream it directly using the built in Web Audio API interface AudioContext, which we already used earlier to process microphone audio.

const audioContext = new AudioContext({ sampleRate: 24000 });
if (audioContext.state === "suspended") {
    await audioContext.resume();
}

Once AudioContext is initialized, we can use it to play the received audio chunks.

const playAudioChunk = async (base64Audio: string) => {
    if (!base64Audio) return;

    const float32Array = base64ToFloat32Array(base64Audio);
    if (!audioContext) {
        console.error("AudioContext is not initialized.");
        return;
    }
    const audioBuffer = audioContext.createBuffer(
        1, // mono
        float32Array.length,
        24000, // sample rate
    );

    audioBuffer.getChannelData(0).set(float32Array);

    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);

    const currentTime = audioContext.currentTime;

    const startTime = Math.max(currentTime, nextStartTime);
    source.start(startTime);
    nextStartTime = startTime + audioBuffer.duration;

    isPlaying = true;

    source.onended = () => {
        if (audioContext.currentTime >= nextStartTime - 0.01) {
            isPlaying = false;
        }
    };
};

In this function, we first decode the base64 encoded audio into a Float32Array, which is the format AudioContext understands. We then create an audio buffer with a mono channel at a sample rate of 24000 Hz (which matches what OpenAI sends), essentially creating an in memory audio clip. We fill the buffer with the audio data, create a one time playback node, and connect it to the speakers via audioContext.destination.

To make playback seamless, we queue chunks back to back without gaps or overlaps. nextStartTime is a variable maintained in the outer scope that tracks when the next audio chunk should begin playing. If a chunk arrives before the current one finishes, it starts exactly when the current one ends. If a chunk arrives late, it starts immediately. When a clip finishes, the playback node fires the onended event. We only mark playback as complete if the current time has caught up to the next start time, meaning no more chunks are queued. The 0.01 value is a small tolerance for floating point timing precision.
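The queueing rule can be stated in isolation: each chunk starts at max(currentTime, nextStartTime), and nextStartTime then advances by the chunk's duration. A small pure-function trace (illustrative only, not part of the demo):

```typescript
// Computes when a chunk should start playing and where the queue ends afterwards.
const schedule = (currentTime: number, nextStartTime: number, duration: number) => {
    const startTime = Math.max(currentTime, nextStartTime);
    return { startTime, nextStartTime: startTime + duration };
};

schedule(0.0, 0.2, 0.2); // chunk arrives early → starts at 0.2, right after the queue
schedule(0.5, 0.4, 0.2); // chunk arrives late  → starts immediately at 0.5
```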

To decode the base64 encoded audio into a Float32Array, use the following function.

const base64ToFloat32Array = (base64: string): Float32Array => {
    const binaryString = atob(base64);
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }

    const dataView = new DataView(bytes.buffer);
    const float32Array = new Float32Array(bytes.length / 2);

    for (let i = 0; i < float32Array.length; i++) {
        const int16 = dataView.getInt16(i * 2, true);
        float32Array[i] = int16 / (int16 < 0 ? 0x8000 : 0x7fff);
    }

    return float32Array;
};
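As a sanity check, encoding and decoding should round-trip: converting samples to base64 PCM16 and back should reproduce them to within 16-bit quantization error (about 1/32768). The sketch below restates compact copies of the two helpers (base64EncodeAudio and base64ToFloat32Array above) so it is self-contained.

```typescript
// Round-trip check: Float32 samples → base64 PCM16 → Float32 samples.
const encodePcm16Base64 = (samples: Float32Array): string => {
    const buffer = new ArrayBuffer(samples.length * 2);
    const view = new DataView(buffer);
    for (let i = 0; i < samples.length; i++) {
        const s = Math.max(-1, Math.min(1, samples[i]));
        view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }
    let binary = "";
    const bytes = new Uint8Array(buffer);
    for (let i = 0; i < bytes.length; i += 0x8000) {
        binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
    }
    return btoa(binary);
};

const decodePcm16Base64 = (base64: string): Float32Array => {
    const binaryString = atob(base64);
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }
    const view = new DataView(bytes.buffer);
    const out = new Float32Array(bytes.length / 2);
    for (let i = 0; i < out.length; i++) {
        const int16 = view.getInt16(i * 2, true);
        out[i] = int16 / (int16 < 0 ? 0x8000 : 0x7fff);
    }
    return out;
};

const original = new Float32Array([0, 0.5, -0.5, 1, -1]);
const decoded = decodePcm16Base64(encodePcm16Base64(original));

// largest per-sample deviation introduced by the 16-bit quantization
let maxError = 0;
for (let i = 0; i < original.length; i++) {
    maxError = Math.max(maxError, Math.abs(original[i] - decoded[i]));
}
```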

Putting It All Together

We have created a minimal working demo where a user can connect their microphone, speak, and receive a response from the OpenAI Realtime API. When the user clicks the Start Speaking button, the application initializes by creating an ephemeral token, establishing the WebSocket connection, and updating the session configuration. It also initializes the audio processor that captures and processes audio in the background and sends it to OpenAI for buffering.

When the user clicks the Stop Speaking button, the app commits the audio buffer and requests OpenAI to start generating a response in realtime. The demo captures both the user and AI transcripts and displays them, while also playing back the audio response seamlessly.

The full source code also includes error handling for WebSocket server error events sent back by OpenAI, as well as a cleanup routine that stops the microphone stream tracks, closes the AudioContext, and terminates the WebSocket connection gracefully. These are important steps to include in any production application, especially in single page applications where components mount and unmount frequently.
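A hedged sketch of such a cleanup routine is shown below. The resource shapes are typed structurally so the sketch stays self-contained; in the browser you would pass the actual MediaStream, AudioContext, and WebSocket from the earlier snippets.

```typescript
// Releases the microphone, audio hardware, and socket in one place.
type Cleanupable = {
    stream?: { getTracks(): { stop(): void }[] } | null;
    audioContext?: { state: string; close(): Promise<void> } | null;
    socket?: { readyState: number; close(): void } | null;
};

const cleanup = async ({ stream, audioContext, socket }: Cleanupable) => {
    // stop every microphone track so the browser's recording indicator goes away
    stream?.getTracks().forEach((track) => track.stop());

    // closing the AudioContext frees audio hardware resources
    if (audioContext && audioContext.state !== "closed") {
        await audioContext.close();
    }

    // end the realtime session gracefully (1 === WebSocket.OPEN)
    if (socket && socket.readyState === 1) {
        socket.close();
    }
};
```

Calling this from your component's unmount hook prevents leaked microphone streams and dangling sockets across remounts.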

You can view or download the full source code and demo from the GitHub repository: OpenAI Realtime WebSocket.

If you have any questions or want to connect, reach me on LinkedIn.