Batch Processing and Callbacks

This guide covers model.predict_batch(), asynchronous callbacks, and how to manage the inference queue.

Batch Processing with predict_batch()

The model.predict_batch() method is an async generator that processes a sequence of images. This is ideal for scenarios like processing frames from a video or handling a large dataset of images efficiently.

You can provide data to predict_batch in two main ways:

  • Async Iterable: Any object that implements the async iteration protocol, such as an array of image sources or a custom generator function.

  • ReadableStream: A standard web API for handling streams of data, perfect for sources like the WebCodecs API.

The method processes images from the source, sends them for inference, and yields the results as they become available. You consume these results using a for await...of loop.

Example 1: Camera Inference Using an Async Generator Function

Here's how you can define a simple async generator to feed webcam frames to predict_batch.

// Create video element and give it access to the webcam
const video = document.createElement('video');
video.autoplay = true;
video.style.display = 'none';
document.body.appendChild(video);

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
video.srcObject = stream;

// Wait for video to be ready
await new Promise(resolve => video.onloadedmetadata = resolve);

// Frame generator yielding camera frames + frameId
async function* frameGenerator() {
    let frameId = 0;
    while (true) {
        if (!video.videoWidth || !video.videoHeight) {
            // Yield to the event loop so this check doesn't busy-spin
            await new Promise(requestAnimationFrame);
            continue;
        }
        const bitmap = await createImageBitmap(video);
        yield [bitmap, `frame_${frameId++}`];
    }
}

// Run inference on the webcam frames
for await (const result of model.predict_batch(frameGenerator())) {
    model.displayResultToCanvas(result, 'outputCanvas');
}

Example 2: Using an Array as an Async Iterable

Here's how you can process a predefined list of image URLs. We create a simple async generator that yields the image URL and a unique frame identifier.
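A minimal sketch of that pattern is shown below. The URLs are placeholders; the generator yields the same [imageSource, frameId] pairs used in Example 1.

// Predefined list of image URLs (placeholders)
const imageUrls = [
    'https://example.com/cat.jpg',
    'https://example.com/dog.jpg',
    'https://example.com/bird.jpg'
];

// Async generator yielding [imageSource, frameId] pairs
async function* imageGenerator() {
    let frameId = 0;
    for (const url of imageUrls) {
        yield [url, `image_${frameId++}`];
    }
}

// Run inference on every image in the list
for await (const result of model.predict_batch(imageGenerator())) {
    console.log(result);
}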

Example 3: Using a ReadableStream

predict_batch can directly consume a ReadableStream. This is powerful for streaming video frames, for example from a file or a live camera feed using the WebCodecs API.
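As a rough sketch, a ReadableStream whose chunks are the same [imageSource, frameId] pairs used above can be passed straight to predict_batch. The chunk format here is an assumption; a real WebCodecs pipeline would enqueue decoded VideoFrame objects rather than URLs.

// Minimal ReadableStream that enqueues [imageSource, frameId] pairs
const frameStream = new ReadableStream({
    start(controller) {
        let frameId = 0;
        for (const url of imageUrls) {
            controller.enqueue([url, `frame_${frameId++}`]);
        }
        controller.close();
    }
});

// predict_batch consumes the stream like any other source
for await (const result of model.predict_batch(frameStream)) {
    console.log(result);
}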

Please view the WebCodecs Example for a complete demonstration of using WebCodecs and ReadableStream in DeGirumJS.

Asynchronous Flow with Callbacks

Instead of using a for await...of loop to pull results, you can adopt an event-driven approach by providing a callback function when you load the model. When a callback is provided, predict_batch will not yield results. Instead, your callback function will be invoked automatically for each result as it arrives from the server.

This decouples the sending of frames from the receiving of results, which is ideal for real-time applications where you don't want your main loop to be blocked waiting for inference to complete.

When to use which pattern:

  • for await...of (Default): Best for situations where you want to handle results sequentially in a straightforward, linear manner.

  • callback (Event-Driven): Better suited to continuous, real-time streams. It keeps your main loop from blocking on results, so your application stays responsive even when inference lags behind frame capture.

Example: Using a Callback
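The snippet below is a sketch of this pattern, reusing the frameGenerator from Example 1. The exact option name for registering the callback at model load time may differ in your version of DeGirumJS; here it is assumed to be callback, and zoo / modelName stand in for your own model zoo connection and model name.

// Assumption: the callback is registered as a loadModel option named 'callback'
const model = await zoo.loadModel(modelName, {
    callback: (result) => {
        // Invoked for each result as it arrives from the server
        model.displayResultToCanvas(result, 'outputCanvas');
    }
});

// With a callback registered, iterating predict_batch only drives frame
// submission; results go to the callback instead of being yielded here.
for await (const _ of model.predict_batch(frameGenerator())) {
    // Intentionally empty
}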

Controlling Back-pressure with max_q_len

When you send frames for inference, they are placed in a queue. max_q_len (maximum queue length) is an option you can set during model loading that defines the maximum number of frames that can be "in flight" at once.

  • max_q_len (default: 10 for AI Server, 80 for Cloud Server): The size of the internal queues (infoQ and resultQ) that buffer frames and their results.

This parameter is crucial for managing system resources and preventing your application from sending data faster than the inference server can handle it. If the queue is full, your predict() or predict_batch() call will pause (asynchronously) until a space becomes available. This is a form of back-pressure that keeps the pipeline stable.

A smaller max_q_len can reduce memory usage but may lower throughput if the network or server has high latency. A larger value can improve throughput by ensuring the server is never idle, but it will consume more memory and increase end-to-end latency for any single frame.
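For example, a latency-sensitive live preview might use a small queue, while a bulk offline job might use a larger one. A sketch, assuming max_q_len is passed as a model loading option (zoo and modelName are placeholders):

// Assumption: max_q_len is set as a loadModel option
const model = await zoo.loadModel(modelName, {
    max_q_len: 4   // allow at most 4 frames in flight at once
});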
