Deep Dive: Inference Pipeline for Self-Play

In a previous post, I described the AlphaZero-style AI behind the Wall Game (wallgame.io) at the architectural level: Monte Carlo Tree Search (MCTS), the neural network, training via self-play, inference, and deployment.

This post zooms in on the inference piece.

We'll trace the path of a single inference request, from the moment an MCTS coroutine needs a position evaluated to the moment the GPU result arrives back. We'll touch on:

  • C++20 coroutines for non-blocking concurrency,
  • a lock-free queue for the CPU-GPU handoff,
  • pinned memory for fast PCIe transfers,
  • greedy batching,
  • pipelined GPU workers to hide synchronization stalls,
  • a sharded LRU cache to skip redundant evaluations, and
  • TensorRT for optimized GPU inference.

The inference pipeline is shared between training (self-play) and production (website users playing vs the AI). During self-play, requests from hundreds of parallel games are batched together; in production, requests from concurrent games with different users are also batched.

During self-play, this pipeline sustains over 1 billion inferences per hour on a single RTX 5090, hovering around 100% GPU utilization.

Credits to Thorben for implementing the AlphaZero-style AI for the Wall Game (repo), including the inference pipeline described in this post.

The fork in my monorepo adds the server integration so users can play online.

Prerequisites and Concepts

Coroutines

Our batching model relies on C++20 coroutines: functions that can suspend partway through and resume later, potentially on a different thread.

Unlike regular functions, whose stack frames live on a thread's call stack (and are therefore tied to that thread), a coroutine's state is allocated on the heap. This means no thread needs to "hold onto" a suspended coroutine - any thread can resume it later.

A coroutine looks mostly like a normal function, but it has access to special keywords:

  • co_await expr: suspends the coroutine until expr produces a result. The coroutine's state (local variables, where it left off) is saved to the heap, and the thread it was running on is freed to do other work. When the result is ready, the coroutine resumes right after the co_await.
  • co_return expr: finishes the coroutine, producing its final result (the coroutine equivalent of return).

We use Folly (Meta's C++ foundation library) for the coroutine runtime. Coroutines return folly::coro::Task<T>, which wraps the heap-allocated state and integrates with a CPUThreadPoolExecutor (thread pool): when a coroutine is ready to run or resume, it gets posted to the pool's work queue.

Coroutines in MCTS

When an AlphaZero-style AI needs to make a move, it iteratively grows and refines an MCTS tree: a tree of potential future positions with tentative evaluations.

The main loop consists of collecting MCTS samples: each sample involves traversing the tree down to a leaf (which corresponds to a yet unevaluated position), running a model inference to evaluate the position at that leaf, and backpropagating the result up the tree, refining the evaluation of the intermediate positions.

The tree traversals and updates are computationally cheap relative to the model inference (which, in our case, is a 20-block ResNet with about 2.3M parameters) so each sample can be encapsulated as a coroutine that suspends (via co_await) at the inference step.

Thanks to coroutines, a small pool of 10-20 CPU threads is enough to keep hundreds of MCTS samples in flight and produce enough inference requests to sustain ~100% utilization on a GPU such as an RTX 5090. Each suspended coroutine consumes only a small amount of heap memory.

Promises and Futures

We need a way for a coroutine to request GPU work and get the result back without blocking a thread. For this, we use Folly's promises and futures (folly::Promise<T>, folly::SemiFuture<T>).

A Promise and its corresponding SemiFuture are like a single-use queue for passing a result between a producer thread and a consumer thread:

  • the consumer calls co_await on the SemiFuture, suspending until the result arrives;
  • the producer writes a result to the Promise via setValue().

Under the hood, co_await registers a callback so that when setValue() is called, the coroutine's resumption is marked ready.

TensorRT

TensorRT is NVIDIA's inference optimizer. It takes a trained neural network (exported from PyTorch as ONNX) and compiles it into a GPU-optimized execution plan: fusing layers, choosing the fastest kernel implementations for the specific GPU, and optimizing memory layout. We use TensorRT rather than running PyTorch directly because inference latency matters - TensorRT is typically 2-5x faster for our model size.

CUDA is NVIDIA's GPU programming toolkit. We don't write custom CUDA kernels - TensorRT handles that - but we use CUDA's pinned memory to move data between CPU and GPU efficiently. More on that later.

Pipeline Overview

The inference pipeline has two sides: a thread pool running MCTS samples (10-20 threads, hundreds of concurrent coroutines), and one or two dedicated GPU worker threads that batch their requests and run TensorRT.

A lock-free MPMCQueue (multi-producer multi-consumer queue) bridges the two - all coroutines enqueue to the same queue, and all GPU workers drain from it.

A coroutine needing an evaluation drops a request in the queue and suspends; a GPU worker drains requests into a batch, runs the GPU, and fulfills each Promise, which resumes the waiting coroutines back on the thread pool.

Training vs production

In production, we currently serve 3 different models for different game variants. To keep things simple, each model runs in a separate process with its own thread pool, queue, and GPU worker. However, having a single thread pool for all models would use the threads more efficiently.

Both training and production construct the same four-layer policy stack described below (CachedPolicy → BatchedModelPolicy → BatchedModel → TensorRTModel). The differences are in how the pipeline is set up and what happens with the results:

  • GPU workers: 2 (pipelined) during training; 1 per model in production.
  • Concurrent games: hundreds of self-play games during training; based on active users in production.
  • What happens with results: during training, MCTS evaluations become training data (exported to CSV); in production, MCTS picks the best move and sends it to the game server.
  • Main thread: during training, it blocks until all self-play games finish, then exits so the training script (PyTorch) can pick up the exported data; in production, it runs an event loop, reading JSON game requests from stdin and dispatching them to the session manager.

Step 1: The Coroutine Suspends

Everything starts in the MCTS search tree. When a sample reaches an unexplored leaf node, it needs the neural network to evaluate the position:

// mcts.cpp
Evaluation eval = co_await m_evaluate(board, turn, previous_position);

The m_evaluate function is a coroutine-returning callable. In both training and production, it's a stack of four layers:

  1. CachedPolicy - Check evaluation LRU cache; on miss, forward to:
  2. BatchedModelPolicy - Convert board to tensor, then:
  3. BatchedModel - Queue the request, return SemiFuture (coroutine suspends here)
  4. TensorRTModel - Run the GPU kernel (GPU worker thread)

Cache Layer

MCTS frequently revisits the same positions. Parallel samples from the same game can reach the same leaf, and across hundreds of parallel self-play games, common openings repeat constantly. Without caching, these would all become redundant GPU evaluations.

The CachedPolicy uses sharded folly::EvictingCacheMap instances to avoid this. Each position is hashed to one of N shards (roughly matching the thread pool size), each with its own lock, so threads accessing different shards never contend.

// cached_policy.cpp
folly::coro::Task<Evaluation> CachedPolicy::operator()(
    Board const& board, Turn turn,
    std::optional<PreviousPosition> previous_position) {

    CacheEntryView ce_view{board, turn, previous_position};
    auto hash = folly::HeterogeneousAccessHash<CacheEntry>{}(ce_view);
    auto& lru = m_cache->lrus[hash % m_cache->lrus.size()];

    {
        auto locked_lru = lru.wlock();
        auto existing_entry = locked_lru->find(ce_view);
        if (existing_entry != locked_lru->end()) {
            ++m_cache->cache_hits;
            co_return existing_entry->second;  // HIT: skip the GPU entirely
        }
    }

    Evaluation eval = co_await m_cache->evaluate(board, turn, previous_position);
    lru.wlock()->insert(CacheEntry{board, turn, previous_position}, eval);
    ++m_cache->cache_misses;
    co_return eval;
}

On a cache miss, we fall through to BatchedModelPolicy.

Board-to-Tensor Conversion

BatchedModelPolicy converts the board into the neural network's input format - 9 float planes (described in the AI post) - and then calls BatchedModel::inference():

// batched_model_policy.cpp
folly::coro::Task<Evaluation> BatchedModelPolicy::operator()(
    Board const& board, Turn turn, std::optional<PreviousPosition> previous_position) {

    auto state = convert_to_model_input(board, turn, m_model->channels());
    auto inference_result = co_await m_model->inference(std::move(state));
    // ... (post-processing below)

The Handoff: Into the Queue

Here's where the coroutine actually suspends.

BatchedModel::inference() creates an InferenceTask, which bundles the input state with a promise, enqueues the task, and returns the SemiFuture for that promise. It is not a coroutine - it's a regular function that hands back the SemiFuture so the caller can wait.

// batched_model.hpp
struct InferenceTask {
    std::vector<float> state;
    folly::Promise<ModelOutput> output;
};

folly::SemiFuture<ModelOutput> BatchedModel::inference(std::vector<float> state) {
    InferenceTask task{std::move(state), {}};
    auto result = task.output.getSemiFuture();

    m_tasks.blockingWrite(std::move(task));

    return result;
}

The Promise moves into the m_tasks queue (heading for the GPU worker), while the SemiFuture stays with the coroutine.

Back in BatchedModelPolicy, the co_await on the returned SemiFuture suspends the coroutine and frees the thread.

Step 2: The GPU Worker Batches and Executes

The InferenceTask is now sitting in m_tasks, a lock-free queue (folly::MPMCQueue<InferenceTask>) bounded to 4,096 slots. The lock-free design means concurrent coroutines can enqueue requests without blocking each other. The bounded capacity provides natural backpressure: m_tasks.blockingWrite() only blocks if the queue is full, preventing unbounded memory growth.

On the other side of the queue, a dedicated GPU worker thread (or two for pipelining) runs in a loop. The idx parameter identifies the GPU worker; if there are two, each one owns a TensorRT execution context, but they share the same queue:

// batched_model.cpp
void BatchedModel::run_worker(std::size_t idx) {
    std::vector<folly::Promise<ModelOutput>> dequeued_promises;

    PinnedBuffer<float> states(m_models[idx]->batch_size() * m_models[idx]->state_size());
    PinnedBuffer<float> priors(m_models[idx]->batch_size() * m_models[idx]->prior_size());
    PinnedBuffer<float> values(m_models[idx]->batch_size());

    while (true) {
        for (int i = 0; i < m_models[idx]->batch_size(); ++i) {
            InferenceTask task;

            if (i == 0) {
                // Block until at least one arrives
                m_tasks.blockingRead(task);
            } else if (!m_tasks.read(task)) {
                // Non-blocking: take what's available
                break;
            }

            if (task.state.empty()) {
                // Sentinel: shutdown signal
                return;
            }

            std::ranges::copy(task.state, states.data() + m_models[idx]->state_size() * i);
            dequeued_promises.push_back(std::move(task.output));
        }

        m_models[idx]->inference(states, {priors, values}); // GPU kernel

        for (std::size_t i = 0; i < dequeued_promises.size(); ++i) {
            std::vector<float> prior{
                priors.data() + m_models[idx]->prior_size() * i,
                priors.data() + m_models[idx]->prior_size() * (i + 1)};

            dequeued_promises[i].setValue(ModelOutput{std::move(prior), values[i]});
        }

        dequeued_promises.clear();
    }
}

Let's break it down.

Greedy Batching

The batch size is the maximum number of positions sent to the GPU at once. If it is too low, the GPU is underutilized - each batch pays the same fixed overhead (PCIe transfers, kernel launch) regardless of size, and the GPU's cores don't have enough work to stay busy. In our case, a batch size of 256 is enough to reach near-full GPU utilization.

The GPU worker thread blocks until at least one task can be read from the queue (blockingRead). Once one arrives, the worker greedily grabs more via non-blocking read() calls, up to 256. The actual batch size is therefore dynamic: if 200 requests are queued, it runs a batch of 200; if only 3 are available, it runs a batch of 3. This favors latency over peak throughput: the worker never waits around to fill a batch, even though throughput may suffer a bit.

Pinned Memory

The PinnedBuffer<float> arrays are allocated with cudaMallocHost, which pins the memory in physical RAM so the OS can't page it out to disk. Normal (pageable) memory requires CUDA to first copy data to an internal staging buffer before transferring it to the GPU via DMA (Direct Memory Access) over the PCIe bus - an extra copy on every batch. Pinned memory eliminates this step. The buffers are allocated once when the worker starts and reused across batches.

// cuda_wrappers.hpp
PinnedBuffer(std::size_t size) : m_size{size} {
    cuda_check(cudaMallocHost(&m_data, size * sizeof(T)));
}

The GPU Kernel

m_models[idx]->inference(states, {priors, values}) calls through to the TensorRT model:

// tensorrt_model.cpp
void TensorRTModel::inference(std::span<float> states, Output const& out) {
    m_states.to_device(states, m_stream);   // Async host → device transfer
    m_context->enqueueV3(m_stream.get());   // Queue the GPU kernel
    m_priors.to_host(out.priors, m_stream); // Async device → host transfer
    m_values.to_host(out.values, m_stream); // Async device → host transfer
    m_stream.synchronize();                 // Wait for everything to complete
}

All four operations are enqueued on the same CUDA stream - a sequence of GPU operations that execute in order. The CPU-side calls return almost instantly (they just queue work). Only synchronize() blocks the worker thread, waiting for the entire upload-compute-download sequence to finish. Meanwhile, the MCTS coroutines continue running on the CPU thread pool - only the GPU worker thread is blocked.

Pipelining

While a GPU worker is blocked on synchronize(), it can't drain the queue or prepare the next batch. During self-play, where there are always more positions to evaluate, this matters.

The fix is to run 2 GPU worker threads, each with its own TensorRT execution context (own CUDA stream, own I/O buffers, but sharing the same model weights in GPU memory). Both workers drain from the same queue. While worker A is blocked waiting for the GPU to finish batch N, worker B can be copying batch N+1's data into its pinned buffers and uploading it to the GPU. The CUDA driver interleaves operations from the two streams.

In production, a single worker per model is sufficient - live games generate far fewer inference requests than self-play, so the idle time during synchronize() doesn't matter.

Step 3: The Coroutine Resumes

After the GPU returns, the worker fulfills each promise with its corresponding slice of the output batch:

// batched_model.cpp
dequeued_promises[i].setValue(ModelOutput{std::move(prior), values[i]});

The setValue() call fires the callback that co_await registered (as described in Promises and Futures). This posts the coroutine's resumption to the thread pool's work queue.

Post-Processing

Back in BatchedModelPolicy, the coroutine resumes with the raw model output and converts it into something MCTS can use: the model outputs probabilities for all moves, legal and illegal, so BatchedModelPolicy filters them down to the legal ones and renormalizes the priors so they sum to 1.

The result flows back through the cache layer (which stores it for future lookups), and finally arrives at create_tree_node in mcts.cpp, where the new tree node is created with the model's evaluation and priors.

Parallel Samples

Building out an MCTS tree from a given position requires many samples (e.g., 1000), each of which produces one inference request.

To fill GPU batches, we need many concurrent requests: at any given moment, many coroutines should be suspended waiting for GPU results. These coroutines can come from parallel games, or from within a single game, where they collaborate to grow the MCTS tree faster.

The MCTS tree data structure is designed to support multiple threads updating it at the same time, with node-level concurrency control. However, having many parallel samples within a game is not ideal because they are not independent - each one refines the tree, so later samples benefit from earlier ones. In addition, too many in parallel leads to redundant inferences exploring the same leaves (though we use a mechanism known as 'virtual loss' to mitigate this).

In the extreme, if all 1000 samples start at the same time, none of them can reach a node deeper than depth 1, because a node at depth 2 can only be reached after the sample for its parent has finished.

During self-play, we can spawn many games in parallel, so we keep per-game parallelism low. In production, parallel samples let us grow a single MCTS tree faster in the same wall-clock time, making the AI stronger even if individual samples sometimes interfere.

The samples to evaluate the current position are launched like this:

// mcts.cpp

    // Create `num_samples` coroutines (lazily - nothing runs until collectAllWindowed)
    auto sample_tasks = views::iota(0, num_samples) |
        views::transform([&](int) { return single_sample().scheduleOn(executor); });
    co_await folly::coro::collectAllWindowed(sample_tasks, max_parallelism);

collectAllWindowed runs all 1000 coroutines, but only allows max_parallelism (e.g., 32) to be active at a time so they don't step on each other. Note that this limit is independent of how many threads are available in the thread pool.

Final Thoughts

Inference as a topic is only becoming more relevant: it's at inference time that AI companies actually make money.

Understanding the inference pipeline for our ResNet-style model is a great starting point. It introduces many important concepts and optimizations: batching, pinned memory, pipelining, etc. - see the feature list at the top of the post.

However, LLMs are the most prominent model type now, and LLM inference is another beast:

  • Our model has only 2.3M parameters, while LLMs can have many billions; a model may not even fit on a single GPU;
  • LLM output lengths are unpredictable, so batching doesn't work as neatly (you need continuous batching instead);
  • You have to deal with the KV cache memory bottleneck.

Training the Wall Game AI with this inference pipeline on a consumer GPU was a total success: it became superhuman in a couple of days.

In contrast, I found that trying to run local LLMs on that very same hardware is painful. The models that fit in it are underwhelming, and inference is slow.

The experiment left me with a deep appreciation for the speed and volume at which the frontier AI labs are capable of running inference with much larger models. Some really interesting engineering work must be happening there.


Want to leave a comment? You can post under the LinkedIn post or the X post.