GSoC 2026 · Week 2

Streaming ADVI on 122 GB, and the bug that didn't crash

Yicheng Yang · June 2026 · ← All GSoC notes

This week the goal is to let minibatch ADVI stream data from disk, so memory usage does not grow with the number of rows. This matters because pm.Minibatch (at least in the PyMC version I'm building against) still expects the full array to be in memory — which is exactly the case where minibatching is supposed to help but often can't.

Testing it on 122 GB of order-book data

Once the interface was mostly working, I wanted to test it on something real. I happened to have a large Kalshi L2 order book dataset I'd collected for a separate market-microstructure project, and it was a good stress test because it does not fit comfortably in memory. The raw data is 122.67 GB — about 57 days from 2026-04-07 to 2026-06-04, across 2,685 files. I folded the raw book_delta events back into book states and turned each one into a small feature row (mid, spread, imbalance, and the next-move label), which gave me about 121.8 million rows. Those rows are tiny — about 17 bytes each — so the training table itself is only a couple of GB, even though the raw data is huge.

For the test model, I used a simple logistic regression:

logit P(next mid-price move is up)
= b0 + b1·mid + b2·spread + b3·imbalance

and fit it with ADVI using minibatches of size 4096.

The posteriors drifted — and nothing crashed

I expected the streaming version and the in-RAM baseline to give basically the same posterior. They did not. As I increased N, the posterior kept drifting farther from the in-RAM baseline: max|Δ| was about 0.045 at 1M rows, but grew to 1.65 by 26M rows.

The worst part was that nothing crashed. There was no error, no warning, and no obvious sign that anything was wrong. It just silently gave me the wrong posterior, which is much more dangerous than a loud failure.

It was the shuffle story

After digging into it, the issue turned out not to be the math in ADVI or a bug in reading files. It was the shuffle story. The order book data is naturally sorted by time. A bounded shuffle buffer can only do local, block-like shuffling: rows are random inside the window, but the window itself still moves forward through time. So the minibatches stop being a fair sample of the whole dataset — each one is really just a single slice of time. The gradients become biased, and ADVI converges to a different posterior.

A simple way to think about it is a deck of cards sorted by suit. If you only shuffle the top few cards before dealing, the hands will still come out in big suit clusters.

The important design lesson is that a bounded buffer cannot magically fix globally ordered data unless we do an O(N) full-data shuffle. But if we force that inside StreamingDataset, it is no longer really single-pass streaming — it becomes an offline step you run once up front. So this is not really a bug; it is an API contract.

The fix: pre-shuffle once

The fix was to pre-shuffle the whole table once on disk. Since the rows are so small, doing it offline was cheap and I only had to do it once. After that, sequential streaming behaved the way I expected. At 20M rows, max|Δ| dropped from 0.69 to 0.027, and over the full run it stayed under 0.07. At 122M rows, the difference was only 0.022.

Bar chart of the four posterior coefficient means with standard-deviation error bars, in-RAM minibatch versus StreamingDataset, at N = 121,818,328; the bars nearly coincide with max absolute difference 0.022.
After the pre-shuffle, the streaming and in-RAM posteriors agree at the full 121.8M rows (max|Δ| = 0.022); the small residual is just the difference between two independent stochastic ADVI runs, not a bias.

Why it matters: memory stays flat

This also showed the main benefit. Streaming stayed almost flat — around 0.5 GB, still only 0.57 GB at 122M rows — while the in-RAM baseline climbed straight to 9.1 GB (it keeps every row in memory as float64, so it is much bigger than the 17-byte rows on disk), about 16x higher. From that measured slope, a 24 GiB machine (about 26 GB) would run out of memory around 372M rows.

Log-log plot of peak resident memory versus dataset size N from 0.5M to 122M rows; the in-RAM minibatch baseline rises linearly to 9.1 GB and extrapolates to an out-of-memory point near 372M rows, while StreamingDataset stays flat near 0.5 GB.
Peak resident memory vs N (9 measured points). In-RAM minibatch rises linearly to 9.1 GB at 122M rows; the dotted extrapolation hits the machine's memory ceiling near 372M rows. StreamingDataset stays flat at ~0.5 GB.

What went into the PR

For the PR, I added a bounded shuffle_buffer option for normal streaming use, but the docstring now makes the contract explicit: if the data has a meaningful global order, the user needs to pre-shuffle it first. I also added total_size="auto", so finite datasets can automatically infer N from metadata, for example from Parquet files, while truly infinite streams can still provide it manually.

from pymc.variational.streaming import (
    StreamingDataset,
    shuffle_buffer,
    parquet_source,
)

# Pre-shuffle once on disk, then stream shuffled minibatches:
ds = StreamingDataset(
    shuffle_buffer(
        parquet_source("shuffled/"),
        buffer_size=1_000_000,
        batch_size=4096,
        seed=0,
    ),
    batch_size=4096,
    sample_shape=(4,),     # 3 features + 1 observed column
    total_size="auto",     # infer N from Parquet metadata
)

with pm.Model():
    b = pm.Normal("b", 0.0, 3.0, shape=4)
    buf = ds.as_tensor()
    logit = (
        b[0]
        + b[1] * buf[:, 0]
        + b[2] * buf[:, 1]
        + b[3] * buf[:, 2]
    )
    pm.Bernoulli(
        "up",
        logit_p=logit,
        observed=buf[:, 3],
        total_size=ds.total_size,
    )
    approx = pm.fit(20_000, method="advi", callbacks=[ds.fit_callback()])

The main takeaway for me is that streaming VI needs a real shuffle story. A buffer alone is not enough when the data is ordered.

Next, I'm going to finalize the API with my mentors, Rob and Chris, and then open the upstream PR.


Links. GSoC project page · PyMC source · ADVI paper (Kucukelbir et al. 2017) · Mentors @zaxtax · @fonnesbeck