GSoC 2026 · Week 1

Where streaming hooks into PyMC's ELBO machinery

Yicheng Yang · May 2026

Variational inference works great until your data doesn't fit in RAM. My summer 2026 GSoC project with NumFOCUS / PyMC aims to fix that by making ADVI and Pathfinder work on datasets larger than memory. My mentors are Rob Zinkov (@zaxtax) and Chris Fonnesbeck (@fonnesbeck).

Week 1 was mostly orientation, but I came out of it with a working prototype and a design draft for the upcoming sync with Rob.

Where the work splits

Minibatch variational inference relies on a simple identity: compute the expected log-likelihood over a random subset of size |B| out of N points, multiply by N/|B|, and you have an unbiased estimate of the full-data log-likelihood term in the ELBO (the prior and entropy terms are computed unscaled). That's how VI scales.
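To make the identity concrete, here is a small numerical check using only the Python standard library. It is a sketch under my own assumptions, not PyMC code: the data are standard-normal draws, the likelihood is a unit-variance Normal evaluated at a fixed mean mu (so the "expectation over q" collapses to a single point), and names like minibatch_estimate are made up for illustration.

```python
import math
import random


def normal_logpdf(x, mu, sigma=1.0):
    """Log-density of Normal(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


rng = random.Random(0)
N, batch_size, mu = 2_000, 50, 0.3  # illustrative sizes, not the ones from my test
data = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Full-data log-likelihood: the quantity the minibatch estimator targets.
full_loglik = sum(normal_logpdf(x, mu) for x in data)


def minibatch_estimate():
    """Log-likelihood of a uniformly sampled batch, rescaled by N / |B|."""
    batch = rng.sample(data, batch_size)  # sampling without replacement
    return (N / batch_size) * sum(normal_logpdf(x, mu) for x in batch)


# A single estimate is noisy; the average over many batches should approach
# the full-data value, which is what "unbiased" means here.
n_draws = 2_000
estimates = [minibatch_estimate() for _ in range(n_draws)]
mean_estimate = sum(estimates) / n_draws
rel_error = abs(mean_estimate - full_loglik) / abs(full_loglik)

print(f"full: {full_loglik:.1f}  mean of estimates: {mean_estimate:.1f}")
```

Any single rescaled batch can be far off; only the average lines up with the full-data term. That is enough for stochastic gradient optimization of the ELBO, which needs unbiased gradients rather than exact ones.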

In PyMC 6, the two halves of that identity (sampling and scaling) live in two different files. Random row sampling is in pm.Minibatch at pymc/data.py:121. The N/|B| scaling factor is applied inside MinibatchRandomVariable at pymc/variational/minibatch_rv.py:102-106, when the log-prob is registered:

@_logprob.register(MinibatchRandomVariable)
def minibatch_rv_logprob(op, values, *inputs, **kwargs):
    [value] = values
    rv, *total_size = inputs
    return logp(rv, value, **kwargs) * get_scaling(total_size, value.shape)

That decoupling is why a 12-week project is actually tractable: the N/|B| scaling math is a single line that the streaming side can reuse. The job of StreamingDataset, the class I'm prototyping, is to keep a pytensor.shared buffer fresh between iterations; the current prototype wires this together with pm.CustomDist for the rescaled log-prob.

What I'm sketching

StreamingDataset wraps a pytensor.shared buffer of fixed shape (batch_size, *features). An advance() method pulls the next batch from any Python iterable and writes it into the shared buffer. A fit_callback() lets pm.fit call this between gradient steps, so the buffer is always fresh.
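The shape of that design can be sketched without PyMC installed. Everything below is illustrative rather than the prototype itself: SharedStandIn is a hypothetical stand-in that mimics only the get_value / set_value interface of a pytensor shared variable, StreamingDatasetSketch and batch_source are made-up names, and the callback signature (approx, losses, i) is the one I understand pm.fit to use for its callbacks.

```python
import numpy as np


class SharedStandIn:
    """Hypothetical stand-in for a pytensor shared variable (get/set only)."""

    def __init__(self, value):
        self._value = np.array(value, dtype=float)

    def get_value(self):
        return self._value.copy()

    def set_value(self, value):
        self._value = np.array(value, dtype=float)


class StreamingDatasetSketch:
    """Keeps a fixed-shape buffer filled with the latest batch from an iterable."""

    def __init__(self, batches, batch_size, feature_shape):
        self._batches = iter(batches)
        self.shape = (batch_size, *feature_shape)
        self.buffer = SharedStandIn(np.zeros(self.shape))
        self.n_seen = 0  # number of batches written so far

    def advance(self):
        """Pull the next batch and overwrite the buffer in place."""
        batch = np.asarray(next(self._batches), dtype=float)
        if batch.shape != self.shape:
            # The buffer shape is fixed, so ragged batches are rejected.
            raise ValueError(f"expected shape {self.shape}, got {batch.shape}")
        self.buffer.set_value(batch)
        self.n_seen += 1

    def fit_callback(self):
        """Return a callable to be invoked between gradient steps."""

        def _callback(approx, losses, i):
            self.advance()

        return _callback


def batch_source(n_batches, batch_size, n_features, seed=0):
    """Toy generator standing in for a file reader or database cursor."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.normal(size=(batch_size, n_features))


stream = StreamingDatasetSketch(batch_source(3, 4, 2), batch_size=4, feature_shape=(2,))
callback = stream.fit_callback()
callback(None, [], 0)  # simulate one optimizer step: the buffer now holds batch 0
```

Two things this sketch leaves open: what to do when the iterable runs out (here next() simply raises StopIteration), and how to handle a short final batch (here it is rejected with a ValueError).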

To check that the algorithm is preserved, I ran a synthetic test on N = 50,000 data points. Streaming ADVI with batch_size=256 converges to within 0.003 of in-memory ADVI's posterior mean. That's well inside ADVI's own optimization noise. The streaming version is the same algorithm. Only the data path changes.

Also this week: first pymc-examples PR

I opened my first pymc-examples PR (#882) this week: a PyMC 6 / ArviZ 1 compatibility update for variational_api_quickstart.ipynb, the official VI tutorial notebook. The compat side was four small fixes: pm.callbacks has moved, az.plot_posterior has been removed, Approximation.sample() now needs a model context, and total_size now wants an int instead of a tuple. On top of that, I migrated the plotting throughout the notebook to the new ArviZ 1 API: the dict form az.plot_dist({"NUTS": idata, "ADVI": idata}) for multi-model comparisons, az.plot_trace_dist for trace plots, az.convert_to_dataset for wrapping raw numpy arrays, and pc.add_legend("model") for figure legends.

What I got out of it: ArviZ 1 has a way richer visualization grammar than I'd realized, which I'll be using later when reporting streaming ADVI results. Doing a small contribution like this during the design phase also turned out to be a cheap way to learn the contribution flow before any bigger work lands.


Links. GSoC project page · PyMC source · PR #882 · ADVI paper (Kucukelbir et al. 2017) · Mentors @zaxtax · @fonnesbeck