Variational inference works great until your data doesn't fit in RAM. My summer 2026 GSoC project with NumFOCUS / PyMC is about fixing that by making ADVI and Pathfinder work on datasets larger than memory. Mentors are Rob Zinkov (@zaxtax) and Chris Fonnesbeck (@fonnesbeck).
Week 1 was mostly orientation, but I came out of it with a working prototype and a design draft for the upcoming sync with Rob.
Where the work splits
Minibatch variational inference relies on a simple identity: compute
the expected log-likelihood over a random subset of size
|B| out of N points, multiply by
N/|B|, and you have an unbiased estimate of the
full-data log-likelihood term in the ELBO (the prior and entropy
terms are computed unscaled). That's how VI scales.
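The identity is easy to check numerically. The sketch below is plain NumPy, not PyMC code: it assumes a Gaussian likelihood with fixed parameters, and the helper names normal_loglik and minibatch_loglik_estimate are made up for the illustration. Averaged over many random batches, the rescaled minibatch sum should land close to the full-data sum.

```python
import numpy as np


def normal_loglik(x, mu=0.0, sigma=1.0):
    """Per-point Gaussian log-density (illustrative likelihood, fixed parameters)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


def minibatch_loglik_estimate(data, batch_size, rng, mu=0.0, sigma=1.0):
    """Sum the log-likelihood over a random batch and rescale by N / |B|."""
    n = data.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)
    return (n / batch_size) * normal_loglik(data[idx], mu, sigma).sum()


rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=50_000)

# Full-data log-likelihood: the quantity the minibatch estimate targets.
full = normal_loglik(data).sum()

# Many independent minibatch estimates; their average should approach `full`.
estimates = np.array(
    [minibatch_loglik_estimate(data, 256, rng) for _ in range(500)]
)
rel_error = abs(estimates.mean() - full) / abs(full)

print(f"full-data log-likelihood: {full:.1f}")
print(f"mean of rescaled minibatch estimates: {estimates.mean():.1f}")
print(f"relative error of the mean: {rel_error:.5f}")
```

Any single estimate is noisy; it is only the average over batches that matches the full-data term, which is all stochastic gradient optimization of the ELBO needs.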
In PyMC 6, the sampling and the scaling live in two different files.
Random row sampling is in pm.Minibatch at
pymc/data.py:121. The N/|B| scaling factor
is applied inside MinibatchRandomVariable at
pymc/variational/minibatch_rv.py:102-106, when the
log-prob is registered:
@_logprob.register(MinibatchRandomVariable)
def minibatch_rv_logprob(op, values, *inputs, **kwargs):
    [value] = values
    rv, *total_size = inputs
    return logp(rv, value, **kwargs) * get_scaling(total_size, value.shape)
That decoupling is why a 12-week project is actually tractable: the
N/|B| scaling math is a single line that the streaming
side can reuse. The streaming side, a StreamingDataset class
described below, only has to keep a pytensor.shared buffer
fresh between iterations; the current prototype wires this
together with pm.CustomDist for the rescaled log-prob.
What I'm sketching
StreamingDataset wraps a pytensor.shared
buffer of fixed shape (batch_size, *features). An
advance() method pulls the next batch from any Python
iterable and writes it into the shared buffer. A
fit_callback() lets pm.fit call this between
gradient steps, so the buffer is always fresh.
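A minimal sketch of that design follows. It is not the prototype's code. The FakeShared class is a hypothetical stand-in that exposes only get_value and set_value, the two methods of a pytensor.shared variable the sketch relies on, so the example runs without PyTensor installed. Restarting the iterable once it is exhausted is an assumption, as is the (approx, losses, i) callback signature, which is how the built-in PyMC variational callbacks are invoked.

```python
import numpy as np


class FakeShared:
    """Hypothetical stand-in for a pytensor.shared variable (get_value/set_value only)."""

    def __init__(self, value):
        self._value = np.array(value)

    def get_value(self):
        return self._value.copy()

    def set_value(self, new_value):
        self._value = np.array(new_value)


class StreamingDataset:
    """Sketch: keep a fixed-shape buffer filled with the most recent batch."""

    def __init__(self, batches, buffer):
        self._batches = batches  # any iterable that yields array-like batches
        self._iterator = iter(batches)
        self.buffer = buffer
        self._shape = buffer.get_value().shape  # (batch_size, *features)

    def advance(self):
        """Write the next batch into the buffer."""
        try:
            batch = next(self._iterator)
        except StopIteration:
            # Assumption: start a new pass over the data. This only works for
            # re-iterable sources such as lists, not one-shot generators.
            self._iterator = iter(self._batches)
            batch = next(self._iterator)
        batch = np.asarray(batch)
        if batch.shape != self._shape:
            raise ValueError(f"expected shape {self._shape}, got {batch.shape}")
        self.buffer.set_value(batch)

    def fit_callback(self):
        """Return a callable with the assumed (approx, losses, i) signature."""

        def callback(approx, losses, i):
            self.advance()

        return callback


# Three batches of shape (4, 2), each filled with its own index.
batches = [np.full((4, 2), k, dtype="float64") for k in range(3)]
dataset = StreamingDataset(batches, FakeShared(np.zeros((4, 2))))
callback = dataset.fit_callback()

seen = []
for step in range(4):
    callback(None, [], step)  # stands in for one optimizer step
    seen.append(float(dataset.buffer.get_value()[0, 0]))

print(seen)  # the fourth step wraps around to the first batch
```

With real PyTensor, the buffer would be a pytensor.shared variable instead of FakeShared, and the callback would be passed in the callbacks list of pm.fit.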
To check that the algorithm is preserved, I ran a synthetic test on N = 50,000 data points. Streaming ADVI with batch_size=256 converges to within 0.003 of in-memory ADVI's posterior mean. That's well inside ADVI's own optimization noise. The streaming version is the same algorithm. Only the data path changes.
Also this week: first pymc-examples PR
I opened my first pymc-examples PR (#882) this week, a
PyMC 6 / ArviZ 1 compatibility update for
variational_api_quickstart.ipynb, the official VI
tutorial notebook. The compat side was four small fixes:
pm.callbacks has moved,
az.plot_posterior has been removed,
Approximation.sample() now needs a model context, and
total_size wants an int instead of a tuple. On top of
that, I migrated the plotting throughout the notebook to the new
ArviZ 1 API: az.plot_dist({"NUTS": idata, "ADVI": idata})
dict form for multi-model comparisons,
az.plot_trace_dist for trace plots,
az.convert_to_dataset for wrapping raw numpy arrays, and
pc.add_legend("model") for figure legends.
What I got out of it: ArviZ 1 has a way richer visualization grammar than I'd realized, which I'll be using later when reporting streaming ADVI results. Doing a small contribution like this during the design phase also turned out to be a cheap way to learn the contribution flow before any bigger work lands.
Links. GSoC project page · PyMC source · PR #882 · ADVI paper (Kucukelbir et al. 2017) · Mentors @zaxtax · @fonnesbeck