GSoC 2026

PyMC Streaming Variational Inference

A 12-week Google Summer of Code project with NumFOCUS / PyMC that extends PyMC's variational inference (ADVI and Pathfinder) to datasets that don't fit in memory.

Project · Active
Streaming Variational Inference for Large Datasets
Mentors. Rob Zinkov (@zaxtax, Berlin) and Chris Fonnesbeck (@fonnesbeck, Nashville).

Problem. PyMC's ADVI and Pathfinder assume the full dataset fits in memory. Financial tick data, sensor streams, and large panel datasets routinely exceed RAM and break this assumption.

Approach. A DataLoader wraps an arbitrary Python iterable and feeds minibatches into PyMC's existing ELBO scaling machinery through a pm.Data placeholder. A callback-free Trainer drives the loop. The project also adds streaming-compatible ADVI and Pathfinder loops with online convergence monitoring.
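To make the loader idea concrete, here is a minimal sketch of a minibatch wrapper around a lazy iterable. The class body, the `rows` generator, and the batch size are all illustrative assumptions, not the project's actual code. In the real design each batch would presumably be written into the pm.Data placeholder (for example with `pm.set_data`) before an optimizer step; that part is left out so the sketch runs with the standard library alone.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


class DataLoader:
    """Hypothetical minibatch loader; a stand-in, not the project's class."""

    def __init__(self, source: Iterable[Any], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.source = source
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[Any]]:
        it = iter(self.source)
        while True:
            # Pull at most batch_size rows; only one batch is held at a time.
            batch = list(islice(it, self.batch_size))
            if not batch:
                return
            yield batch


def rows():
    # Stand-in for a file reader or socket: rows are produced lazily.
    for i in range(10):
        yield (i, i * i)


# Collected into a list here only so the result can be inspected.
batches = list(DataLoader(rows(), batch_size=4))
print([len(b) for b in batches])
```

The last batch is allowed to be short, so a training loop built on this would need to scale each batch by its own length rather than by a fixed batch size.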

Timeline. May 5 – August 25, 2026. Midterm evaluation July 6–10, final submission August 17–24.
NumFOCUS / PyMC · GSoC project page · PyMC repo
Why it matters
The result. ADVI and Pathfinder currently need every row resident in memory, so once a dataset outgrows RAM, Bayesian inference with them is off the table. This project removes that limit. On the public Criteo 1 TB benchmark the streaming posterior matches an ordinary in-memory fit coefficient for coefficient, while peak memory stays around 1.5 GB. Materializing all 4.4 billion rows in memory would take roughly 490 GB, so the in-memory baseline cannot even start. The gain isn't speed; it's inference on data you otherwise couldn't touch.
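As a quick check on those figures, the arithmetic below derives the implied bytes per row and the ratio between full materialization and peak streaming memory. It uses only the numbers quoted above and assumes decimal gigabytes (1 GB = 10^9 bytes); the variable names are illustrative.

```python
n_rows = 4.4e9        # rows in the benchmark, as quoted above
full_bytes = 490e9    # estimated in-memory size (decimal GB assumed)
peak_bytes = 1.5e9    # reported peak memory while streaming

bytes_per_row = full_bytes / n_rows      # roughly 111 bytes per row
memory_ratio = full_bytes / peak_bytes   # roughly 327x

print(f"{bytes_per_row:.1f} bytes/row, {memory_ratio:.0f}x less memory")
```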

The design. The abstraction this needed turned out to be one people already know, the Dataset, DataLoader, and Trainer pattern from the PyTorch ecosystem, so the classes keep those names. total_size is just len(loader), and there is nothing new to learn.
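Why total_size matters can be shown with a small worked example: scaling a minibatch's summed log-likelihood by total_size / batch_size gives an estimate of the full-data log-likelihood, and averaging over equal-sized batches that partition the data recovers it exactly. This is a sketch assuming total_size counts rows and a fixed Normal likelihood; the data, parameters, and helper name are made up for illustration and none of it calls PyMC.

```python
import math
import random


def normal_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(1000)]
total_size = len(data)  # assumption: total_size is the number of rows
batch_size = 100
mu, sigma = 0.5, 2.0    # arbitrary parameter values to evaluate at

# Full-data log-likelihood, which needs every row at once.
full_logp = sum(normal_logpdf(x, mu, sigma) for x in data)

# Minibatch estimates: each batch sum is scaled up to the full dataset.
scaled = []
for start in range(0, total_size, batch_size):
    batch = data[start:start + batch_size]
    batch_logp = sum(normal_logpdf(x, mu, sigma) for x in batch)
    scaled.append(total_size / len(batch) * batch_logp)

mean_scaled = sum(scaled) / len(scaled)
print(math.isclose(mean_scaled, full_logp, rel_tol=1e-9))
```

Any single scaled batch is a noisy estimate; it is the average over batches that matches, which is the property stochastic optimization of the ELBO relies on.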

Short, hand-written notes kept during the program: a working notebook for the design choices, dead ends, and small wins. Code lands at pymc-devs/pymc; this page collects the prose.