GSoC — Yicheng Yang

Project · Active

Streaming Variational Inference for Large Datasets

Mentors. Chris Fonnesbeck (@fonnesbeck, Nashville) and Rob Zinkov (@zaxtax, Berlin).

Problem. PyMC's ADVI and Pathfinder assume the full dataset fits in memory. Financial tick data, sensor streams, and large panel datasets routinely exceed RAM and break this assumption.

Approach. A DataLoader wraps an arbitrary Python iterable and feeds minibatches through a pm.Data placeholder into PyMC's existing ELBO scaling machinery. A callback-free Trainer drives the loop. Streaming-compatible ADVI and Pathfinder loops, each with online convergence monitoring, complete the design.
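The ELBO scaling this approach relies on can be illustrated without PyMC. PyMC's total_size argument rescales a minibatch log-likelihood by total_size / batch_size so that it estimates the full-data term. The sketch below reproduces that rescaling in plain Python over a lazy generator. The names batched, scaled_batch_loglik, and row_stream are hypothetical and not part of the project's API, and the Normal likelihood is chosen only for illustration.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, List


def batched(rows: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Yield consecutive minibatches from any iterable without materializing it."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def normal_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log-density of Normal(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def scaled_batch_loglik(batch: List[float], mu: float, sigma: float, total_size: int) -> float:
    """Minibatch log-likelihood rescaled by total_size / len(batch)."""
    scale = total_size / len(batch)
    return scale * sum(normal_logpdf(x, mu, sigma) for x in batch)


def row_stream(n: int) -> Iterator[float]:
    """Hypothetical stand-in for rows read lazily from disk or a socket."""
    for i in range(n):
        yield math.sin(i)


total_size = 1000
estimates = [
    scaled_batch_loglik(b, mu=0.0, sigma=1.0, total_size=total_size)
    for b in batched(row_stream(total_size), batch_size=100)
]

# With equal-sized batches, the average of the scaled estimates recovers
# the full-data log-likelihood exactly (up to floating-point error).
mean_estimate = sum(estimates) / len(estimates)
full_loglik = sum(normal_logpdf(x, 0.0, 1.0) for x in row_stream(total_size))
```

Only one batch of 100 rows is held at a time, yet each scaled estimate targets the same quantity as the full pass. That is the property a streaming ELBO needs.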

Timeline. May 5 – August 25, 2026. Midterm evaluation July 6–10, final submission August 17–24.

NumFOCUS / PyMC · GSoC project page · PyMC repo

Why it matters

The result. ADVI and Pathfinder currently need every row resident, so once a dataset outgrows RAM, Bayesian inference is simply off the table. Streaming removes that limit. On the public Criteo 1 TB benchmark the streaming posterior matches an ordinary in-memory fit coefficient for coefficient, while peak memory stays around 1.5 GB. Materializing all 4.4 billion rows in memory would take roughly 490 GB, so the in-memory baseline cannot even start. The gain isn't speed; it's inference on data you otherwise couldn't touch.
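As a back-of-the-envelope check on those figures, the arithmetic below derives the implied per-row footprint and the memory reduction. It assumes decimal gigabytes and treats the 490 GB figure as raw row storage. Both are assumptions, and the variable names are illustrative only.

```python
# Figures quoted in the text; decimal units are an assumption.
n_rows = 4.4e9          # rows in the benchmark
full_bytes = 490e9      # estimated footprint if fully materialized
peak_bytes = 1.5e9      # observed streaming peak

bytes_per_row = full_bytes / n_rows   # roughly 111 bytes per row
reduction = full_bytes / peak_bytes   # roughly 327x less memory

# Upper bound on rows that could fit in the streaming budget at once;
# the real number is lower because the peak also holds model state.
rows_in_budget = peak_bytes / bytes_per_row

print(f"{bytes_per_row:.0f} bytes/row, {reduction:.0f}x reduction, "
      f"at most {rows_in_budget / 1e6:.1f}M rows resident")
```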

The design. The abstraction this needed turned out to be one people already know from the PyTorch ecosystem: Dataset, DataLoader, and Trainer. The classes keep those names, total_size is just len(loader), and there is nothing new to learn.
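A minimal sketch of that three-part shape follows, in plain Python. The class names SketchDataset, SketchLoader, and SketchTrainer are hypothetical stand-ins, not the project's actual classes. One assumption deserves emphasis: len(loader) here returns the number of rows, as the total_size remark suggests, whereas PyTorch's own len(DataLoader) counts batches.

```python
from typing import Callable, Iterator, List


class SketchDataset:
    """Hypothetical map-style dataset: row i is produced on demand, not stored."""

    def __init__(self, n_rows: int) -> None:
        self.n_rows = n_rows

    def __len__(self) -> int:
        return self.n_rows

    def __getitem__(self, i: int) -> float:
        if not 0 <= i < self.n_rows:
            raise IndexError(i)
        return float(i % 7)  # stand-in for reading one row from disk


class SketchLoader:
    """Hypothetical loader: iterates minibatches; len() reports total rows."""

    def __init__(self, dataset: SketchDataset, batch_size: int) -> None:
        self.dataset = dataset
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Assumption: rows rather than batches, so it can serve as total_size.
        return len(self.dataset)

    def __iter__(self) -> Iterator[List[float]]:
        n = len(self.dataset)
        for start in range(0, n, self.batch_size):
            stop = min(start + self.batch_size, n)
            yield [self.dataset[i] for i in range(start, stop)]


class SketchTrainer:
    """Hypothetical callback-free trainer: a plain loop over epochs and batches."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs

    def fit(self, step: Callable[[List[float], int], None], loader: SketchLoader) -> int:
        total_size = len(loader)
        n_steps = 0
        for _ in range(self.max_epochs):
            for batch in loader:
                step(batch, total_size)  # one update per minibatch
                n_steps += 1
        return n_steps


# Usage: the step function only accumulates statistics here; in a real
# fit it would take one stochastic gradient step on the scaled ELBO.
seen = {"rows": 0, "sum": 0.0}


def step(batch: List[float], total_size: int) -> None:
    seen["rows"] += len(batch)
    seen["sum"] += sum(batch)


loader = SketchLoader(SketchDataset(1000), batch_size=128)
n_steps = SketchTrainer(max_epochs=1).fit(step, loader)
```

Keeping the trainer a bare loop means the control flow is visible in one place, which is one plausible reading of "callback-free".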

Weekly notes

Short, hand-written notes kept during the program: a working notebook for the design choices, dead ends, and small wins. Code lands at pymc-devs/pymc-extras; this page collects the prose.