Project · Active
Streaming Variational Inference for Large Datasets
Mentors. Rob Zinkov (@zaxtax, Berlin) and Chris Fonnesbeck (@fonnesbeck, Nashville).
Problem. PyMC's ADVI and Pathfinder assume the full dataset fits in memory. Financial tick data, sensor streams, and large panel datasets routinely exceed RAM and break this assumption.
Approach. A DataLoader that wraps an arbitrary Python iterable and feeds minibatches into PyMC's existing ELBO scaling machinery through a pm.Data placeholder, driven by a callback-free Trainer, plus streaming-compatible ADVI and Pathfinder loops with online convergence monitoring.
Timeline. May 5 – August 25, 2026. Midterm evaluation July 6–10, final submission August 17–24.
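To make the minibatch idea in the approach concrete, here is a minimal sketch in plain Python. It is not the project's DataLoader and does not call PyMC. The helper names (`iter_batches`, `scaled_batch_loglik`, `row_stream`) and the toy Normal model are assumptions made up for illustration. The sketch pulls fixed-size batches from a lazy iterable and rescales each batch's log-likelihood by total rows over batch size, which is the kind of rescaling a minibatch ELBO estimate depends on.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(rows: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Yield lists of up to batch_size rows; only one batch is held at a time."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def normal_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log density of Normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def scaled_batch_loglik(batch: List[float], total_rows: int,
                        mu: float, sigma: float) -> float:
    """Batch log-likelihood rescaled by total_rows / len(batch)."""
    scale = total_rows / len(batch)
    return scale * sum(normal_logpdf(x, mu, sigma) for x in batch)


def row_stream(n: int) -> Iterator[float]:
    """Hypothetical stand-in for a source too large to load: rows are generated lazily."""
    for i in range(n):
        yield math.sin(i) + 0.001 * i


TOTAL_ROWS = 1_000   # assumed known up front, as a row count would be for a file
BATCH_SIZE = 50      # divides TOTAL_ROWS, so every batch has the same size
MU, SIGMA = 0.5, 1.2 # arbitrary fixed parameters for the toy model

# Reference value: full-data log-likelihood from a single streaming pass.
full_loglik = sum(normal_logpdf(x, MU, SIGMA) for x in row_stream(TOTAL_ROWS))

# One rescaled estimate per minibatch.
estimates = [
    scaled_batch_loglik(batch, TOTAL_ROWS, MU, SIGMA)
    for batch in iter_batches(row_stream(TOTAL_ROWS), BATCH_SIZE)
]
mean_estimate = sum(estimates) / len(estimates)

print(len(estimates), math.isclose(mean_estimate, full_loglik, rel_tol=1e-9))
```

With equal-sized batches, the average of the rescaled batch terms over one pass equals the full-data log-likelihood up to floating-point rounding. That is why each individual batch can serve as a noisy but unbiased stand-in for the full dataset when batches are drawn at random.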
Why it matters
The result. ADVI and Pathfinder currently need every row resident, so once a dataset outgrows RAM, Bayesian inference is simply off the table. Streaming inference removes that limit. On the public Criteo 1 TB benchmark the streaming posterior matches an ordinary in-memory fit coefficient for coefficient, while peak memory stays around 1.5 GB. Materializing all 4.4 billion rows in memory would take roughly 490 GB, so the in-memory baseline cannot even start. The gain isn't speed; it's inference on data you otherwise couldn't touch.
The design. The abstraction this needed turned out to be one people already know: PyTorch's Dataset, DataLoader, and Trainer. So that is what it is called. total_size is just len(loader), and there is nothing new to learn.
Weekly notes
- Week 3: It was already a DataLoader, and a public-data check on Criteo (June 7, 2026)
- Week 2: Streaming ADVI on 122 GB, and the bug that didn't crash (June 6, 2026)
- Week 1: Where streaming hooks into PyMC's ELBO machinery (May 24, 2026)
Short, hand-written notes during the program. A working notebook for the design choices, dead ends, and small wins. Code lands at pymc-devs/pymc; this page collects the prose.