Incredibly excited to be part of Gunnar Rätsch’s group in Zürich (while still being based at and involved with Gatsby). Part of my time will oficially go into Shogun development.

# Costa Blanca

# Shogun now supports Intel MKL

Over the last few weeks and months, a few things came together that make Shogun both a lot easier to install, and a lot faster!

**EDIT**: While I was writing this post, Viktor leaked some of the results. I should work faster 😉

basic runtime performance comparison of various models on benchmark dataset of the new @ShogunToolbox release vs @scikit_learn #sklearn pic.twitter.com/FJeANfmo0e

— Viktor Gal (@vikgl) December 1, 2017

## Easier installation: conda integration & windows

Thanks to Dougal, who did an awesome job of integrating shogun with conda, installing Shogun is now as easy as

conda install -c conda-forge shogun

Viktor recently made this work under windows as well (not easy! yet only C++ interface, but this will change soon). Check his StackOverflow post if you want to give it a try. After the years and years of cryptic installation procedures these things hopefully make Shogun more accessible for new users. Thanks again Dougal!

## Faster Shogun: Lapack, Eigen3 and Intel MKL

How did we make Shogun faster? Let’s take a little peak under the hood!

For fast Machine Learning algorithms, we need well-tuned implementations for linear algebra operations. One commonly used set of tools is LAPACK/BLAS: the BLAS standard is set of low-level routines for performing basic linear algebra operations is; LAPACK is a set of routines for slightly more complex operations (e.g. matrix factorisations) on top of BLAS.

If we had an Intel CPU, we could use Intel’s math kernel library (MKL), a LAPACK/BLAS suite optimised for Intel CPUs. (for an example, check out the benchmarks of MKL-anaconda vs standard anaconda). However, since it is proprietory software it used to be hard to get a copy without having to pay. So when Anaconda recently started shipping a free version of MKL with their Python distribution, Viktor got to work to harness MKL for Shogun.

While historically using LAPACK/BLAS in many places (mostly with openblas), the recent Shogun has a flexible linear algebra backend which heavily uses Eigen3, a header-only, template based C++ implementation of a lot of linear algebra. Eigen3 claims to be at least as fast as most free and non-free LAPACK/BLAS suites. BUT Eigen3 lacks parallel implementations of many matrix operations. This is crucial for many ML algorithms. On the other hand, MKL has those parallel implementations, so we want to use them.

How does it work? Luckily, it is possible to compile Eigen3 code against MKL. Then MKL acts as a drop-in replacement within Shogun’s Eigen3/openblas backend. Long story short, let’s compare Shogun’s algorithms with Eigen3+openblas against the Eigen3+MKL version. Good news aside: Viktor recently set up a proper LAPACK detection in our cmake setup which makes everything work out of the box.

On a side note: I actually first found out about this when writing a paper on fast MMD implementations, where we compared an Eigendecomposition approach (MKL has multi-core versions!) with our own codes. Though we managed to beat it 🙂

## Compilig Shogun using Docker

To compare the performance of Shogun using MKL vs Eigen3/openblas, we need to have a Shogun version that links against each of them. The easiest way to get this into place — in a way that anyone reading this post could reproduce — is using a Docker container. If you install Shogun using conda (see above), the openblas version is downloaded, so in this case we want to compile from scratch.

I start from the official anaconda image, which currently is a Debian jessie. I download the image, fire up a container with it, and finally start a bash (make sure to read up on containers vs images).

sudo docker pull continuumio/anaconda3 sudo docker run -i -t continuumio/anaconda3 /bin/bash

## Installing dependencies

I want to compile Shogun, so I need a compiler and the C++ library (which are not part of the image). I also use a compiler cache that speed’s up compiling Shogun.

apt-get install -qq --force-yes --no-install-recommends make gcc g++ libc6-dev ccache

Next, since I want to use Shogun from Python, I need swig to generate bindings to Shogun’s C++ core. Unfortunately, the current swig version in Debian jessie is too old (3.0.2) for Shogun, which needs at least 3.0.5. The same is true for cmake. But using conda makes updating those straight-forward:

conda install swig cmake

Ok, we need one more thing: anaconda comes with its shiny new MKL and Shogun’s Eigen3 will be compiled against it. The compiler therefore needs the MKL header files:

conda install mkl-include

## Using Shogun without MKL (optional)

If you wanted to use Shogun’s non-MKL version, you could just install a precompiled binary version of Shogun using conda. If you want to however, compare the manually compiled versions with this installation, you would need to make conda forget about MKL (which installs openblas instead). This causes all MKL optimised packages to be re-installed (numpy, sklearn, etc). In addition, the blas header files are needed.

conda install -c anaconda nomkl apt-get install libopenblas-dev

Most people will skip this step.

## Compiling the source code

Let’s download Shogun’s latest source code (development version after our new 6.1.2 release).

cd /opt/ git clone https://github.com/shogun-toolbox/shogun.git

Let’s configure the beast. There is some options I set here: disable GPL codes & examples (which take time to compile) and disable xml serialization (which has some funny errors in this setup). More importantly, I set the (install-)prefix to the conda distribution of the anaconda image.

cd shogun mkdir build cd build cmake .. -DINTERFACE_PYTHON=On -DLICENSE_GPL_SHOGUN=Off -DUSE_SVMLIGHT=Off -DBUILD_META_EXAMPLES=Off -DBUILD_EXAMPLES=Off -DENABLE_LIBXML2=Off -DCMAKE_PREFIX_PATH=/opt/conda -DCMAKE_INSTALL_PREFIX=/opt/conda

Compile and install

make -j 4 make install

Let’s check that Shogun and its Python bindings do reference to either MKL or openblas. You can do that with

ldd /opt/conda/lib/libshogun.so | grep 'mkl\|blas' ldd /opt/conda/lib/python3.6/site-packages/_shogun.so | grep 'mkl\|blas'

For the procedure I outlined in this post, you should see something like

libmkl_rt.so => not found

Nevermind the “not found”, which is related to a broken ld setup in the anaconda image. Shogun sorts this out for you. The point is that there is either MKL **or** openblas. If you removed all the MKL packages first and installed openblas instead, it should be in the lines of

libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007f38e6eac000)

## Comparing runtimes

I use a very simple code snippet to compare runtime of two Shogun algorithms: linear regression and PCA, both on random data, see below. Both of them are based on a matrix factorisation, where the multi-threaded MKL implementation can shine.

Here are the walltimes (from a single run). I have a X1 Carbon Thinkpad with an Intel i7-7500U CPU, which has 2 cores and 4 threads.

Openblas | MKL | |
---|---|---|

Linear regression | 7.61 s | 2.09 s |

PCA | 23 s | 12.6 s |

Pretty epic difference, especially given that this comes essentially for free. When running the benchmark and monitoring my CPU, I was surprised to see that openblas actually uses **all four** system threads, while MKL only uses **two** (it prefers it that way) That is what I call efficient!

It is also very interesting that in Viktor’s tweet above, Shogun with MKL can be quite a bit faster than sklearn. There is a lot of things to be benchmarked here: for example, in contrast to sklearn, our SVM solvers are accelerated through MKL as well, as we ported the code to using our linear algebra backend.

## Conclusions

BLAS/LAPACK is a complicated topic! One take-away for me is that it is worth reading a bit about those things, as they do make a big difference.

A next step is to benchmark everything properly, using the benchmark framework by Marcus and Ryan from MLPack. In particular, I am curious how Shogun+MKL will then do compared to other ML libraries.

We should probably also make Shogun’s binary distributions (at least the one on conda) include an MKL build by default. For that, Shogun would have to move to the conda default channel, as conda-forge cannot have MKL. And for that, we need a BSD compatible release (currently Shogun is licensed under the viral GPL), which is in the making for a while now (and almost done).

## Appendix: Shogun Linear regression code

import shogun as sg import numpy as np N = 30000 N_test = 300000 D = 1500 features_train = sg.RealFeatures(np.random.randn(D, N)) features_test = sg.RealFeatures(np.random.randn(D, N_test)) labels_train = sg.RegressionLabels(np.random.randn(N)) labels_test = sg.RegressionLabels(np.random.randn(N_test)) tau = 0.001 lrr = sg.LinearRidgeRegression(tau, features_train, labels_train) %time lrr.train(); lrr.apply_regression(features_test)

## Appendix: Shogun PCA code

import shogun as sg import numpy as np N = 30000 N_test = 300000 D = 1500 D_target = 20 features_train = sg.RealFeatures(np.random.randn(D, N)) features_test = sg.RealFeatures(np.random.randn(D, N_test)) labels_train = sg.RegressionLabels(np.random.randn(N)) labels_test = sg.RegressionLabels(np.random.randn(N_test)) preprocessor = sg.PCA() preprocessor.set_target_dim(D_target) %time preprocessor.init(features_train); preprocessor.apply_to_feature_matrix(features_test)

# Defended my PhD thesis!

I had the most pleasant, interesting & fun viva experience I could have wished for. This is thanks to my great examiners Manfred Opper and Mark Herbster, and of course due to the best supervisor imaginable, Arthur Gretton. Thank you!

The folks at Shogun dedicated the 6.1.0 release to celebrate. Nice one!.

# A determinant-free method to simulate the parameters of large Gaussian fields

Together with Louis Ellam, Iain Murray, and Mark Girolami, we just published / arXived a new article on dealing with large Gaussian models. This is slightly related to the open problem around the GMRF model in our Russian Roulette paper back a while ago.

We propose a determinant-free approach for simulation-based Bayesian inference in high-dimensional Gaussian models. We introduce auxiliary variables with covariance equal to the inverse covariance of the model. The joint probability of the auxiliary model can be computed without evaluating determinants, which are often hard to compute in high dimensions. We develop a Markov chain Monte Carlo sampling scheme for the auxiliary model that requires no more than the application of inverse-matrix-square-roots and the solution of linear systems. These operations can be performed at large scales with rational approximations. We provide an empirical study on both synthetic and real-world data for sparse Gaussian processes and for large-scale Gaussian Markov random fields.

Article is here. Unfortunately, the journal is not open-access, but the arXiv version is.

# Efficient and principled score estimation

New paper online: Score matching goes Nystrom. With guarantees!

We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. We improved the computational efficiency of an earlier solution with a low-rank, Nystr\”om-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees.

# Google Summer of Code 2016

Great news: Shogun just got accepted to the GSoC 2016. After our break year in 2015, we are extremely excited to continue our GSoC tradition starting in 2011 (when I first joined Shogun).

If you are a student and wish to spend the summer hacking Machine Learning, guided by a vibrant international community of academics, professionals, and NERDS, then pay us a visit. Oh, and you will receive a cheque over $5000 from Google.

This year, we focus on framework improvements rather than solely adding new algorithms. Consequently, most projects have a heavy focus on packaging and software engineering questions. But there will be Machine Learning too. We are aiming high!

Check our our ideas list and read how to get involved.

# Adaptive Kernel Sequential Monte Carlo

And when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space…..

Ingmar Schuster wrote a nice blog post about our recently arxived paper draft. It is about using the Kameleon Kernel Adaptive Metropolis-Hastings proposal as an MCMC rejuvenation step in a Sequential Monte Carlo context.

# Kamiltonian Monte Carlo

Together with Dino, Sam, Zoltan, and Arthur, I ~~recently arxived a first draft~~ published an article on a project that combines two topics — the combination of which I find rather exciting: kernel methods and Hamiltonian Monte Carlo.

**Updates.**

**January:**Presented the poster at MCMSki**December III:**Arthur gave a talk (slides) the scalable Monte Carlo workshop at NIPS 2015**December II:**We had**December I:**NIPS camera ready version now on arXiv**November**: Implementation of kernel hmc online**October**: Implementation of gradient estimation online**September:**Accepted at NIPS 2015! (and a name change to come …)**July**: Our reply to Xian’s blog post was also featured. Check out the IPython notebook I wrote for that.**June 2015**: Our draft was discussed on Xian’s Og and caused a lively discussion on the sense and non-sense of pseudo-marginal samplers and regularisation strength of kernel methods.

Here is the abstract.

## We propose

~~Kamiltonian~~Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities where classical HMC is not an option due to intractable gradients, KMC adaptively learns the target’s gradient structure by fitting an exponential family model in a Reproducing Kernel Hilbert Space. Computational costs are reduced by two novel efficient approximations to this gradient. While being asymptotically exact, KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing improvements over state-of-the-art gradient free samplers. We support our claims with experimental studies on both toy and real-world applications, including Approximate Bayesian Computation and exact-approximate MCMC.

**Motivation: HMC with intractable gradients.**

Many recent applications of MCMC focus on models where the target density function is intractable. A very simple example is the context of Pseudo-Marginal MCMC (PM-MCMC), for example in Maurizio’s paper on Bayesian Gaussian Processes classification. In such (simple) models the *marginal likelihood $p(\mathbf{y}|\theta )$ *is unavailable in closed form, but only can be estimated. Performing Metropolis-Hastings style MCMC on $\hat{p}(\theta |\mathbf{y})$ results in a Markov chain that (remarkably) converges to the *true* posterior. So far so good. But no gradients.

Sometimes, people argue that for simple objects as latent Gaussian models, it is possible to side-step the intractable gradients by running an MCMC chain on the joint space $(\mathbf{f},\theta)$ of latent variables and hyper-parameters, which makes gradients available (and also comes with a set of other problems such as high correlations between $\mathbf{f}$ and $\theta$, etc). While I don’t want to get into this here (we doubt existence of a one-fits-all solution), there is yet another case where gradients are unavailable.

*Approximate Bayesian Computation* is based on the context where the likelihood itself is a black-box, i.e. it can only be simulated from. Imagine a physicist coming to you with three decades of intuition in the form of some Fortran code, which contains bits implemented by people who are not alive anymore — and he wants to do Bayesian inference on the code-parameters via MCMC… Here we have to give up on getting the exact answer, but rather simulate from a biased posterior. And of course, no gradients, no joint distribution.

State-of-the-art methods on such targets are based on adaptive random walks, as no gradient information is available per-se. The Kameleon (KAMH) improves over other adaptive MCMC methods by constructing locally aligned proposal covariances. Wouldn’t it be cooler to harness the power of HMC?

**Kamiltonian Monte Carlo** starts as a Random Walk Metropolis (RWM) and then smoothly transitions into HMC. It explores the target *at least as fast* as RWM (we proof that), but improves the mixing in areas where it has been before.

We do this by learning the target gradient structure from the MCMC trajectory in an adaptive MCMC framework — using kernels. Every MCMC iteration, we update our gradient estimator with a decaying probability $a_t$ that ensures that we never stop updating, but update less and less, i.e. $$\sum_{t=1}^\infty \frac{1}{a_t}=\infty\qquad\text{and}\qquad\sum_{t=1} ^\infty \frac{1}{a_t^2}=0.$$ Christian Robert challenged our approach: using non-parametric density estimates for MCMC proposals directly is a bad idea: if certain parts of the space are not explored before adaptation effectively stopped, the sampler will almost never move there. For KMC (and for KAMH too) however, this is not a problem: rather than using density estimators as proposal directly, we use them for *proposal construction*. This way, these algorithms inherit ergodicity properties from random walks. I coded an example-demo here.

**Kernel exponential families as gradient surrogates.**

The density (gradient) estimator itself is an infinite dimensional exponential family model. More precisely, we model the un-normalised target log-density $\pi$ as an RKHS function $f\in\mathcal{H}$, i.e. $$\text{const}\times\pi(x)\approx\exp\left(\langle f,k(x,\cdot)\rangle_{{\cal H}}-A(f)\right),$$ which in particular implies $\nabla f\approx\nabla\log\pi.$ Surprisingly, and despite a complicated normaliser $A(f)$, such a model can be consistently fitted by directly minimising the expected $L^2$ distance of model and true gradient, $$J(f)=\frac{1}{2}\int\pi(x)\left\Vert \nabla f(x)-\nabla\log \pi (x)\right\Vert _{2}^{2}dx.$$ The magic word here is score-matching. You could ask: “Why not use kernel density estimation?” The answer: “Because it breaks in more than a few dimensions.” In contrast, we are actually able to make the above gradient estimator work in ~100 dimensions on Laptops.

**Two approximations.**

The über-cool infinite exponential family estimator, like all kernel methods, doesn’t scale nicely in the number $n$ of data used for estimation — and here neither in the in input space dimension $d$. Matrix inversion with costs $\mathcal{O}(t^3d^3)$ becomes a blocker, in particular as $t$ here is the growing number of MCMC iterations. KMC comes in two variants, which correspond to different approximations to the original kernel exponential family model.

*KMC lite*expresses the log-density in a smaller dimensional (yet growing) sub-space, via collapsing all $d$ input space dimensions into one. It takes the dual form $$f(x)=\sum_{i=1}^{n}\alpha_{i}k(x_{i},x),$$ where the $x_i$ are $n$ random sub-samples (just like KAMH) from the Markov chain history. Downside: KMC lite has to be re-estimated whenever the $x_i$ change. Advantage: The estimator’s tails vanish outside the $x_i$, i.e. $\lim_{\|x\|_{2}\to\infty}\|\nabla f(x)\|_{2}=0$, which translates into a geometric ergodicity result as we will see later.*KMC finite*approximates the model as a*finite*dimensional exponential family in primal form, $$f(x)=\langle\theta,\phi_{x}\rangle_{{\cal H}_{m}}=\theta^{\top}\phi_{x},$$ where $x\in\mathbb{R}^{d}$ is embedded into a finite dimensional feature space $\phi_{x}\in{\cal H}_{m}=\mathbb{R}^{m}.$ While other choices are possible, we use the Randon Kitchen Sink framework: a $m$-dimensional data independent random Fourier basis. Advantage: KMC lite is an efficient on-line estimator that can be*updated at constant costs*in $t$ — we can fit it on*all*of the MCMC trajectory. Disadvantage: Its do not decay and our proof for geometric ergodicity of KMC lite does not apply.

**Increasing dimensions.**

So far, we did not work out how the approximation errors propagate through the kernel exponential family estimator, but we plan to do that at some point. The paper contains an empirical study which shows that the gradients are good enough for HMC up to ~100 dimensions — Under a “Gaussian like smoothness” assumption. The below plots show the acceptance probability along KMC trajectories and quantify how “close” KMC proposals are to HMC proposals.

**From RWM to HMC.**

Using the well known and (ab)used Banana density as a target, we feed a *non-adaptive* version of KMC and friends with an increasing number of so-called “oracle” samples (iid from the target), and then quantify how well they mix. While this scenario is totally straw-man, it allows to compare the mixing behaviour of the algorithms *after* a long burn-in. The below plots show KMC transitioning from a random walk into something close to HMC as the number of “oracle” samples (x-axis) increases.**ABC — Reduced simulations and no additional bias.**

While there is another example in the paper, I want to show the ABC one here, which I find most interesting. Recall in ABC, the likelihood is not available. People often use synthetic likelihoods that are for example Gaussian, which induces an additional bias (in addition to ABC itself) but might improve statistical efficiency. In an algorithm called Hamiltonian ABC, such a synthetic likelihood is combined with stochastic gradient HMC (SG-HMC) via randomized finite differences, called simultaneous perturbation stochastic approximation (SPSA), which works as follows. To evaluate the gradient at a position $\theta$ in sampling space:

- Generate a random SPSA mask $\Delta$ (set of binary directions) and compute the perturbed $\theta+\Delta$ and $\theta+\Delta$.
- Interpolate linearly between them the perturbations. That is simulate from the ABC likelihood at both points, construct the synthetic likelihood, and use their difference.
- The gradient is a (step-size dependent!) approximation to the unknown gradient of the (biased) synthetic likelihood model.
- Perform a single stochastic HMC leapfrog step (adding friction as described in the SG-HMC paper)
- Iterate for $L$ leapfrog iterations.

What I find slightly irritating is that this algorithm needs to simulate from the ABC likelihood in *every* leapfrog iteration — and then discards the gained information after a leapfrog step has been taken. How many leapfrog steps are common in HMC? Radford Neal suggests to start with $L\in[100,1000]$. Quite a few ABC simulations come with this! But there are more issues:

*SG-HMC mixing*. I found that stochastic gradient HMC mixes very poorly when the gradient noise large. Reducing noise costs even more ABC simulations. Wrongly estimated noise (non-stationary !?!) induces bias due to the “always accept” mentality. Step-size decreasing to account for that further hurts mixing.*Bias*. The synthetic likelihood can fail spectacularly, if the true likelihood is skewed for example.

KMC does not suffer from either of those problems: It keeps on improving its mixing as it sees more samples, while only requiring *a single* ABC simulation at each MCMC iterations — rather than HABC’s $2L$ (times noise-reduction-repetitions). It targets the true (modulo ABC bias) posterior, while accumulating information of the target (and its gradients) in the form of the MCMC trajectory.

**But how do you choose the kernel parameter?**

Often, this question is a threat for any kernel-based algorithm. For example, for the KAMH algorithm, it is completely unclear (to us!) how select these parameters in a principled way. For KMC, however, we can simply use cross-validation on the score matching objective function above. In practice, we use a black box optimisation procedure (for example CMA or Bayesian optimisation) to on-line update the kernel hyper-parameters every few MCMC iterations. See an example where our Python code does this here. Just like the updates to the gradient estimator itself, this can be done with a decaying probability to ensure asymptotic correctness.

~~Unbiased~~ Big Bayes without sub-sampling bias: Paths of Partial Posteriors

Together with Dino Sejdinovic and Mark Girolami, I recently arxived our second draft of a paper on the popular topic of how to scale–up Bayesian inference when facing large datasets.

This project is yet another method to scale up Bayesian inference via sub-sampling. But unlike most other work on this field, we do not use sub-sampling within an MCMC algorithm to simulate from the posterior distribution of interest. Rather, we use sub-sampling to create a series of easier inference problems, whose results we then combine to get the answer to the question we were originally asking — in an estimation sense, without full posterior simulation. I’ll describe the high-level ideas now.

**Updates on the project:**

**Jun:**I was invited to present the work at the Big Bayes workshop in Oxford.**Mar:**Change of title by popular demand 😉**Mar:**I presented the work at the CSML lunch at UCL.**Feb:**Our reply to Christian’s critique was also featured. Thanks!**Feb:**We got featured in Xi’an’s Og, including a nice discussion in the comments**Feb:**I presented the work at the Oxford Machine Learning lunch. Good discussion with the crowd. We will probably change the name to*Big Bayes without sub-sampling bias*.**Jan:**I gave a 15 minute talk at the Theory of Big data workshop at UCL.

Let’s assume the following desiderata for large-scale Bayesian estimators:

- Average computational costs
*sub-linear*in the data size $N$. - Bounded variance (even for growing $N$) and can be controlled.
- No sub-sampling bias is introduced. The method inherits the usual finite time bias from MCMC though.
- Not limited to certain models or inference schemes.
- Trade off redundancy in the data against computational savings.
- Perform competitive in practice.

**Bayesian inference** usually involves integrating a functional $\varphi:\Theta\rightarrow\mathbb{R}$ over a posterior distribution $$\phi=\int d\theta\varphi(\theta)\underbrace{p(\theta|{\cal D})}_{\text{posterior}},$$ where $$p(\theta|{\cal D})\propto\underbrace{p({\cal D}|\theta)}_{\text{likelihood data}}\underbrace{p(\theta)}_{\text{prior}}.$$ For example, to compute the predictive posterior mean for a linear regression model. This is often done by constructing an MCMC algorithm to simulate $m$ approximately iid samples from approximately $\theta^{(j)}\sim p(\theta|{\cal D})$ (the second “approximately” here refers to the fact that MCMC *is biased* for any finite length of the chain, and the same is true for our method) and then performing Monte Carlo integration $$\phi\approx\frac{1}{m}\sum_{j=1}^{m}\varphi(\theta^{(j)}).$$ Constructing such Markov chains can quickly become infeasible as usually, *all data* needs to be touched in *every iteration*. Take for example the standard Metropolis-Hastings transition kernel to simulate from $\pi(\theta)\propto p(\theta|{\cal D})$, where at a current point $\theta^{(j)}$, a proposed new point $\theta’\sim q(\theta|\theta^{(j)})$ is accepted with probability $$\min\left(\frac{\pi(\theta’)}{\pi(\theta^{(j)})}\times\frac{q(\theta^{(j)}|\theta’)}{q(\theta’|\theta^{(j)})},1\right).$$ Evaluating $\pi$ requires to iterate through the whole dataset. A natural question therefore is: is it possible to only use subsets of $\mathcal{D}$?

**Existing work. **There has been a number of papers that tried to use sub-sampling *within* the transition kernel of MCMC:

*Stochastic gradient Langevin*. Use sub-sampling in stochastic gradient descent with decreasing step-size.*Austerity in MCMC land*: Perform a hypothesis test to decide whether to accept/reject*An adaptive sub-sampling approach*: Use concentration bounds to decide whether to accept/reject. Quantify error, establish convergence.*Firefly Monte Carlo*: Augment the state space of the Markov chain and combine with a lower bound on the likelihood

All these methods have in common that they either introduce bias (in addition to the bias caused by MCMC), have no convergence guarantees, or mix badly. This comes from the fact that it is extremely difficult to modify the Markov transition kernel and maintain its asymptotic correctness.

In our paper, we therefore take a different perspective. *If* the goal is to solve an *estimation* problem as the one above, we should not make our life hard trying to *simulate. *We construct a method that directly estimates posterior expectations — *without* simulating from the full posterior, and *without* introducing sub-sampling bias.

**The idea** is very simple:

- Construct a series of
*partial posterior distributions*$\tilde{\pi}_{l}:=p({\theta|\cal D}_{l})\propto p({\cal D}_{l}|\theta)p(\theta)$ from subsets of sizes $$\vert\mathcal{D}_1\vert=a,\vert\mathcal{D}_2\vert=2^{1}a,\vert\mathcal{D}_3\vert=2^{2}a,\dots,\vert\mathcal{D}_L\vert=2^{L-1}a$$ of the data. This gives $$p(\theta)=\tilde{\pi}_{0}\rightarrow\tilde{\pi}_{1}\rightarrow\tilde{\pi}_{2}\rightarrow\dots\rightarrow\tilde{\pi}_{L}=\pi_{N}=p({\theta|\cal D}).$$ - Compute posterior expectations $\phi_{t}:=\hat{\mathbb{E}}_{\tilde{\pi}_{t}}\{\varphi(\theta)\}$ for each of the partial posteriors. This can be done for example MCMC, or other inference methods. This gives a sequence of real valued partial posterior expectations — that converges to the true posterior.
- Use the debiasing framework, by Glynn & Rhee. This is a way to unbiasedly estimate the limit of a sequence
*without*evaluating all elements. Instead, one randomly truncates the sequence at term $T$ (which should be “early” in some sense, $T$ is a random variable), and then computes $$\phi^*=\sum_{t=1}^T \frac{\phi_{t}-\phi_{t-1}}{\mathbb{P}\left[T\geq t\right]}.$$ This is now an unbiased estimator for the full posterior expectation. The intuition is that one can rewrite any sequence limit as a telescopic sum $$\lim_{t\to\infty} \phi_t=\sum_{t=1}^\infty (\phi_{t}-\phi_{t-1})$$ and then randomly truncate the infinite sum and divide through truncation probabilities to leave the expectation untouched, i.e. $$\mathbb{E}[\phi^*]=\phi,$$where $\phi$ is the sequence limit. - Repeat $R$ times for reducing the estimator variance as $1/R$.

In the paper, we choose the random truncation time as $$\mathbb{P}(T=t)\propto 2^{-\alpha t}.$$ From here, given a certain choice of $\alpha$, it is straight-forward to show the average computational cost of this estimator is $\mathcal{O} (N^{1-\alpha}),$ which is *sub-linear* in $N$. We also show that variance is finite and can be bounded even when $N\to\infty$. This statement is modulo certain conditions on the $\phi_t$. In addition, the choice of the $\alpha$-parameter of the random truncation time $T$ is crucial for performance. See the paper for details how to set it.

**Experimental results** are very encouraging. On certain models, we are able to estimate model parameters accurately before plain MCMC even has finished a *single* iteration. Take for example a log-normal model $$p(\mu,\sigma|{\cal D})\propto\prod_{i=1}^{N}\log{\cal N}(x_{i}|\mu,\sigma^{2})\times\text{flat prior}$$ and functional $$\varphi((\mu,\sigma))=\sigma.$$ We chose the true parameter $\sigma=\sqrt{2}$ and ran our method in $N=2^{26}$ data. The pictures below contain some more details.

**Comparison to stochastic variational inference** is something we only thought of quite late in the project. As we state in the paper, as mentioned in my talks, and as pointed out in feedback from the community, we *cannot* do uniformly better than MCMC. Therefore, comparing to full MCMC is tricky — even though there exist trade-offs of data redundancy and model complexity that we *can* exploit as showed in the above example. Comparing to other “limited data” schemes really showcases the strength (and generality) of our method.

We chose to compare against “Gaussian Processes for Big Data“, which puts together a decade of engineering of variational lower bounds, the inducing features framework for GPs, and stochastic variational inference to solve Gaussian Process regression on roughly a million data points. In contrast, our scheme is very general and *not *specifically designed for GPs. However, we are able to perform similar to this state of the art method. On predicting airtime delays, we reach a slightly lower RMSE at comparable computing effort. The paper contains details on our GP, which is slightly different (but comparable) to the one used in the GP for Big Data paper. Another neat detail of the GP example is that the likelihood does not factorise, which is in contrast to most other Bayesian sub-sampling schemes.

**In summary**, we arrive at

- Computational costs
*sub-linear*in $N$ *No*introduced sub-sampling bias (in addition to MCMC, if used)- Finite, and
*controllable*variance - A general scheme that is not limited to MCMC nor to factorising likelihoods
- A scheme that trades off redundancy for computational savings
- A practical scheme that performs competitive to state-of-the-art methods and that has
*no problems with transition kernel design.* - See the paper for a list of weaknesses and open problems.