# Shogun participates in GSoC 2018

We are happy to announce that Shogun’s umbrella organisation, NumFOCUS, has been accepted for GSoC 2018 — and we will be mentoring through them!

If you are a talented student who wants to spend a summer hacking open-source, make sure to apply! Check out the NumFOCUS GSoC ideas page (contains Shogun’s and other projects’ ideas). Our very own ideas page can also be found in our wiki.

GSoC is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 3-month programming project during their break from school.

Shogun provides efficient implementations of standard and state-of-the-art machine learning algorithms in an accessible, open-source environment. Shogun has been involved in GSoC since 2011; check out our blog posts about past years.

NumFOCUS is a nonprofit dedicated to supporting and promoting world-class, innovative, open source scientific computing.

# Started an academic visit at ETHZ

Incredibly excited to be part of Gunnar Rätsch’s group in Zürich (while still being based at and involved with Gatsby). Part of my time will officially go into Shogun development.

# Shogun now supports Intel MKL

Over the last few weeks and months, a few things came together that make Shogun both a lot easier to install, and a lot faster!

EDIT: While I was writing this post, Viktor leaked some of the results. I should work faster 😉

## Easier installation: conda integration & Windows

Thanks to Dougal, who did an awesome job of integrating Shogun with conda, installing Shogun is now as easy as

conda install -c conda-forge shogun

Viktor recently made this work under Windows as well (not easy! So far only the C++ interface is available, but this will change soon). Check his StackOverflow post if you want to give it a try. After years and years of cryptic installation procedures, these things hopefully make Shogun more accessible for new users. Thanks again Dougal!

## Faster Shogun: Lapack, Eigen3 and Intel MKL

How did we make Shogun faster? Let’s take a little peek under the hood!

For fast Machine Learning algorithms, we need well-tuned implementations of linear algebra operations. One commonly used set of tools is LAPACK/BLAS: BLAS is a standard set of low-level routines for basic linear algebra operations; LAPACK is a set of routines for slightly more complex operations (e.g. matrix factorisations) built on top of BLAS.
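To make the split between the two layers concrete, here is a small pure-Python sketch. It only illustrates *what* the two layers compute, not how real BLAS/LAPACK libraries do it (those use blocked, vectorised and multi-threaded kernels). The function names `dot`, `matmul` and `cholesky` are made up for this illustration: the first two stand in for BLAS-style primitives, the last for a LAPACK-style factorisation built on top of them.

```python
import math


def dot(x, y):
    """BLAS-style primitive: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def matmul(A, B):
    """BLAS-style primitive: dense matrix-matrix product (lists of rows)."""
    cols_B = list(zip(*B))
    return [[dot(row, col) for col in cols_B] for row in A]


def cholesky(A):
    """LAPACK-style routine: factorise a symmetric positive definite
    matrix A as L * L^T, using only the dot() primitive above."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = dot(L[i][:j], L[j][:j])
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


# Small illustrative input (not from the benchmarks in this post).
A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
L_T = [list(col) for col in zip(*L)]
reconstructed = matmul(L, L_T)  # should reproduce A up to rounding
print(L)
print(reconstructed)
```

A tuned library does the same arithmetic, but the speed of the factorisation depends almost entirely on how fast the low-level primitives are, which is why swapping the BLAS implementation can matter so much.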

If we had an Intel CPU, we could use Intel’s Math Kernel Library (MKL), a LAPACK/BLAS suite optimised for Intel CPUs (for an example, check out the benchmarks of MKL-anaconda vs standard anaconda). However, since it is proprietary software, it used to be hard to get a copy without having to pay. So when Anaconda recently started shipping a free version of MKL with their Python distribution, Viktor got to work to harness MKL for Shogun.

While Shogun historically used LAPACK/BLAS in many places (mostly with openblas), recent versions have a flexible linear algebra backend that heavily uses Eigen3, a header-only, template-based C++ implementation of a lot of linear algebra. Eigen3 claims to be at least as fast as most free and non-free LAPACK/BLAS suites. BUT Eigen3 lacks parallel implementations of many matrix operations, and those are crucial for many ML algorithms. MKL, on the other hand, has those parallel implementations, so we want to use them.

How does it work? Luckily, it is possible to compile Eigen3 code against MKL. Then MKL acts as a drop-in replacement within Shogun’s Eigen3/openblas backend. Long story short, let’s compare Shogun’s algorithms with Eigen3+openblas against the Eigen3+MKL version. Good news as an aside: Viktor recently set up proper LAPACK detection in our cmake setup, which makes everything work out of the box.

On a side note: I actually first found out about this when writing a paper on fast MMD implementations, where we compared an Eigendecomposition approach (MKL has multi-core versions!) with our own codes. Though we managed to beat it 🙂

## Compiling Shogun using Docker

To compare the performance of Shogun using MKL vs Eigen3/openblas, we need to have a Shogun version that links against each of them. The easiest way to get this into place — in a way that anyone reading this post could reproduce — is using a Docker container. If you install Shogun using conda (see above), the openblas version is downloaded, so in this case we want to compile from scratch.

I start from the official anaconda image, which is currently based on Debian jessie. I download the image, fire up a container with it, and finally start a bash shell (make sure to read up on containers vs images).

sudo docker pull continuumio/anaconda3
sudo docker run -i -t continuumio/anaconda3 /bin/bash


## Installing dependencies

I want to compile Shogun, so I need a compiler and the C++ library (which are not part of the image). I also use a compiler cache that speeds up compiling Shogun.

apt-get install -qq --force-yes --no-install-recommends make gcc g++ libc6-dev ccache

Next, since I want to use Shogun from Python, I need swig to generate bindings to Shogun’s C++ core. Unfortunately, the current swig version in Debian jessie (3.0.2) is too old for Shogun, which needs at least 3.0.5. The same is true for cmake. But using conda makes updating those straightforward:

conda install swig cmake


Ok, we need one more thing: anaconda comes with its shiny new MKL and Shogun’s Eigen3 will be compiled against it. The compiler therefore needs the MKL header files:

conda install mkl-include

## Using Shogun without MKL (optional)

If you wanted to use Shogun’s non-MKL version, you could just install a precompiled binary version of Shogun using conda. If, however, you want to compare the manually compiled versions with this installation, you need to make conda forget about MKL (which installs openblas instead). This causes all MKL-optimised packages (numpy, sklearn, etc.) to be re-installed. In addition, the BLAS header files are needed.

conda install -c anaconda nomkl
apt-get install libopenblas-dev

Most people will skip this step.

## Compiling the source code

Let’s download Shogun’s latest source code (development version after our new 6.1.2 release).

cd /opt/
git clone https://github.com/shogun-toolbox/shogun.git

Let’s configure the beast. There are some options I set here: disable GPL codes & examples (which take time to compile) and disable xml serialization (which has some funny errors in this setup). More importantly, I set the (install-)prefix to the conda distribution of the anaconda image.

cd shogun
mkdir build
cd build
cmake .. -DINTERFACE_PYTHON=On -DLICENSE_GPL_SHOGUN=Off -DUSE_SVMLIGHT=Off -DBUILD_META_EXAMPLES=Off -DBUILD_EXAMPLES=Off -DENABLE_LIBXML2=Off -DCMAKE_PREFIX_PATH=/opt/conda -DCMAKE_INSTALL_PREFIX=/opt/conda

Compile and install:

make -j 4
make install

Let’s check that Shogun and its Python bindings reference either MKL or openblas. You can do that with

ldd /opt/conda/lib/libshogun.so | grep 'mkl\|blas'
ldd /opt/conda/lib/python3.6/site-packages/_shogun.so | grep 'mkl\|blas'

For the procedure I outlined in this post, you should see something like

libmkl_rt.so => not found

Never mind the “not found”, which is related to a broken ld setup in the anaconda image. Shogun sorts this out for you. The point is that there is either MKL or openblas. If you removed all the MKL packages first and installed openblas instead, it should be along the lines of

libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007f38e6eac000)
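If you prefer to do this check from Python rather than eyeballing the `grep` output, something like the following sketch could work. The helper names `linked_blas_backends` and `run_ldd` are made up for illustration, and the matching is deliberately naive: it only looks at library names starting with `libmkl` or `libopenblas`, as in the two sample lines shown above.

```python
import subprocess


def linked_blas_backends(ldd_output):
    """Return the set of BLAS backends ('mkl', 'openblas') named in ldd output."""
    backends = set()
    for line in ldd_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # The shared library name is the first token on each ldd line.
        name = tokens[0]
        if name.startswith("libmkl"):
            backends.add("mkl")
        elif name.startswith("libopenblas"):
            backends.add("openblas")
    return backends


def run_ldd(path):
    """Run ldd on a shared library and return its text output (Linux only)."""
    return subprocess.check_output(["ldd", path], universal_newlines=True)


# The two sample lines from this post.
mkl_sample = "libmkl_rt.so => not found"
openblas_sample = "libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007f38e6eac000)"

print(linked_blas_backends(mkl_sample))
print(linked_blas_backends(openblas_sample))
```

Inside the container, `linked_blas_backends(run_ldd("/opt/conda/lib/libshogun.so"))` would then tell you which backend the freshly built library links against.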

## Comparing runtimes

I use a very simple code snippet to compare the runtime of two Shogun algorithms: linear regression and PCA, both on random data (see the appendices below). Both of them are based on a matrix factorisation, where the multi-threaded MKL implementation can shine.

Here are the walltimes (from a single run). I have an X1 Carbon Thinkpad with an Intel i7-7500U CPU, which has 2 cores and 4 threads.

| | Openblas | MKL |
|---|---|---|
| Linear regression | 7.61 s | 2.09 s |
| PCA | 23 s | 12.6 s |
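Expressed as speedup factors, the numbers above work out as follows. This is just arithmetic on the single-run walltimes reported here, so treat the factors as rough indications rather than a proper benchmark.

```python
# Walltimes in seconds, copied from the table above: (openblas, mkl).
walltimes = {
    "Linear regression": (7.61, 2.09),
    "PCA": (23.0, 12.6),
}

# Speedup = time with openblas / time with MKL.
speedups = {name: openblas / mkl for name, (openblas, mkl) in walltimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster with MKL")
# Linear regression: 3.64x faster with MKL
# PCA: 1.83x faster with MKL
```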

Pretty epic difference, especially given that this comes essentially for free. When running the benchmark and monitoring my CPU, I was surprised to see that openblas actually uses all four system threads, while MKL only uses two (it prefers it that way). That is what I call efficient!
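If you want to rule out the thread count as a factor when repeating the comparison, both libraries can be told how many threads to use through environment variables (`MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and the generic OpenMP `OMP_NUM_THREADS`). These need to be set before the library is loaded. The sketch below is not something I ran for the numbers above; the helper name `blas_thread_env` is made up, and the value 2 is simply the physical core count of my machine.

```python
import os


def blas_thread_env(num_threads, base_env=None):
    """Return a copy of the environment with BLAS thread limits set."""
    env = dict(os.environ if base_env is None else base_env)
    for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS"):
        env[var] = str(num_threads)
    return env


# Build an environment that pins both backends to two threads.
env = blas_thread_env(2, base_env={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the benchmark script in a fresh Python process.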

It is also very interesting that in Viktor’s tweet above, Shogun with MKL can be quite a bit faster than sklearn. There are a lot of things to be benchmarked here: for example, in contrast to sklearn, our SVM solvers are accelerated through MKL as well, as we ported the code to use our linear algebra backend.

## Conclusions

BLAS/LAPACK is a complicated topic! One take-away for me is that it is worth reading a bit about those things, as they do make a big difference.

A next step is to benchmark everything properly, using the benchmark framework by Marcus and Ryan from MLPack. In particular, I am curious how Shogun+MKL will then do compared to other ML libraries.

We should probably also make Shogun’s binary distributions (at least the one on conda) include an MKL build by default. For that, Shogun would have to move to the conda default channel, as conda-forge cannot have MKL. And for that, we need a BSD-compatible release (currently Shogun is licensed under the viral GPL), which has been in the making for a while now (and is almost done).

## Appendix: Shogun Linear regression code

import shogun as sg
import numpy as np

N = 30000
N_test = 300000
D = 1500

features_train = sg.RealFeatures(np.random.randn(D, N))
features_test = sg.RealFeatures(np.random.randn(D, N_test))
labels_train = sg.RegressionLabels(np.random.randn(N))
labels_test = sg.RegressionLabels(np.random.randn(N_test))
tau = 0.001
lrr = sg.LinearRidgeRegression(tau, features_train, labels_train)
%time lrr.train(); lrr.apply_regression(features_test)
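Note that `%time` is an IPython/Jupyter magic and will not work in a plain Python script. A minimal stand-in using only the standard library could look like this; the helper name `timed` is made up, and the `sum` call is just a placeholder workload.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Placeholder workload; replace with the training and prediction calls.
result, seconds = timed(sum, range(1000000))
print(f"wall time: {seconds:.3f} s")
```

In the snippet above, one would wrap the training and prediction calls in a small function and pass that to `timed`.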


## Appendix: Shogun PCA code


import shogun as sg
import numpy as np

N = 30000
N_test = 300000
D = 1500
D_target = 20

features_train = sg.RealFeatures(np.random.randn(D, N))
features_test = sg.RealFeatures(np.random.randn(D, N_test))
labels_train = sg.RegressionLabels(np.random.randn(N))
labels_test = sg.RegressionLabels(np.random.randn(N_test))

preprocessor = sg.PCA()
preprocessor.set_target_dim(D_target)

%time preprocessor.init(features_train); preprocessor.apply_to_feature_matrix(features_test)



# Defended my PhD thesis!

I had the most pleasant, interesting & fun viva experience I could have wished for. This is thanks to my great examiners Manfred Opper and Mark Herbster, and of course due to the best supervisor imaginable, Arthur Gretton. Thank you!

The folks at Shogun dedicated the 6.1.0 release to celebrate. Nice one!

# A determinant-free method to simulate the parameters of large Gaussian fields

Together with Louis Ellam, Iain Murray, and Mark Girolami, we just published / arXived a new article on dealing with large Gaussian models. This is slightly related to the open problem around the GMRF model in our Russian Roulette paper from a while back.

We propose a determinant-free approach for simulation-based Bayesian inference in high-dimensional Gaussian models. We introduce auxiliary variables with covariance equal to the inverse covariance of the model. The joint probability of the auxiliary model can be computed without evaluating determinants, which are often hard to compute in high dimensions. We develop a Markov chain Monte Carlo sampling scheme for the auxiliary model that requires no more than the application of inverse-matrix-square-roots and the solution of linear systems. These operations can be performed at large scales with rational approximations. We provide an empirical study on both synthetic and real-world data for sparse Gaussian processes and for large-scale Gaussian Markov random fields.

Article is here. Unfortunately, the journal is not open-access, but the arXiv version is.

# Efficient and principled score estimation

New paper online: Score matching goes Nyström. With guarantees!

We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. We improved the computational efficiency of an earlier solution with a low-rank, Nyström-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees.

https://arxiv.org/abs/1705.08360

# Google Summer of Code 2016

Great news: Shogun just got accepted to GSoC 2016. After our break year in 2015, we are extremely excited to continue our GSoC tradition, which started in 2011 (when I first joined Shogun).

If you are a student and wish to spend the summer hacking Machine Learning, guided by a vibrant international community of academics, professionals, and NERDS, then pay us a visit. Oh, and you will receive a cheque for \$5000 from Google.

This year, we focus on framework improvements rather than solely adding new algorithms. Consequently, most projects have a heavy focus on packaging and software engineering questions. But there will be Machine Learning too. We are aiming high!

Check out our ideas list and read how to get involved.