GSoC 2013

Shogun got accepted in the Google Summer of Code 2013!

To read my blog about the GSoC, click here.

Check out our ideas page.

This year, I will be a mentor rather than a student, and I am very excited about this.
I’ll be offering two projects:

  • Implement Gaussian process classification (joint with Oliver Stegle). This is an extension of last year's GSoC project and should be quite interesting while not being too complicated (link).
  • Implement unbiased estimators of likelihoods of very large, sparse Gaussian distributions (joint with Erlend Aune and Daniel Simpson). This one is quite challenging since it involves many different topics. However, it should also be very interesting (link).

Shogun blog posts

Shogun 2.1 is out!

We released SHOGUN 2.1. See the announcement (link).

The release features my recent work on kernel selection for MMD-based kernel two-sample testing, along with a streaming-based implementation of it. See the blog entry. We also added a new unit-testing framework, about which I am very excited since we finally have a mechanism for detecting code errors. We also got yet another interface language (Perl). Very cool stuff and lots of work/blood/sweat/fun with the other guys. Check it out!

The next thing to come is a workshop on machine learning with SHOGUN on July 12 at the c-base in Berlin. Stay tuned!

SHOGUN – A large scale machine learning toolbox

To read my blog about SHOGUN development, click here.

SHOGUN (website) is a machine learning toolbox whose focus is on large-scale kernel methods, especially Support Vector Machines. It provides a generic SVM interface to several different state-of-the-art SVM implementations.

Each of the SVMs can be combined with a variety of kernels. The toolbox provides efficient implementations of many common kernels.

Many other popular machine learning algorithms are implemented as well, and the list is continuously extended, for example through the Google Summer of Code: there are now Gaussian processes, many dimensionality reduction methods, structured output and latent SVMs, various multi-task learning techniques, and many more.

SHOGUN is implemented in C++ and comes with interfaces to many languages.

I joined the team after the GSoC 2011 and have implemented a number of new features since then: a framework for cross-validation and model selection during the GSoC 2011, and a framework for kernel-based statistical hypothesis testing in the GSoC 2012. I also worked on migrating serialized SHOGUN objects between different versions.

NIPS paper: Optimal kernel choice for large-scale two-sample tests

NIPS 2012 is already over. Unfortunately, I could not go due to a lack of travel funding. However, as mentioned a few weeks ago, I contributed to one paper, "Optimal kernel choice for large-scale two-sample tests", which is closely related to my Master's project with Arthur Gretton and Massi Pontil. We recently set up a page for the paper where you can download my Matlab implementation of the paper's methods. Feel free to play around with it. I am currently finishing implementing most of the methods in the SHOGUN toolbox. We also have a poster, which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN; it will be included in the next release. See the base class interface here, and examples of how to use it with a single kernel (link) and with combined kernels (link). All methods mentioned in the paper are included. I also updated the SHOGUN tutorial (link).

At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel $k$ and terms
$$h_k((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i}),$$
the linear time MMD is their empirical mean,
$$\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h_k((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),$$
which is a linear time estimate of the squared distance between the mean embeddings of the distributions that the $x_i$ and $y_i$ come from. This quantity allows one to perform a two-sample test, i.e. to tell whether the underlying distributions are different.
Given a finite family of kernels $\mathcal{K}$, we select the optimal kernel by maximising the ratio of the MMD statistic to a linear time estimate of the standard deviation of the h-terms,
$$k_*=\arg \sup_{k\in\mathcal{K}}\frac{ \hat\eta_k}{\hat \sigma_k},$$
where $\hat\sigma_k^2$ is an empirical estimate of the variance $\sigma_k^2=\mathbb{E}[h_k^2]-(\mathbb{E}[h_k])^2$; both quantities can be computed in linear time and constant space. We establish consistency of this empirical ratio in the sense that
$$ \left\vert \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right\vert=O_P(m^{-\frac{1}{3}}).$$

In addition, we describe an MKL-style generalisation for selecting the weights of convex combinations of a finite number of baseline kernels,
$$\mathcal{K}:=\left\{k : k=\sum_{u=1}^d\beta_uk_u,\ \sum_{u=1}^d\beta_u\leq D,\ \beta_u\geq 0,\ \forall u\in\{1,\ldots,d\}\right\},$$
via solving the quadratic program
$$\min_\beta \left\{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1,\ \beta\succeq 0\right\},$$
where $\hat{Q}$ is the positive definite empirical covariance matrix of the $h$ terms of all pairs of kernels.

We then describe three experiments to illustrate

  • That our criterion outperforms existing methods on synthetic and real-life datasets which correspond to hard two-sample problems
  • Why multiple kernels can be an advantage in two-sample testing

See also the description of my Master’s project (link) for details on the experiments.

Download paper
Download poster
Download Matlab code (under the GPLv3)
Supplementary page

 

Streaming Features for Linear Time MMD

I finally finished an important and very cool extension to my GSoC 2012 project: making the linear time MMD statistic work with streaming data. In particular, SHOGUN's streaming framework is now used.

By design, the linear time MMD statistic, given as
\[\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),\]
where
\[h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i}),\]
is very well suited for streaming data since only four examples have to be held in memory at once. Once the sum in the h-statistic is computed, the used data can be "forgotten". As I described in my M.Sc. thesis (link), this makes it possible to process arbitrarily large amounts of data and therefore to obtain more accurate two-sample tests. This holds in particular when the amount of data needed to solve a problem is larger than computer memory.

During the GSoC, I implemented the linear time MMD on top of SHOGUN's standard features interface, which made it necessary to hold the data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference) now accepts streaming features (class reference) only. This allows arbitrarily large amounts of data to be processed in a very comfortable way. To avoid the overhead of streaming examples one by one, a block size may be specified: that many examples are processed at once, and the block size should be chosen as large as fits into memory.

Recall that the linear time MMD is normally distributed and that, when the number of samples is large enough, its variance can easily be estimated via the empirical variance of the individual h-statistics (the MMD itself being their mean). The new implementation in SHOGUN does this on the fly using D. E. Knuth's online variance algorithm [1] (implementation link). A complete two-sample test is therefore now possible in linear time and constant space.

A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A data generator for artificial data, which implements SHOGUN's streaming interface, is passed to the MMD class; it produces data from the underlying distribution on the fly.

[1] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.

GSoC 2012 is over

GSoC 2012 has been over for a few weeks now. It has been a pretty cool summer for me. As last year, I learned lots of things. This year, though, my project was a bit more research oriented, which is nice since it allowed me to connect my work for SHOGUN with the things I do at university. I even mentioned it in my Master's dissertation (link), which was also about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on the GSoC was sometimes exhausting. It eventually worked out fine since both things were closely related. I would only suggest taking on other major commitments if they are connected to the GSoC project. If that condition is met, however, the rewards multiply thanks to the synergy.

The other students working for SHOGUN also did very cool projects. All of these are included in the SHOGUN 2.0 release (link). The project now also has a new website, so it's worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year, which once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey, who organised most things and made this summer so rewarding.

In the near future, I will try to put in some extensions to the statistical testing framework that I thought of during the summer but did not have time to implement: on-line features for the linear time MMD, a framework for kernel selection which includes all methods investigated in my Master's dissertation, and finally unit tests written using SHOGUN's new framework for that. I will update the SHOGUN project page of my website (link). I might also send some tweets to SHOGUN's new twitter account (link).

Master’s dissertation: Adaptive Large-Scale Kernel Two-Sample Testing

I recently finished working on my Master's project here at UCL. It was supervised by Arthur Gretton and Massimiliano Pontil and was about kernel selection for a linear time version of the Maximum Mean Discrepancy, a kernel-based two-sample test. Download the report here.

Given sets of samples of size m from two probability distributions, a two-sample test decides whether the distributions are the same or different with some confidence level. The linear time MMD statistic is defined as

\[\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))\]
where
\[h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i})\]
and \(k\) is the reproducing kernel of an RKHS (I used the Gaussian kernel only).
A two-sample test then works as follows: if the statistic (computed on the sample data) is larger than a threshold (also computed on the data), it is likely that the two sets of samples come from different distributions.

(Null and alternative distributions of MMD statistic, red area represents type I error, blue area represents type II error)

The linear time version is highly interesting for large-scale problems since one does not have to store the data in order to compute it. Instead, both the statistic and the test threshold can be computed in an on-line way.

The work contains three main contributions:

  1. An on-line version of the already existing linear time two-sample test. More importantly, experiments showed that in some situations, the linear time test is a better choice than the current state-of-the-art quadratic time MMD method. This happens, for example, when problems are so hard that the amount of data necessary to solve them does not fit into computer memory. On the blobs dataset described in the work, a quadratic time test on the maximum processable amount of data reached a bad type II error, while the linear time version with much more data reached almost zero type II error. Another case is when simply infinite data (but finite computation time) is available: the (adaptive) linear time test reaches a lower type II error than its quadratic counterpart.

  2. A description of a criterion that can be used for kernel selection for the linear time MMD. Null and alternative distributions of the statistic have appealing properties that can be exploited in order to select the optimal kernel, in the sense that the test's type II error is minimised. The criterion is the ratio of the MMD statistic to its standard deviation; maximising it pulls the null and alternative distributions apart while minimising their variances.
    In experiments, this criterion performed better than, or at least as well as, existing methods for kernel selection. This is especially true when the length scale at which the probability distributions differ deviates from the overall length scale, as for example in the blobs dataset shown above.
    (Opt and MaxRat methods are based on the criterion and perform better than existing methods; X-Val-Type II is another newly described method; blobs dataset)
  3. An MKL-style generalisation of two-sample testing, based on finite linear combinations of baseline kernels of the form
    \[
    \mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\ \sum_{u=1}^d\beta_u\leq D,\ \beta_u\geq0,\ \forall u\in\{1,\ldots,d\}\}
    \]
    Optimal weights of these combinations can be computed by solving the convex program
    \[
    \min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}
    \]
    where \(\hat{Q}\) is a linear time estimate of the covariance of the MMD estimates and \(\hat{\eta}\) is a linear time estimate of the MMD. Whenever combined kernels capture more information relevant for distinguishing the probability distributions than a single kernel does, this method leads to better results.
    (A dataset where two dimensions provide more information than one)

    (Opt and \(L2\) use combined kernels)
    It also has an interesting feature selection interpretation, in the sense that the two-sample test provides information on which kernels (and therefore which domains) are useful for locating distinguishing characteristics of the distributions.

All of the above plots and results can be found in the report. Many results are joint work and went into the article "Optimal kernel choice for large-scale two-sample tests", which was accepted at NIPS 2012 while my report was being written. Many thanks to the other authors, in particular to Arthur Gretton, Dino Sejdinovic, Bharath Sriperumbudur, and Massimiliano Pontil, for all the fruitful discussions and guidance.

11th GSoC weekly report: Done!

This will be my last weekly report for this year's Summer of Code! Last week, I did not write a report since I was very busy with experiments for a rebuttal for the NIPS submission (see the 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs, and made a few things more efficient.

I also created graphical examples for the linear and quadratic time MMD and for HSIC-based tests. These serve the purpose of illustrating how the methods work on simple datasets. They sample the underlying statistic's null and alternative distributions using all the different methods I implemented and plot the distributions with test thresholds (as well as the data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC test uses data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.

These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All methods I implemented are now contained within the tutorial. Another documentation-related thing I did was to update the Doxygen-based source code documentation. In particular, I cleaned up the horrible mess in the CStatistics class and replaced all ASCII art with LaTeX. Although there are still things to do, my project is now in the status "done" in terms of GSoC 🙂 It was a nice summer! I guess I will be extending it with some ideas that came up while working with kernel two-sample tests recently.

For the last week, I intend to get some unit testing done and to start focusing on things that are needed for our upcoming 2.0 release (bug hunting, fixing warnings, implementing things that people request). I will also write an overall summary of the GSoC next month or so. Next month will be busy, since I also have to finish my Master's project.