# NIPS paper: Optimal kernel choice for large-scale two-sample tests

NIPS 2012 is already over. Unfortunately, I could not go due to the lack of travel funding. However, as mentioned a few weeks ago, I participated in one paper which is closely related to my Master’s project with Arthur Gretton and Massi Pontil. Optimal kernel choice for large-scale two-sample tests. We recently set up a page for the paper where you can download my Matlab implementation of the paper’s methods. Feel free to play around with that. I am currently finishing implementing most methods into the SHOGUN toolbox. We also have a poster which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN, it will be included in the next release. See the base class interface here. See an example to use it: single kernel (link) and combined kernels (link). All methods that are mentioned in the paper are included. I also updated the shogun tutorial (link).

At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel $k$ and terms
$$h_k((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i}),$$
the linear time MMD is their empirical mean,
$$\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),$$
which is a linear time estimate for the squared distance of the mean embeddings of the distributions where the $x_i, y_i$ come from. The quantity allows to perform a two-sample test, i.e. to tell whether the underlying distributions are different.
Given a finite family of kernels $\mathcal{K}$, we select the optimal kernel via maximising the ratio of the MMD statistic by a linear time estimate of the standard deviation of the terms
$$k_*=\arg \sup_{k\in\mathcal{K}}\frac{ \hat\eta_k}{\hat \sigma_k},$$
where $\hat\sigma_k^2$ is a linear time estimate of the variance $\sigma_k^2=\mathbb{E}[h_k^2] – (\mathbb{E}[h_k])^2$ which can also be computed in linear time and constant space. We give a linear time and constant space empirical estimate of this ratio. We establish consistency of this empirical estimate as
$$\left\vert \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right\vert=O_P(m^{-\frac{1}{3}}).$$

In addition, we describe a MKL style generalisation to selecting weights of convex combinations of a finite number of baseline kernels,
$$\mathcal{K}:={k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq 0, \forall u\in{1,…,d}},$$
$$\min_\beta { \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq 0},$$
where $\hat{Q}$ is the positive definite empirical covariance matrix of the $h$ terms of all pairs of kernels.

We then describe three experiments to illustrate

• That our criterion outperforms existing methods on synthetic and real-life datasets which correspond to hard two-sample problems
• Why multiple kernels can be an advantage in two-sample testing

Supplementary page

# Streaming Features for Linear Time MMD

I finally finished an important and very cool extension to my GSoC 2012 project – making the linear time MMD statistic work with streaming based data. In particular, SHOGUN’s streaming framework is now used.

By design, the linear time MMD statistic, given as
[text{MMD}_l^2=frac{1}{m}sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))]
where
[h((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i})]
is very well suited for streaming based data since only four examples have to be hold in memory at once. Once, the sum in the h-statistic is computed, used data can be “forgotten”. As I described in my M.Sc. thesis (link), this allows to process infinite amounts of data and therefore results in possibly more accurate two-sample tests. This holds in particular in cases where the amount of data needed to solve problems is larger than computer memory.

During the GSoC, I implemented the linear time MMD on the base of SHOGUN’s standard features interface, which made it necessary to hold data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference), now accepts streaming features (class reference) only. This allows to process arbitrarily large amounts of data in a very comfortable way. In order to not suffer from overhead while streaming examples one by one, a block size may be specified: this number of examples is processed at once and should be chosen as large as fits into memory.

Recall the linear time MMD’s distribution is normal and its variance can easily estimated by using the empirical variance of the individual h-statistics (while the MMD is their mean) when the number of samples is large enough. The new implementation in SHOGUN does this on the fly using D. Knuth’s online variance algorithm [1] (implementation link). Therefore, a complete two-sample test is now possible in linear time and constant space.

A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A data generator for artificial data which implements SHOGUN’s streaming interface is passed to the MMD class. It produces data from the underlying distribution on the fly.

[1] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.

# GSoC 2012 is over

Since a few weeks, GSoC 2012 is over. It has been a pretty cool summer for me. As last year, I learned lots of things. This year though, my project a bit more research oriented — which is nice since it allowed me to connect my work for SHOGUN with the stuff I do in Uni. I even mentioned it in my Master’s dissertation (link) which also was about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on the GSoC was sometimes exhausting. It eventually worked out fine since both things were closely related. I would only suggest to do other important things if they are connected to the GSoC project. However, if this condition is met, things multiply in terms of the reward you get due to synergistic effects.

The other students working for SHOGUN also did very cool projects. All these are included in the SHOGUN 2.0 release (link). The project now also has a new website so its worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year. This once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey who organised most things and made this summer so rewarding.

In the near future I will try to put in some extensions to the statistical testing framework that I though of during the summer but did not have time to implement: On-line features for the linear time MMD, a framework for kernel selection which includes all investigated methods from my Master’s dissertation, and finally write unit-tests using SHOGUN’s new framework for that. I will update the SHOGUN project page of my website (link). I might as well send some tweets to SHOGUN’s new twitter account (link).

# Master’s dissertation: Adaptive Large-Scale Kernel Two-Sample Testing

I recently finished working on my Master project here at UCL. It was supervised by Arthur Gretton and Massimiliano Pontil and was about kernel selection for a linear version of the Maximum Mean Discrepancy, a kernel based two-sample test. Download report here.

Given sets of samples of size m from two probability distributions, a two-sample test decides whether the distributions are the same or different with some confidence level. The linear time MMD statistic is defined as

$\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i})$
and $k$ is an RKHS reproducing kernel (I used the Gaussian kernel only).
A two sample test works simply as this: if this statistic (computed on sample data) is larger than a computed threshold (also on data), it is likely that the two sets of samples are from different distributions.

(Null and alternative distributions of MMD statistic, red area represents type I error, blue area represents type II error)

The linear time version is highly interesting for large-scale problems since one does not have to store data in order to compute it. Instead, it is possible to compute statistic and threshold in an on-line way.

The work contains three main contributions:

1. An on-line version for the already existing linear time two-sample test. More important, it was shown in experiments that in some situations, the linear time test is a better choice than the current quadratic time MMD state-of-the-art method. This for example happens when problems are so hard that the amount of data necessary to solve them does not fit into computer memory. On the blobs dataset described in the work,a quadratic time test on the maximum processable amount of data reached a bad type II error while with the linear time version and much more data, almost zero type II error could be reached. Another case is when simply infinite data (but finite computation time) is available: the (adaptive) linear time test reaches lower type II error that its quadratic counterpart.

2. A description of a criterion that can be used for kernel selection for the linear time MMD. Null and alternative distribution of the statistic have appealing properties that can be exploited in order to select the optimal kernel in the sense that a test’s type II error is minimised. The criterion is the ratio of MMD statistic and its standard deviation. This pulls null and alternative distribution apart while minimising their variance.
In experiments, this criterion performed better or at least equal than existing methods for kernel selection. This is especially true when the length scale at which probability distributions differ is different to the overall length scale, as for example in the above shown blobs dataset.
(Opt and MaxRat methods are based on the criterion and perform better than existing methods, X-Val-Type II is another newly described method, blobs dataset)
3. A MKL-style generalisation of two-sample testing on the base of finite linear combinations of baseline kernels of the form
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,…,d\}\}$
Optimal weights of these combinations can be computed via solving the convex program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where $\hat{Q}$ is a linear time estimate of the covariance of the MMD estimates and $\hat{\eta}$ is a linear time estimate of the MMD.Whenever combined kernels may capture more information relevant distinguishing probability distributions than one kernel, this method leads to better results.

(Opt and $L2$ use combined kernels)
It also has an interesting feature selection interpretation in the sense that the two-sample test provides information on which kernels (and therefore domains) are useful for locating distinguishing characteristics of distributions.

All above plots and results can be found in the report. Many results are joint work and went to the article “Optimal kernel choice for large-scale two-sample tests”, which was accepted at NIPS 2012 while my report was written. Many thanks to the other authors, in particular to Arthur Gretton, Dino Sejdinovic, Bharath Sriperumbudur, and Massimiliano Pontil for all the fruitful discussions and guidance.

# 11th GSoC weekly report: Done!

This will be my last weekly report for this years summer of code! Last week, I did not write a report since I have been very busy with experiments for a rebuttal for the NIPS submission (see 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs and made made a few things more effective.

I also created graphical examples for linear and quadratic time MMD and HSIC based tests. These serve the purpose of illustrating how the methods work on simple datasets. They sample the underlying statistic’s null and alternative distributions using all different methods I implemented and plot distributions with test thresholds (as well as data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC tests uses data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.

These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All methods I implemented are now contained within the tutorial. Another documentation related thing I did was to update doxygen based sourcecode documentation. In particular, I cleaned up the horrible mess in the CStatistics class — and replaced all ascii-art by LaTeX. Although there are still things to do, my project is now in the status “done” in terms of GSoC 🙂 It was a nice summer! I guess I will be extending it with some ideas that came up while working on with kernel two sample tests recently.

For the last week, I intend to get some unit-testing done and start to focus on things that are needed for our upcoming 2.0 release (Bug hunting, fix warnings, implement things that people request). I will also write an overall summary on the GSoC next month or so. Next month will be busy since I also have to finish my Master’s project.

# 10th GSoC weekly report: Slowly getting ready

Step by step, my project enters a final state 🙂
Last week, I added new data generation methods, which are used from a new example for independence tests with HSIC. It demonstrates that the HSIC based test is able to capture dependence which is induced by rotating data that has zero correlation — one of the problems from the paper [1]. Here is a picture; the question is: are the two dimensions dependent? Or moreover, is a test able to capture that? (correlation is almost zero, dependence is induced via rotation)

I also realised that my current class structure had problems doing bootstrapping for HSIC, so I re-factored a bit. Bootstrapping is now also available for HISC using the same code that does it for two-sample-tests. I also removed some redundancy — both independence and two-sample tests are very similar problems and implementations should share code where possible.

Another thing that was missing so far is to compute test thresholds; so far, only p-values could be computed. Since different people have different tastes about this, I added both methods. Checking a test statistic against a threshold is straight-forward and gives a binary answer; computing a p-value gives the position of the test statistic in the null-distribution — this contains more information. To compute thresholds, one needs the inverse CDF function for the null-distribution. In the bootstrapping case, it is easy since simply the sample that corresponds to a certain quantile has to be reported. For cases where a normal- or gamma-distribution was fitted, I imported some more routines from the nice ALGLIB toolbox.

For this week, I plan to continue with finishing touches, documentation, examples/tests, etc. Another idea I had is to make the linear time MMD test work with SHOGUN’s streaming features, since the infinite or streaming data case is the main area for its usage.

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence. Advances in Neural Information Processing Systems

# 9th GSoC weekly report: Bugs again! and documentation

I spend quite some fraction of last week on something which is not really related my project: trying to make cross-validation possible for multi-class MKL (multiple kernel learning) machines using my framework from last year’s GSoC. To this end, I added subset support to SHOGUN’s combined features class; and then went for a bunch of bugs that prevented it from working. But it now does! So cross-validation should now be possible within a lot more situations. Thanks to Eric who reported all the problems.

Apart from that, I worked on documentation for the new statistical testing framework. I added doxygen class descriptions, see for example CQuadraticTimeMMD. More important, I started writing a section for the SHOGUN tutorial, a book-like description of all algorithms. We hope that it will grow in the future. You can find the $\LaTeX$ sources at github. We should/will add a live pdf download soon.

Another minor thing I implemented is a data generator class. I think it is nice to illustrate new algorithms with data that is not fixed (aka load a file). The nice thing about this is that it is available for examples from all interfaces — so far I implemented this separately for c++ and python; this is more elegant now. I bet some of the others projects will need similar methods for their demos too; so please extend the class!

This week, I will add more data generation methods to the generator, in particular data that can be used to illustrate the recently implemented HSIC test for independence. Reference datasets are quite complicated, so this might take a while. Another thing we recently changed is a new framework for unit-tests, so I will write these for all new methods I created recently.

# 8th GSoC weekly report: Examples, Bugs, and Kernel Choice

Last week was a mixed one. Next to new examples, tests, bugfixes, and helper methods, the biggest implementation is an automatic kernel selection algorithm for the linear time MMD. This is one thing that I worked on during my Master project at UCL.
It selects optimal optimal kernel weights for kernel of the family
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,…,d\}\}$
by solving the convex program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where $\hat{Q}$ is a linear time estimate of the covariance of the MMD estimates and $\hat{\eta}$ is a linear time estimate of the MMD.

I already described this a few weeks ago, when the method was developed. It is now integrated into SHOGUN. Efficient kernel selection, yeah 🙂 It uses a convex solver called libqp, which is by Vojtech Franc, one of the mentors of this year’s GSoC. Still, I need to think of a nice way of embedding it into SHOGUN’s model selection framework, which isn’t as straight-forward as it first seems.

This week, bug-hunting continues with a bug that gives wrong results during cross-validation on multi-class machines. Afterwards, I will try to polish my code so far a bit, especially documentation (and tutorial); and continue on more examples/demo for the new framework for statistical testing.

# 7th GSoC weekly report: Hilbert Schmidt Independence Criterion

Finally, I started on kernel based (in)dependence tests last week. These are tests that try to find out whether for two random variables $\textbf{x},\textbf{y}$ are independent, i.e. whether their joint distribution factorises into the individual ones. The null hypothesis (that may be rejected) is $H_0:P_\textbf{x}P_\textbf{y}=P_{\textbf{x}\textbf{y}}$

These kind of tests basically work like two-sample tests: Given one set of samples from each random variable
$Z=(X,Y)=\{(x_1,y_1,…,(x_m,y_m)\}$
a test statistic is computed and then compared against the distribution of the statistic under the null-hypothesis. If the position is in an upper part of it, the null-hypothesis is rejected since it is unlikely that the current value was generated by it.

The class of independence tests I will implement for my project are all based on the Hilbert Schmidt independence criterion (HSIC), which takes out the above procedure to an reproducing kernel Hilbert space (RKHS). The (biased version of the) HSIC statistic itself is simply given by
$\text{HSIC}_b(Z)=\frac{1}{m^2}\text{trace}(KHLH)$
where $K,L$ are kernel matrices of the input samples $X,Y$ in some RKHS and $H=I-\frac{1}{m}\textbf{1}\textbf{1}^T$ is a centring matrix.

I integrated a general modular framework for independence tests into SHOGUN. The HSIC class is the first kernel-independence test that works. Interfaces are very similar to the two-sample test, however, they are not quite the same for various reasons. That’s why there is another class for independence testing next to the one for two-sample testing.

As for the two-sample tests, the null-distribution may simply be approximated by bootstrapping, i.e. merging the samples and computing the statistic for many times. This is now possible for any independence test. Another method to approximate the null-distribution for HSIC is fitting a Gamma distribution [1] as

$m\text{HSIC}=\frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)}$ where
$\alpha=\frac{(\textbf{E}(\text{HSIC}_b(Z)))^2}{\text{var}(\text{HSIC}_b(Z))} \quad \text{and}\quad\beta=\frac{m\text{var}(\text{HSIC}_b(Z))}{\textbf{E}(\text{HSIC}_b(Z))}$

It’s also already implemented! There are already modular interfaces for the new classes and some simple tests. I will extend these during this weak. Time passes fast: The mid-term-evaluation is this week already. I pretty much enjoyed the first half 🙂

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence.

# 6th GSoC weekly report: First modular examples and other stuff

Last week’s changes were all rather subtle:

• I created some first modular examples in python,
• fixed this big bug in the model selection trees I talked about last week (nasty!),
• added some convenience methods for the two-sample-test constructors (there is now a new method in CFeatures to append feature objects)
• and corrected a bunch of bugs on the fly.

This week, I will do some more work on the examples and then start working on independence testing.