I gave a talk at the EMBL-European Bioninformatic institute in Cambridge, where I visited the group of Oliver Stegle.
The topic was "Adaptive Large-Scale Kernel Two-Sample Testing". Slides can be found here behind this link.
Shogun got accepted in the Google Summer of Code 2013!
Check out our ideas page. This year, I will be a mentor rather than a student and I am very excited about this.
I'll be offering two projects:
I will blog about this here.
We released SHOGUN 2.1. See the announcement (link).
The release features my recent work on kernel selection for MMD-based kernel two-sample testing and a streaming based implementation for this. See blog-entry. We also added a new unit-testing framework, of which I am very excited since we finally get a mechanism to detect code errors. We also got yet another interface language (perl). Very cool stuff and lots of work/blood/sweat/fun with the other guys. Check it out!
Next thing to come here is a workshop on machine learning with SHOGUN on July 12 in the C-Base in Berlin. Stay tuned!
NIPS 2012 is already over. Unfortunately, I could not go due to the lack of travel funding. However, as mentioned a few weeks ago, I participated in one paper which is closely related to my Master's project with Arthur Gretton and Massi Pontil. Optimal kernel choice for large-scale two-sample tests. We recently set up a page for the paper where you can download my Matlab implementation of the paper's methods. Feel free to play around with that. I am currently finishing implementing most methods into the SHOGUN toolbox. We also have a poster which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN, it will be included in the next release. See the base class interface here. See an example to use it: single kernel (link) and combined kernels (link). All methods that are mentioned in the paper are included. I also updated the shogun tutorial (link).
At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel \(k\) and terms
\[h_k((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i}),\]
the linear time MMD is their empirical mean,
\[\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),\]
which is a linear time estimate for the squared distance of the mean embeddings of the distributions where the \(x_i, y_i\) come from. The quantity allows to perform a two-sample test, i.e. to tell whether the underlying distributions are different.
Given a finite family of kernels \(\mathcal{K}\), we select the optimal kernel via maximising the ratio of the MMD statistic by a linear time estimate of the standard deviation of the terms
\[k_*=\arg \sup_{k\in\mathcal{K}}\frac{ \hat\eta_k}{\hat \sigma_k},\]
where \(\hat\sigma_k^2\) is a linear time estimate of the variance \(\sigma_k^2=\mathbb{E}[h_k^2] - (\mathbb{E}[h_k])^2\) which can also be computed in linear time and constant space. We give a linear time and constant space empirical estimate of this ratio. We establish consistency of this empirical estimate as
\[ \left| \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right|=O_P(m^{-\frac{1}{3}}).\]
In addition, we describe a MKL style generalisation to selecting weights of convex combinations of a finite number of baseline kernels,
\[\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\},\]
via solving the quadratic program
\[\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\},\]
where \(\hat{Q}\) is the positive definite empirical covariance matrix of the \(h\) terms of all pairs of kernels.
We then describe three experiments to illustrate
See also the description of my Master's project (link) for details on the experiments.
Download paper
Download poster
Download Matlab code (under the GPLv3)
Supplementary page
I finally finished an important and very cool extension to my GSoC 2012 project - making the linear time MMD statistic work with streaming based data. In particular, SHOGUN's streaming framework is now used.
By design, the linear time MMD statistic, given as
\[\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))\]
where
\[h((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i})\]
is very well suited for streaming based data since only four examples have to be hold in memory at once. Once, the sum in the h-statistic is computed, used data can be "forgotten". As I described in my M.Sc. thesis (link), this allows to process infinite amounts of data and therefore results in possibly more accurate two-sample tests. This holds in particular in cases where the amount of data needed to solve problems is larger than computer memory.
During the GSoC, I implemented the linear time MMD on the base of SHOGUN's standard features interface, which made it necessary to hold data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference), now accepts streaming features (class reference) only. This allows to process arbitrarily large amounts of data in a very comfortable way. In order to not suffer from overhead while streaming examples one by one, a block size may be specified: this number of examples is processed at once and should be chosen as large as fits into memory.
Recall the linear time MMD's distribution is normal and its variance can easily estimated by using the empirical variance of the individual h-statistics (while the MMD is their mean) when the number of samples is large enough. The new implementation in SHOGUN does this on the fly using D. Knuth's online variance algorithm [1] (implementation link). Therefore, a complete two-sample test is now possible in linear time and constant space.
A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A data generator for artificial data which implements SHOGUN's streaming interface is passed to the MMD class. It produces data from the underlying distribution on the fly.
[1] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
This will be my last weekly report for this years summer of code! Last week, I did not write a report since I have been very busy with experiments for a rebuttal for the NIPS submission (see 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs and made made a few things more effective.
I also created graphical examples for linear and quadratic time MMD and HSIC based tests. These serve the purpose of illustrating how the methods work on simple datasets. They sample the underlying statistic's null and alternative distributions using all different methods I implemented and plot distributions with test thresholds (as well as data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC tests uses data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.
These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All methods I implemented are now contained within the tutorial. Another documentation related thing I did was to update doxygen based sourcecode documentation. In particular, I cleaned up the horrible mess in the CStatistics class -- and replaced all ascii-art by LaTeX. Although there are still things to do, my project is now in the status "done" in terms of GSoC :) It was a nice summer! I guess I will be extending it with some ideas that came up while working on with kernel two sample tests recently.
For the last week, I intend to get some unit-testing done and start to focus on things that are needed for our upcoming 2.0 release (Bug hunting, fix warnings, implement things that people request). I will also write an overall summary on the GSoC next month or so. Next month will be busy since I also have to finish my Master's project.
Sören wrote a nice summarising blog post on the GSoC 2012. See here.
Since a few weeks, GSoC 2012 is over. It has been a pretty cool summer for me. As last year, I learned lots of things. This year though, my project a bit more research oriented -- which is nice since it allowed me to connect my work for SHOGUN with the stuff I do in Uni. I even mentioned it in my Master's dissertation (link) which also was about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on the GSoC was sometimes exhausting. It eventually worked out fine since both things were closely related. I would only suggest to do other important things if they are connected to the GSoC project. However, if this condition is met, things multiply in terms of the reward you get due to synergistic effects.
The other students working for SHOGUN also did very cool projects. All these are included in the SHOGUN 2.0 release (link). The project now also has a new website so its worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year. This once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey who organised most things and made this summer so rewarding.
In the near future I will try to put in some extensions to the statistical testing framework that I though of during the summer but did not have time to implement: On-line features for the linear time MMD, a framework for kernel selection which includes all investigated methods from my Master's dissertation, and finally write unit-tests using SHOGUN's new framework for that. I will update the SHOGUN project page of my website (link). I might as well send some tweets to SHOGUN's new twitter account (link).
Subscribe Blog Feed