Second GSoC weekly report

Last week, I was again very busy with exams and doing experiments for a NIPS submission.

The latter is somewhat related to my GSoC project, and I will implement it once the other work is done:
We developed a method for selecting the optimal coefficients of a non-negative combination of kernels for the linear-time (i.e. large-scale) MMD two-sample test. The criterion being optimised is the ratio of the linear-time MMD \(\eta_k\) to its standard deviation \(\sigma_k\), i.e.
\[ k_* = \arg\sup_{k\in\mathcal{K}} \eta_k \sigma_k^{-1}. \]
This is equivalent to solving the quadratic program
\[
\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}
\]
where the combination of kernels is given by
\[
\mathcal{K} := \left\{ k : k = \sum_{u=1}^d \beta_u k_u, \; \sum_{u=1}^d \beta_u \leq D, \; \beta_u \geq 0 \;\; \forall u \in \{1,\ldots,d\} \right\}
\]
Here, \(\hat{Q}\) is a linear-time estimate of the covariance of the MMD estimates, and \(\hat{\eta}\) is a linear-time estimate of the MMD using the above kernel combinations.
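To make the optimisation concrete, below is a minimal Python sketch of the quadratic program using numpy and cvxopt. This is not the SHOGUN implementation: `select_kernel_weights`, `Q_hat` and `eta_hat` are hypothetical names for precomputed linear-time estimates, and the final rescaling of the solution to a convex combination is a simplification.

```python
import numpy as np
from cvxopt import matrix, solvers

def select_kernel_weights(Q_hat, eta_hat):
    """Hypothetical sketch: solve min beta^T Q_hat beta subject to
    beta^T eta_hat = 1 and beta >= 0 with cvxopt's QP solver.
    Q_hat: (d, d) covariance estimate, eta_hat: (d,) MMD estimates."""
    d = len(eta_hat)
    P = matrix(2.0 * Q_hat)            # cvxopt minimises (1/2) x^T P x + q^T x
    q = matrix(np.zeros(d))
    G = matrix(-np.eye(d))             # -beta <= 0  <=>  beta >= 0
    h = matrix(np.zeros(d))
    A = matrix(eta_hat.reshape(1, d))  # equality constraint beta^T eta_hat = 1
    b = matrix(np.ones(1))
    beta = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    return beta / beta.sum()           # rescale to a convex combination
```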

Apart from that, I implemented a method from [1] to approximate the null distribution of the quadratic-time MMD, which is based on the eigenspectrum of the kernel matrix of the merged samples from the two distributions. It still needs to be compared against the MATLAB implementation. It comes with some minor helper functions for matrix algebra.
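For illustration only (details like the eigenvalue scaling should be checked against [1] and the MATLAB code), here is a rough numpy sketch of the idea, assuming the null distribution of \(m\,\mathrm{MMD}^2\) is approximated by \(\sum_l \lambda_l (z_l^2 - 2)\) with \(z_l \sim \mathcal{N}(0,2)\) and the \(\lambda_l\) estimated from the eigenvalues of the centred kernel matrix of the merged samples:

```python
import numpy as np

def sample_null_spectrum(K, num_null_samples=250, num_eigenvalues=10):
    """Hypothetical sketch of the spectral null-distribution approximation.
    K: kernel matrix of the merged samples from both distributions.
    Returns draws approximating the null distribution of m * MMD^2 under the
    assumed limiting form sum_l lambda_l * (z_l^2 - 2), z_l ~ N(0, 2)."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)       # centering matrix
    eigvals = np.linalg.eigvalsh(H @ K @ H)        # ascending order
    lambdas = eigvals[::-1][:num_eigenvalues] / n  # largest eigenvalues, scaled
    z = np.sqrt(2.0) * np.random.randn(num_null_samples, num_eigenvalues)
    return (lambdas * (z ** 2 - 2.0)).sum(axis=1)
```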

This week, I will finally have my last exam and then continue working on the advanced methods for computing test thresholds.

[1]: Gretton, A., Fukumizu, K., Harchaoui, Z., & Sriperumbudur, B. K. (2009). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems.

First GSoC weekly report

I am currently quite busy with my exams; however, the last three will be over soon. I still managed to do initial sketches for the statistical testing framework, along with helping to solve problems that occurred because of the massive changes that are currently happening to SHOGUN’s label and multi-class system.

Here you can find a UML diagram of the class structure so far. I implemented the first simple kernel two-sample tests: the ones based on the linear-time and the quadratic-time MMD statistic. For computing a p-value, these two may approximate their null distribution using a (brute-force) bootstrapping approach based on shuffling the data of the two underlying distributions and then computing the statistic multiple times; see the sketch below. The bootstrapping code will work for any two-sample test.
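As a sketch of this idea (not the actual SHOGUN class), here is a small numpy example of permutation-based null sampling that works for an arbitrary two-sample statistic; `statistic` is a hypothetical callable, and the number of permutations is an arbitrary choice:

```python
import numpy as np

def bootstrap_null(X, Y, statistic, num_permutations=250, seed=None):
    """Sample the null distribution of any two-sample statistic by merging
    the samples of both distributions, shuffling, splitting again, and
    recomputing statistic(X, Y) -> float each time."""
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])          # merged samples of both distributions
    m = len(X)
    null_samples = np.empty(num_permutations)
    for i in range(num_permutations):
        perm = rng.permutation(len(Z))  # shuffle the merged data
        null_samples[i] = statistic(Z[perm[:m]], Z[perm[m:]])
    return null_samples

# p-value: fraction of null samples at least as large as the observed statistic,
# e.g. p = np.mean(bootstrap_null(X, Y, mmd) >= mmd(X, Y)) for a hypothetical mmd().
```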

Next steps: advanced methods for estimating null distributions for the MMD tests.

I also worked with Arthur (my mentor) on a version of the MMD that is related to my Master’s project: a convex combination of (arbitrary) kernels for the linear-time MMD, where the optimal weights are learned by solving a quadratic program. I might implement that in SHOGUN as well. (Who can help me with interfacing SHOGUN’s QP solver?)

GSoC 2012


To read my blog posts about my participation in the GSoC 2012, click here.

I participated in the GSoC 2012 for SHOGUN! The project I worked on was closely related to my Master’s project at UCL. It is about kernel-based statistical tests. My host is Arthur Gretton, a lecturer at the Gatsby Computational Neuroscience Unit, part of the Centre for Computational Statistics and Machine Learning at UCL, whom I met there during my studies.
Abstract: Statistical tests for dependence or difference are an important tool in data analysis. However, when data is high-dimensional or in non-numerical form (strings, graphs), classical methods fail. This project implements recently developed kernel-based generalizations of statistical tests, which overcome this issue. The kernel two-sample test based on the Maximum Mean Discrepancy (MMD) tests whether two sets of samples are from the same or from different distributions. Related to the kernel two-sample test is the Hilbert-Schmidt Independence Criterion (HSIC), which tests for statistical dependence between two sets of samples. Multiple tests based on the MMD and the HSIC are implemented, along with a general framework for statistical tests in SHOGUN.

My proposal can be found here. SHOGUN got 8 student slots, compared to 5 in 2011, so this summer was a major boost for SHOGUN development. Check out the other students’ cool projects here.


Accepted!


Yeah! I just got accepted into the GSoC 2012 for SHOGUN! The project I will work on this year is closely related to my Master’s project at UCL. It is about kernel-based statistical tests. My host is Arthur Gretton, a lecturer at the Gatsby Computational Neuroscience Unit, part of the Centre for Computational Statistics and Machine Learning at UCL, whom I met there during my studies.
Abstract: Statistical tests for dependence or difference are an important tool in data analysis. However, when data is high-dimensional or in non-numerical form (strings, graphs), classical methods fail. This project implements recently developed kernel-based generalizations of statistical tests, which overcome this issue. The kernel two-sample test based on the Maximum Mean Discrepancy (MMD) tests whether two sets of samples are from the same or from different distributions. Related to the kernel two-sample test is the Hilbert-Schmidt Independence Criterion (HSIC), which tests for statistical dependence between two sets of samples. Multiple tests based on the MMD and the HSIC are implemented, along with a general framework for statistical tests in SHOGUN.

My proposal can be found here. I am very much looking forward to this. This year, SHOGUN got 8 student slots, compared to 5 last year, so this summer will probably bring a major boost in SHOGUN development. Check out the other students’ cool projects here.


GSoC 2011

I participated in the GSoC 2011 for the SHOGUN machine learning toolbox (link). This was awesome! The program brings together students (like me) and open-source organisations. You get paid to work full-time on a project of your choice. I could really use the money, learned lots of cool things, and met nice people.

My project was mentored by Soeren Sonnenburg and had the title “Built a flexible cross-validation framework into shogun”. Here is the abstract:
Nearly every learning machine has parameters which have to be determined manually. Shogun currently lacks a model selection framework. Therefore, the goal of this project is to extend Shogun to make cross-validation possible. Different strategies for splitting up the training data should be available and easy to exchange. Various model selection schemes are integrated (train/validation/test split, n-fold cross-validation, leave-one-out cross-validation, etc.).


The proposal I wrote can be found here. My motivation for the project came from the fact that I actually used SHOGUN for my Bachelor thesis (link). Back then, I had to do model selection by hand. A major portion of the programming work I did would not have been necessary if model selection had already been part of SHOGUN. Nowadays, quite a few people use the code I wrote during the summer of 2011.