GSoC 2012

gsoc

To read my blog about my participation in GSoC 2012, click here.

I participated in GSoC 2012 for SHOGUN! The project I worked on was closely related to my Master’s project at UCL: kernel-based statistical tests. My host was Arthur Gretton, a lecturer at the Gatsby Computational Neuroscience Unit, part of the Centre for Computational Statistics and Machine Learning at UCL, whom I met there during my studies.
Abstract: Statistical tests for dependence or difference are an important tool in data analysis. However, when data is high-dimensional or in non-numerical form (strings, graphs), classical methods fail. This project implements recently developed kernel-based generalizations of statistical tests, which overcome this issue. The kernel two-sample test based on the Maximum Mean Discrepancy (MMD) tests whether two sets of samples are from the same or from different distributions. Related to the kernel two-sample test is the Hilbert-Schmidt Independence Criterion (HSIC), which tests for statistical dependence between two sets of samples. Multiple tests based on the MMD and the HSIC are implemented along with a general framework for statistical tests in SHOGUN.
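To make the abstract a bit more concrete, here is a minimal NumPy sketch (not SHOGUN’s implementation) of how an MMD-based two-sample test can work: estimate the squared MMD under a Gaussian kernel, then get a p-value by permuting the pooled samples. The bandwidth `sigma` and the permutation count are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # pairwise Gaussian kernel values between rows of X and rows of Y
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # biased estimate of the squared MMD with a Gaussian kernel:
    # mean(k(x,x')) + mean(k(y,y')) - 2 * mean(k(x,y))
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, n_perm=200, sigma=1.0, seed=0):
    # approximate the null distribution by randomly re-splitting
    # the pooled samples, then compare the observed statistic to it
    rng = np.random.default_rng(seed)
    stat = mmd2_biased(X, Y, sigma)
    Z = np.vstack([X, Y])
    n = len(X)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(Z))
        null[i] = mmd2_biased(Z[idx[:n]], Z[idx[n:]], sigma)
    p_value = np.mean(null >= stat)
    return stat, p_value
```

For two well-separated sample sets the statistic is clearly positive and the permutation p-value small; for identical inputs the biased estimate is exactly zero.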

My proposal can be found here. SHOGUN got 8 student slots, compared to 5 in 2011, so this summer was a major boost for SHOGUN development. Check out the other students’ cool projects here.

 

Accepted!

gsoc

Yeah! I just got accepted into GSoC 2012 for SHOGUN! The project I will work on this year is closely related to my Master’s project at UCL: kernel-based statistical tests. My host is Arthur Gretton, a lecturer at the Gatsby Computational Neuroscience Unit, part of the Centre for Computational Statistics and Machine Learning at UCL, whom I met there during my studies.
Abstract: Statistical tests for dependence or difference are an important tool in data analysis. However, when data is high-dimensional or in non-numerical form (strings, graphs), classical methods fail. This project implements recently developed kernel-based generalizations of statistical tests, which overcome this issue. The kernel two-sample test based on the Maximum Mean Discrepancy (MMD) tests whether two sets of samples are from the same or from different distributions. Related to the kernel two-sample test is the Hilbert-Schmidt Independence Criterion (HSIC), which tests for statistical dependence between two sets of samples. Multiple tests based on the MMD and the HSIC are implemented along with a general framework for statistical tests in SHOGUN.

My proposal can be found here. I am really looking forward to this. This year, SHOGUN got 8 student slots, compared to 5 last year, so this summer will probably bring a major boost in SHOGUN development. Check out the other students’ cool projects here.

 

Bachelor’s dissertation: Adaptive Kernel Methods for Sequence Classification in Bioinformatics

I wrote my Bachelor thesis in 2009/2010 in Duisburg, supervised by Prof. Hoffmann (Bioinformatics, Essen) and Prof. Pauli (Intelligent Systems, Duisburg).

The work is about classification of amino acid sequences using SVMs and string kernels. In particular, I compared the Distant Segments kernel by Sébastien Boisvert to a standard Spectrum kernel on an HIV dataset. In addition, I described a bisection-based method to search for the soft-margin parameter of the underlying SVM, which outperformed a standard grid search.
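The thesis’ exact search scheme isn’t reproduced here, but the basic idea of repeatedly halving a search interval on a log scale, rather than evaluating a whole grid, can be sketched as follows. This assumes the validation score is roughly unimodal in log C; `score` is a hypothetical, user-supplied validation-accuracy function.

```python
def interval_search(score, log_lo, log_hi, tol=0.1):
    """Interval-halving search for the SVM soft-margin parameter C
    on a log10 scale, assuming score(C) is roughly unimodal.
    Each step drops a third of the interval, needing far fewer
    evaluations of `score` than a dense grid."""
    while log_hi - log_lo > tol:
        m1 = log_lo + (log_hi - log_lo) / 3
        m2 = log_hi - (log_hi - log_lo) / 3
        if score(10 ** m1) < score(10 ** m2):
            log_lo = m1  # maximum lies to the right of m1
        else:
            log_hi = m2  # maximum lies to the left of m2
    return 10 ** ((log_lo + log_hi) / 2)
```

In practice `score` would train an SVM with the given C and return cross-validated accuracy; here any unimodal function works for illustration.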

download

GSoC 2011

I participated in GSoC 2011 for the SHOGUN machine learning toolbox (link). This was awesome! The program brings together students (like me) and open-source organisations: you get paid to work full-time on a project you choose. I could really use the money, learned lots and lots of cool things, and met nice people.

My project was mentored by Soeren Sonnenburg and had the title “Built a flexible cross-validation framework into shogun”. Here is the abstract:
Nearly every learning machine has parameters which have to be determined manually. SHOGUN currently lacks a model selection framework, so the goal of this project is to extend SHOGUN to make cross-validation possible. Different strategies for splitting up the training data should be available and easy to exchange, and various model selection schemes are integrated (train/validation/test split, n-fold cross-validation, leave-one-out cross-validation, etc.).
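As a rough illustration of the splitting strategies mentioned in the abstract (this is not SHOGUN’s actual API), n-fold cross-validation indices can be generated like this:

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    # shuffle all indices once, split them into n_folds disjoint
    # validation sets, and pair each validation set with the
    # remaining indices as its training set
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val
```

Setting `n_folds` equal to `n_samples` recovers leave-one-out cross-validation as a special case.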


The proposal I wrote can be found here. My motivation for the project came from the fact that I had actually used SHOGUN for my Bachelor thesis (link), where I had to do model selection by hand. A major portion of the programming work I did back then would not have been necessary had model selection already been part of SHOGUN. Nowadays, quite a few people use the code I wrote during the summer of 2011.