Great news: Shogun just got accepted to GSoC 2016. After our break year in 2015, we are extremely excited to continue our GSoC tradition, which started in 2011 (when I first joined Shogun).

If you are a student and wish to spend the summer hacking Machine Learning, guided by a vibrant international community of academics, professionals, and NERDS, then pay us a visit. Oh, and you will receive a cheque for $5000 from Google.

This year, we focus on framework improvements rather than solely adding new algorithms. Consequently, most projects have a heavy focus on packaging and software engineering questions. But there will be Machine Learning too.

We are aiming high! Check out our ideas list and read how to get involved.

Shogun 4.0 and GSoC 2014 follow up

No, this is not about Fernando’s and my honeymoon …

The Shogun team just released version 4.0 of their community driven Machine Learning toolbox. This release most of all features the work of our 8 Google Summer of Code 2014 students, so this blog post is dedicated to them: you guys rock. This also brings an end to yet another very active year of Shogun: we organised a second workshop in Berlin, and I presented Shogun to the public in London, New York and Berlin.

For the 4th time, Shogun participated in Google’s wonderful program, which more than anything boosted the team’s size and motivation. What else makes people spend sleepless nights hunting bugs for the sake of Machine Learning for everyone?

This year was the first time that I organised our participation. This ranged from writing the application at the last second and harassing potential mentors until they said ‘yes’, to making up overly ambitious projects to scare students away and ending up mentoring too many students on my own. Jokes aside, this was a very challenging (in particular time-wise) but also a very rewarding experience that definitely sharpened my project organisation skills. As in the previous year, I tried to fuse my scientific life and Shogun’s GSoC participation: kernel methods and variational learning are things I touch on a daily basis at Gatsby.
I approached many mentors after meeting them at scientific Machine Learning conferences, and having been exposed to ML for some years now, I also find it easier to help students implement and write about popular ML algorithms. Here is a list. (Note that all projects come with really nice IPython notebooks, something that we continued to insist on from last year.)

Fundamental ML algorithms by Parijat Mazumdar (parijat). Mentor: Fernando

Shogun needs more standard ML algorithms. Parijat implemented some of those: random forests, kernel density estimation and more. Parijat’s code quality is amazing, and together with Fernando’s superb mentoring skills (his first time mentoring), this project is likely to prove very sustainable. Notebook random forest, notebook KDE.

Kernel testing and feature selection by Rahul De (lambday). Mentor: Dino Sejdinovic, Heiko

Previous year’s student lambday continued to rock. First, he massively extended my 2012 project on kernel hypothesis testing to Big Data land. Dino, who was one of the invited speakers at the Shogun workshop last summer, and I are actually working on a journal article where we will use this code. Second, he extended the framework to perform feature selection via dependence measures. Third, he initiated and guided development of a framework for unifying Shogun’s linear algebra operations. This can be used, for example, to switch existing algorithms from CPU to GPU with a compile switch, which is also useful for our deep learning project.

Variational Inference for Gaussian Processes by Wu Lin (yorkerlin). Mentor: Heiko, Emtiyaz Khan

In our third GSoC project on GPs, Wu took a couple of state-of-the-art approximate variational inference methods developed by Emtiyaz, and put them into Shogun’s framework. The result of this very involved and technical project is that we now have large-scale classification using GPs. Emtiyaz also was a speaker at our workshop. Notebook

Shogun missionary by Saurabh. Mentor: Heiko

The idea of this project was to showcase Shogun’s abilities, something we definitely need to work on. Saurabh wrote a couple of notebooks that are essentially ML tutorials using Shogun. If you want to know about ML basics, regression, classification, model-selection, SVMs, multiclass, or multiple kernel learning, this is for you. He also extended our web-demo framework to include, for example, model-selection for GPs.

OpenCV integration by Kislay. Mentor: Kevin

Kislay, after writing a very cool notebook on PCA for his application, wrote data-structures to bridge between Shogun and OpenCV. The project was supervised by Kevin, who is also one of our former GSoC students. This makes it possible to use the two libraries together in a neat way.

Deep learning by Khaled Nasr. Mentor: Theofanis, Sergey

The hype is on! After NIPS, Facebook, and Google DeepMind, Shogun now also joined 😉 Khaled did a very good job in coding up the standard architectures, and was involved in generalising Shogun’s linear algebra on the fly with lambday. This is a project that is likely to have a second part. Check his superb notebooks on deep belief neural networks, convolutional networks, autoencoders, and restricted Boltzmann machines.

SO Learning with Approximate Inference by Jiaolong. Mentor: hushell, Thoralf

This was another project that was (co-)mentored by a former GSoC student. With the help of his mentors, Jiaolong implemented various approximate inference methods for structured output (SO) models. Check out his notebook.

Large-Scale Multi-Label Classification by Abinash. Mentor: Thoralf

Another project involving our structured output expert Thoralf as mentor. Abinash implemented large-scale multilabel learning, beating scikit-learn’s implementation both in runtime and accuracy. The last experiment is described in this notebook.

Finally, we sent two of our delegates (Thoralf and Fernando) to the 10-year jubilee mentor summit in California in late October.
Really cool: I got lucky and won Google’s lottery for some extra places, so I could also join. The summit once again was overly colourful, bursting with creative minds who have the most diverse set of opinions and approaches, but who are all united by their excitement about open-source. The beauty of this community to me really lies in the people who do work purely driven by their interest in *the thing itself*, independent of competitive and in particular commercial interests, sometimes almost to an extent that is beyond any form of compromise.

A wonderful illustration of this came during the mentor summit reception in the Tech museum in San Jose. Google’s speaker and head of finance Patrick Pichette (disclaimer: not sure, don’t quote me on this), who is the boss of Chris DiBona, who himself organises the GSoC, sought to inspire the audience to “think BIG” and to “change the lives of GSoC students”. Ten minutes later, guest speaker Linus Torvalds contemplated that he could not be a GSoC mentor as he would scare people away, and that the best way to get involved in open-source is to “start small”, a sentence after which P.P. left the room. Funnily enough, in GSoC this community is then hugged by a super capitalistic American internet company, and gladly lets it happen: we all love GSoC, and Shogun certainly would not be where it is without it.

I also want to mention the day Google rented a whole theme park for us nerds, which made Fernando try a roller-coaster for the first time after being pushed by MLPack maintainer Ryan and myself. After being horrified at first, he even started to talk about C++ the second or third time.

As you would expect from attending such geeky meetings, Thoralf, Fernando, and I also spent quite some time hacking Shogun, discussing ideas until late at night (of course getting emotional about them 🙂 ). I managed to take a picture of Fernando falling asleep while hacking Shogun’s modular interfaces.
Some of those ideas are collected on our wiki.

• Improving usability
• Making Shogun more modular and slim
• Improving Shogun’s efficiency

Some of those ideas are also part of our theme for our GSoC 2015 application and our planned Hackathon. We have come to a point where we seriously need to focus on application and stability rather than adding more and more cutting-edge algorithms: Shogun’s almost 15 year old framework needs a face lift. GSoC students will see that this year’s project ideas will focus on cleaning up the toolbox and implementing ML applications.

Meet the Shogun/MLPack crew, as nerdy as it gets 😉

GSoC Interview with Sergey and me

Sergey and I gave an interview on Shogun and Google Summer of Code. Here it is:

The internet. More specifically #shogun on irc.freenode.net. Wasn’t IRC that thing that our big brothers used as a socialising substitute when they were teenagers back in the 90s? Anyways. We are talking to two of the hottest upcoming figures in machine learning open-source software, the Russian software entrepreneur Sergey Lisitsyn, and the big German machine Heiko Strathmann.

Hi guys, glad to meet you. Would you mind introducing yourself?

Sergey (S): Hey, I am Sergey. If you ask me what I do apart from Shogun – I am currently working as a software engineer and finishing my Master’s studies at Samara State Aerospace University. I joined Shogun in 2011 as a student and now I am doing my best to help the guys from the Shogun team keep up with GSoC 2014.

Heiko (H): Hej, my name is Heiko. I am doing a PhD in Neuroscience & Machine Learning at the Gatsby Institute in London and joined Shogun three years ago during GSoC. I have loved open-source since my days in school.

Your project, Shogun, is about Machine Learning. That sounds scary and sexy, but what is it really?

H: My grandmother recently sent me an email asking about this ‘maschinelles Lernen’. I replied it is the art of finding structure in data in an automated way. She replied: Since when are you an artist? And what is this “data”? I showed her the movie PI by Darren Aronofsky where the main character at some point is able to predict stock prices after realising “the pattern”, and said that’s what we want to do with a computer. Since then, she is worried about me because the guy puts a drill into his head in the end….. Another cool application is for example to model brain patterns to allow people to learn how to use a prosthesis faster.

S: Or have you seen how your iPhone detects faces? That’s just a Support Vector Machine (SVM). It employs kernels which are inner products of non-linear mappings of Haar features into a reproducing kernel Hilbert Space so that it minimizes ….

Yeah, okok… What is the history of Shogun in the GSoC?

S: The project got started by Sören in his student days around 15 years ago. It was a research only tool for a couple of years before being made public. Over the years, more and more people joined, but the biggest boost came from GSoC…

H: We just got accepted into our 4th year in that program. We had 5+8+8 students so far who all successfully did the program with us. Wow I guess that’s a few million dollars. (EDITOR: actually $105,000.) GSoC students forced Shogun to grow up in many ways: github, a farm of buildbots, proper unit-testing, a cloud-service, web-demos, etc were all set up by students. Also, the diversity of algorithms from latest research increased a lot. From the GSoC money, we were able to fund our first Shogun workshop in Berlin last summer.

How did you two get into Shogun and GSoC? Did the money play a role?

H: I was doing my undergraduate project back in 2010, which actually involved kernel SVMs, and used Shogun. I thought it would be nice to put my ideas into it; also, I was lonely coding just on my own. In 2010, they were rejected from GSoC, but I eventually implemented my ideas in 2011. The money was very useful to me as I was planning to move to London soon. Being totally broke in that city one year later, I actually paid my rent from my second participation’s stipend – which I got for implementing ideas from my Master’s project at uni. Since 2013, I have been mentoring other students and helping to organise the project. I think I would have stayed around without the money, but it would have been a bit tougher.

S: We were having a really hard winter in Russia. While I was walking my bear and clearing the roof of the snow, I realised I forgot to turn off my nuclear missile system…..

H: Tales!

S: Okay, so on another cold night I noticed a message on GSoC somewhere and then I just glanced over the list of accepted organizations and Shogun’s description was quite interesting so I joined a chat and started talking to people – the whole thing was breathtaking for me. As for the money – well, I was a student and was about to start my first part-time job as a developer – it was like a present for me but it didn’t play the main role!

H: To make it short: Sergey suddenly appeared and rocked the house coding in lightspeed, drinking Vodka.

But now you are not paid anymore, while still spending a lot of time on the project. What motivates you to do this?

S: It just draws you in, and you feel like you are participating in something useful. That kind of appreciation is important!

H: Mentoring students is very rewarding indeed! Some of those guys are insanely motivated and talented. It is very nice to interact with a community of people from all over the world sharing the same interest. As someone trying to be a scientist, I also find GSoC very useful for producing tools that I or my colleagues need, but that nobody has the time to build properly. You see, there are all sorts of synergistic effects between GSoC and my day-job at university, such as meeting new people or getting a job since you know how to code in a team.

How does this work? Did you ever publish papers based on GSoC work?

S: Yeah, I actually published a paper based on my GSoC 2011 work. It is called ‘Tapkee: An Efficient Dimension Reduction Library’ and was recently published in the Journal of Machine Learning Research. We started writing it up with my mentor Christian (Widmer) and later Fernando (Iglesias) joined our efforts. It took an enormous amount of time but we did it! Tapkee by the way is a Russian word for slippers.

H: I worked on a project on statistical simulation of global ozone data last year. The code is mainly based on the project of one of my students from last year, a very clever and productive guy from Mumbai whom I would never have met without the program, see http://www.ucl.ac.uk/roulette/ozoneexample

So you came all the way from being a student with GSoC up to being an organisation admin. How does the perspective change during this path?

H: I first had too much time so I coded open-source, then too little money so I coded open-source, then too much work so I mentor people coding open-source. At some point I realised I like this stuff so much that I would like to help organise Shogun and bring together the students and scientists involved. It is great to give back to the community which played a major role for me in my studies. It is also sometimes quite amusing to get those emails from students applying, being worried about the same unimportant things that I worried about back then.

S: It seems to be quite natural actually. You could even miss the point when things change and you become a mentor. Once you are in the game, things go pretty fast. Especially if you have a full-time job and studies!

Are there any (forbidden) substances that you exploit to keep up with the workload?

S: It would sound strange but I am not addicted to vodka. Although I bet Heiko is addicted to beer and sausages.

H: Coffeecoffeecoffeee…… Well, to be honest GSoC definitely reduces your sleep no matter whether you are a student, mentor, or admin. By the way, our 3.0 release was labelled: Powered by Vodka, Mate, and beer.

Do you crazy Nerds actually ever go away from your computers?

H: No.

S: Once we all met at our workshop in Berlin – but we weren’t really away from our computers. Why on earth would we do that?

Any tips for upcoming members of the open-source community? For students? Mentors? Admins?

H: Students: Do GSoC! You will learn a lot. Mentors: Do GSoC! You will get a lot. Admins/Mentors: Don’t do GSoC, it ruins your health. Rather collect stamps!

S: He is kidding. (whispers: “we need this … come on … just be nice to them”)

H: Okay, to be honest: just have fun with what you are doing!

Due to a lack of interest from the community, Sergey and Heiko interviewed themselves on their own.

GSoC 2013 blog: http://herrstrathmann.de/shogun-blog/110-shogun-3-0.html

GSoC 2014 ideas: http://www.shogun-toolbox.org/page/Events/gsoc2014_ideas

Sergey: http://cv.lisitsyn.me/

Yeah! Shogun this week got accepted to be an organisation participating in the 10th Google Summer of Code. This year, besides mentoring a few projects, I am one of the three project administrators. I am curious how this will be. One of the first things to do was to write the application for Shogun – I’m glad it worked! I also will spend a little more time organising things. Apart from trying to find mentors (which requires a lot of talking people into it), I also want to make Shogun (and the students) get more out of the program. Last year, I pushed the team to ask all students

• to write a project report in the form of IPython notebooks (link). These are absolutely great for talking about the GSoC work, impressing people, and giving the students a final piece of work to show.
• to fully unit-test every module of their algorithm/framework. This is absolutely essential in order not to lose the student’s work a few years later when a re-factoring change breaks their code and nobody knows how to fix it. Those tests have already saved us many times since last year.
• to peer-review each other in pairs of students. This improved documentation here and there and solved some bugs. I want to emphasise this more this year as I think it is a great way of enabling synergistic effects between students.

In addition, we will again screen all the applicants via a set of entrance tasks on our github page (link). I just wrote a large number of such smaller or larger tasks that get students started on a particular project, fix bugs in Shogun, or prepare some larger change. In order to get the students started a bit more easily (contributing to Shogun these days is a non-trivial task), I wrote a little how-to (link) that is supposed to point out our expectations and the first steps towards participating in GSoC.

Finally, I wrote descriptions for quite a few possible projects, some of them with a number of interesting co-mentors. The full list is here (link). If you are a talented student interested in any of those topics, consider working with us during the summer. It’s usually a lot of fun!

• Variational Learning for Recommendation with Big Data. With Emtiyaz Khan, who I met at last year’s workshop for latent Gaussian models. Matrix factorisation and Gaussian Processes, ultra-cool project.
• Generic Framework for Markov Chain Monte Carlo Algorithms and Stan Interface. With Theo Papamarkou, who I know from my time at UCL Statistics. It’s about a modular representation of MCMC within Shogun and a possible interface to STAN for the actual sampling. This would be a major step of Shogun towards probabilistic models.
• Testing and Measuring Variable Interactions With Kernels. With Dino, who is a post-doc at Gatsby and co-author of our optimal kernel for MMD paper. This project is to implement all kernel based interaction measures in Shogun in a unified way. We’ll probably use this for research later.
• A Meta-Language for Shogun examples. With Sören. Write example once, press button to generate in any modular language binding. This would be so useful to have in Shogun!
• Lobbying Shogun in MLPACK’s automatic benchmarking system. Joint project with Ryan from MLPACK. He already can compare speed of different toolboxes. Now let’s compare results.
• Shogun Missionary & Shogun in Education. With Sören. Write high quality notebooks and eye-candy examples. Very different project as this is about creative technical writing and illustrating methods on cool data rather than hacking new algorithms. I would be very excited if this happened!

Some of the other projects involve cool buzzwords such as Deep Learning, Structured Output, Kernel, Dual solvers, Cluster backends, etc. Join us! 🙂

GSoC 2013 brings Shogun 3.0

Shogun’s third Google Summer of Code just ended with our participation in the mentor summit at Google’s headquarters in Mountain View and the release of Shogun 3.0 (link). What a great summer! But let’s start at the beginning…

Shogun is a toolbox that offers a unified framework for data-analysis, or in buzz words: machine learning, for a broad range of data types and analysis problems. Those not only include standard tools such as regression, classification, clustering, etc, but also cutting edge techniques from recent developments in research. One of Shogun’s most unique features is its interfaces to a wide range of mainstream computing languages.

In our third GSoC, we continued most of the directions taken in previous years such as asking students to contribute code in the application process for them to be considered. For that, we created a list of smaller introductory tasks for each of the GSoC projects that would become useful later in the project. While allowing students to get used to our development process, and increasing the quality of the applications, this also pushed the projects forward a bit before GSoC even started. The number of applications did not suffer through that (57 proposals from 52 students) but even increased compared to the previous year (48 proposals from 38 students) — this seems to be a trend.

This summer, we also had former GSoC students mentoring for the first time: Sergey Lisitsyn and me (mentoring two projects). Both of us joined in 2011. In addition, the former student Fernando Iglesias participated again and former student Viktor Gal stayed around to work on Shogun during GSoC (and did some massive infrastructure improvements). These are very nice long term effects of continuous GSoC participation. Thanks to GSoC, Shogun is growing constantly both in terms of code and developers.

As in 2012, we eventually could give away 8 slots to some very talented students. All of them did an awesome job on some highly involved projects covering a large number of topics. Two projects were extensions of previous ones:

Roman Votjakov extended last year’s project on the popular Gaussian Processes to handle classification problems, and Shell Hu implemented a collection of algorithms within last year’s structured output framework (for example for OCR).

Fernando Iglesias implemented a new algorithm called metric learning, which plays well together with existing methods in Shogun.

Another new algorithm came from Soumyajit De, who implemented an estimation method for log-determinants of large sparse matrices (needed for example for large-scale Gaussian distributions). On the fly, he also implemented a framework for linear operators and solvers, as well as the fundamentals of an upcoming framework for distributed computing (which is used by his algorithm).

Evangelos Anagnostopoulos worked on feature hashing and random kitchen sinks, two very cool tricks to speed up linear and kernel-based learning methods in Shogun. Kevin Hughes implemented methods for independent component analysis, which can be used to separate mixtures of signals (for example audio, heart-beats, or images) and are well known in the community.
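
To give a rough idea of the random kitchen sinks trick mentioned above, here is a minimal NumPy sketch of random Fourier features for a Gaussian kernel. It is not Shogun code: the function name `random_fourier_features`, the bandwidth parametrisation and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def random_fourier_features(x, num_features, bandwidth, rng):
    """Map rows of x to random cosine features (illustrative helper, not a Shogun API).

    Inner products of the returned features approximate the Gaussian kernel
    exp(-||a - b||^2 / (2 * bandwidth^2)).
    """
    dim = x.shape[1]
    # Random frequencies drawn from the Fourier transform of the Gaussian kernel.
    w = rng.normal(scale=1.0 / bandwidth, size=(dim, num_features))
    # Random phase shifts, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ w + b)

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 3))          # toy data, assumed for the example
features = random_fourier_features(points, 2000, 1.0, rng)

# Approximate kernel matrix from the explicit features.
approx_kernel = features @ features.T
# Exact Gaussian kernel matrix for comparison.
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
exact_kernel = np.exp(-sq_dists / 2.0)
max_error = np.max(np.abs(approx_kernel - exact_kernel))
```

The point of the trick is that a linear method trained on `features` behaves roughly like a kernel method, while its cost grows with the number of random features rather than with the square of the number of points.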

Last but not least, Liu Zhengyang created a pretty web-framework for running Shogun demos from the web browser and added support for directly loading data from the mldata website. Evgeniy Andreev improved Shogun’s usability by integrating native support for various popular file formats such as CSV and protobuf.

You might have noticed the links in the above text (and images). Most of them are the final reports of the students in the form of IPython notebooks, an awesome new open-source tool that we started using for documentation. We are very proud of these.  See http://shogun-toolbox.org/page/documentation/notebook/ for a list of all notebooks. Also check out the web-demo framework at http://www.shogun-toolbox.org/page/documentation/demo/ if you haven’t yet.

IPython also features Shogun in the cloud: Former student Viktor Gal set up http://cloud.shogun-toolbox.org which is an IPython notebook server run by us. It allows you to play with Shogun-python from any web-browser without having to install it. You can try the existing notebooks or write your own. Give it a shot and let us know what you think!

This year’s GSoC also was the most productive one for us ever. We got more than 2000 commits changing almost 400000 lines in more than 7000 files since our last release before GSoC.

Students! You all did a great job and we are more than amazed what you all have achieved. Thank you very much and we hope some of you will stick around.

Besides all the above individual projects, we encouraged students to work together a bit more to enable synergistic effects. One way we tried to implement this was through a peer review where we paired students to check each other’s interface documentation and final notebooks. We held the usual meetings with both mentors and students every few weeks to monitor progress and happiness, and asked students to write weekly reports. Keeping our IRC channel active every day also helped a lot in keeping things going.

My personal experience with mentoring was very positive. It is very nice to give back to the community. I tried to give them the same useful guidance that I received back then, and probably learned as much as my students did on the way. Having participated in GSoC 2011 and 2012, the change of perspective as a mentor was interesting, in particular regarding the selection process. Time wise, I think Google’s official statement of 5 hours per student per week is underestimating things quite a bit (if you want to get things done), and of course there is no upper bound on time you can spend.

Our plan of pairing external mentors with internal developers worked smoothly. As most of our mentors are scientists who tend to be very busy, it is sometimes hard for them to review all code on their own. Combining big-picture guidance with the in-depth framework knowledge of the paired core developers allowed for more flexibility when allocating mentors for projects. Keep in mind that Shogun is still being organised by only five people (4 former students) plus a handful of occasional developers, which makes it challenging to supervise 8 projects.

Another change this year was that writing unit-tests was mandatory to get code merged, which made the number of unit tests grow from 50 to more than 600. In the past years, we had seen how difficult it is to write tests at the end of projects, or maintain untested code. Making students do this on-the-fly drastically increased the stability of their code. A challenging side-effect of this was that many bugs within Shogun were discovered (and eventually fixed) which kept students and developers busy.

As for Shogun itself, GSoC also boosts our community of users, which became so active this year that we decided to organise the first Shogun workshop in Berlin this summer. We had something over 30 participants from all over the world. The Shogun core team also met for the first time in real life, which was nice! We had a collection of talks, discussions, and hands-on sessions. Click here and here for videos and slides.

October brought the mentor summit, which I attended for the first time. This was such a cool event! There was a hotel with a hot-tub, lots of goodies on the Google campus such as an on-site barista (!), a GSoC mentor with a robot-dog, and loads and loads of interesting people from interesting open-source projects. Some of these were new to me, some of them are projects that I have been checking out for more than 10 years now. I attended a few fruitful sessions, for example on open-source software for science. Sören hung out with the people he knew from previous years and the cool Debian guys (for which he is a developer too).

After the summit, the Shogun mentor team went hiking in the south Californian desert – I even climbed a rock.

What a great summer!

GSoC 2013

Shogun got accepted in the Google Summer of Code 2013!

Check out our ideas page. This year, I will be a mentor rather than a student and I am very excited about this.
I’ll be offering two projects:

• Implement Gaussian process classification (joint with Oliver Stegle). This is an extension of the GSoC project last year and should be quite interesting while not being too complicated (link)
• Implement unbiased estimators of likelihoods of very large, sparse Gaussian distributions (joint with Erlend Aune and Daniel Simpson). This one is quite challenging since it involves many different topics. However, it should also be very interesting (link)

Shogun blog posts

GSoC 2012 is over

GSoC 2012 has been over for a few weeks now. It has been a pretty cool summer for me. As last year, I learned lots of things. This year though, my project was a bit more research oriented, which is nice since it allowed me to connect my work for SHOGUN with the stuff I do in Uni. I even mentioned it in my Master’s dissertation (link) which also was about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on the GSoC was sometimes exhausting. It eventually worked out fine since both things were closely related. I would only suggest taking on other important things during GSoC if they are connected to the GSoC project. However, if this condition is met, things multiply in terms of the reward you get due to synergistic effects.

The other students working for SHOGUN also did very cool projects. All these are included in the SHOGUN 2.0 release (link). The project now also has a new website, so it’s worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year. This once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey who organised most things and made this summer so rewarding.

In the near future I will try to put in some extensions to the statistical testing framework that I thought of during the summer but did not have time to implement: on-line features for the linear time MMD, a framework for kernel selection which includes all investigated methods from my Master’s dissertation, and finally unit-tests written using SHOGUN’s new framework for that. I will update the SHOGUN project page of my website (link). I might as well send some tweets to SHOGUN’s new twitter account (link).

11th GSoC weekly report: Done!

This will be my last weekly report for this year’s summer of code! Last week, I did not write a report since I have been very busy with experiments for a rebuttal for the NIPS submission (see 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs and made a few things more efficient.

I also created graphical examples for linear and quadratic time MMD and HSIC based tests. These serve the purpose of illustrating how the methods work on simple datasets. They sample the underlying statistic’s null and alternative distributions using all the different methods I implemented and plot distributions with test thresholds (as well as data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC tests use data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.

These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All methods I implemented are now contained within the tutorial. Another documentation related thing I did was to update the doxygen based sourcecode documentation. In particular, I cleaned up the horrible mess in the CStatistics class and replaced all ascii-art by LaTeX. Although there are still things to do, my project is now in the status “done” in terms of GSoC 🙂 It was a nice summer! I guess I will be extending it with some ideas that came up while working with kernel two sample tests recently.
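
For readers who want to see the MMD setup from this report in code, here is a small NumPy sketch of a quadratic time MMD two-sample test on two Gaussian samples whose means differ in a single component. It is not the SHOGUN implementation: the function names, the Gaussian kernel bandwidth, the sample sizes and the use of permutations to sample the null distribution are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased quadratic time estimate of the squared MMD."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Diagonal terms are excluded from the within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

def permutation_null(x, y, num_permutations, rng, bandwidth=1.0):
    """Sample the null distribution by randomly re-splitting the pooled data."""
    pooled = np.vstack([x, y])
    m = len(x)
    samples = np.empty(num_permutations)
    for i in range(num_permutations):
        perm = rng.permutation(len(pooled))
        samples[i] = mmd2_unbiased(pooled[perm[:m]], pooled[perm[m:]], bandwidth)
    return samples

rng = np.random.default_rng(2)
dim, n = 2, 200                      # assumed toy sizes
x = rng.normal(size=(n, dim))
y = rng.normal(size=(n, dim))
y[:, 0] += 1.0                       # means differ in the first component only

statistic = mmd2_unbiased(x, y)
null_samples = permutation_null(x, y, 200, rng)
p_value = np.mean(null_samples >= statistic)
```

A small `p_value` suggests that the two samples come from different distributions. The sketch recomputes the kernel matrices for every permutation to keep the code short; a real implementation would precompute the kernel on the pooled data once.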

For the last week, I intend to get some unit-testing done and start to focus on things that are needed for our upcoming 2.0 release (Bug hunting, fix warnings, implement things that people request). I will also write an overall summary on the GSoC next month or so. Next month will be busy since I also have to finish my Master’s project.

10th GSoC weekly report: Slowly getting ready

Step by step, my project enters a final state 🙂
Last week, I added new data generation methods, which are used by a new example for independence tests with HSIC. It demonstrates that the HSIC based test is able to capture dependence which is induced by rotating data that has zero correlation, one of the problems from the paper [1]. Here is a picture; the question is: are the two dimensions dependent? Or rather, is a test able to capture that? (correlation is almost zero, dependence is induced via rotation)
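
The kind of data described above can be sketched in a few lines of NumPy. This is only an illustration and not the SHOGUN data generator: I am assuming independent uniform coordinates before the rotation (the paper [1] uses other densities), and the function name and sample size are made up for the example.

```python
import numpy as np

def rotated_uniform_sample(n, angle, rng):
    """Draw n points with independent uniform coordinates, then rotate them."""
    z = rng.uniform(-1.0, 1.0, size=(n, 2))
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    return z @ rotation.T

rng = np.random.default_rng(1)
data = rotated_uniform_sample(5000, np.pi / 4, rng)
x, y = data[:, 0], data[:, 1]

# The linear correlation stays close to zero after the rotation ...
correlation = np.corrcoef(x, y)[0, 1]
# ... but the squared coordinates are clearly correlated, which reveals the dependence.
correlation_of_squares = np.corrcoef(x**2, y**2)[0, 1]
```

Because both coordinates have the same variance before the rotation, the rotated coordinates remain uncorrelated, yet they are no longer independent. This is the situation where a correlation based check fails and a kernel independence test such as HSIC is useful.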

I also realised that my current class structure had problems doing bootstrapping for HSIC, so I re-factored a bit. Bootstrapping is now also available for HSIC using the same code that does it for two-sample-tests. I also removed some redundancy: both independence and two-sample tests are very similar problems and implementations should share code where possible.

Another thing that was missing so far was a way to compute test thresholds; until now, only p-values could be computed. Since different people have different tastes about this, I added both methods. Checking a test statistic against a threshold is straight-forward and gives a binary answer; computing a p-value gives the position of the test statistic in the null-distribution, which contains more information. To compute thresholds, one needs the inverse CDF of the null-distribution. In the bootstrapping case, this is easy: one simply reports the sample that corresponds to a certain quantile. For cases where a normal- or gamma-distribution was fitted, I imported some more routines from the nice ALGLIB toolbox.
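
The difference between the two outputs can be illustrated with a short NumPy sketch for the bootstrapping case. This is not the SHOGUN interface: the function names are hypothetical, and the null samples and the observed statistic are stand-in values.

```python
import numpy as np

def p_value(null_samples, statistic):
    """Fraction of null samples that are at least as large as the statistic."""
    null_samples = np.asarray(null_samples, dtype=float)
    return float(np.mean(null_samples >= statistic))

def threshold(null_samples, alpha):
    """Empirical (1 - alpha) quantile of the null samples (the inverse CDF)."""
    return float(np.quantile(null_samples, 1.0 - alpha))

rng = np.random.default_rng(0)
null_samples = rng.normal(size=1000)   # stand-in for bootstrapped null statistics
statistic = 2.5                        # hypothetical observed test statistic
alpha = 0.05                           # test level

# Binary answer: compare the statistic against the threshold.
reject_by_threshold = statistic > threshold(null_samples, alpha)
# More informative answer: position of the statistic in the null distribution.
reject_by_p_value = p_value(null_samples, statistic) < alpha
```

Note that `np.quantile` interpolates between neighbouring samples by default, so the threshold is not necessarily one of the bootstrapped values, and the two decisions can differ when the statistic lies very close to the threshold.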

For this week, I plan to continue with finishing touches, documentation, examples/tests, etc. Another idea I had is to make the linear time MMD test work with SHOGUN’s streaming features, since the infinite or streaming data case is the main area for its usage.

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence. Advances in Neural Information Processing Systems