Shogun 4.0 and GSoC 2014 follow up

No, this is not about Fernando’s and my honeymoon …

The Shogun team just released version 4.0 of their community-driven Machine Learning toolbox. This release most of all features the work of our 8 Google Summer of Code 2014 students, so this blog post is dedicated to them: you guys rock. This also brings an end to yet another very active year of Shogun: we organised a second workshop in Berlin, and I presented Shogun to the public in London, New York and Berlin.

For the 4th time, Shogun participated in Google’s wonderful program, which more than anything boosted the team’s size and motivation. What else makes people spend sleepless nights hunting bugs for the sake of Machine Learning for everyone? This year was the first time that I organised our participation. This ranged from writing the application at the last second and harassing potential mentors until they said ‘yes’, to making up overly ambitious projects to scare students away, and ending up mentoring too many students on my own. Jokes aside, this was a very challenging (in particular time-wise) but also a very rewarding experience that definitely sharpened my project organisation skills. As in the previous year, I tried to fuse my scientific life and Shogun’s GSoC participation: kernel methods and variational learning are things I touch on a daily basis at Gatsby. Many mentors were approached after I had met them at scientific Machine Learning conferences, and having been exposed to ML for some years now, I also find it easier to help students implement and write about popular ML algorithms.
Here is a list. (Note that all projects come with really nice IPython notebooks — something that we continued to insist on from last year.)

Fundamental ML algorithms by Parijat Mazumdar (parijat). Mentor: Fernando
Shogun needs more standard ML algorithms. Parijat implemented some of those: random forests, kernel density estimation and more. Parijat’s code quality is amazing and together with Fernando’s superb mentoring skills (his first time mentoring), this project is likely to have been very sustainable.
Notebook random forest, notebook KDE.

Kernel testing and feature selection by Rahul De (lambday). Mentor: Dino Sejdinovic, Heiko
Previous year’s student lambday continued to rock. First, he massively extended my 2012 project on kernel hypothesis testing to Big Data land. Dino, who was one of the invited speakers in the Shogun workshop last summer, and I are actually working on a journal article where we will use this code. Second, he extended the framework to perform feature selection via dependence measures. Third, he initiated and guided development of a framework for unifying Shogun’s linear algebra operations. This for example can be used to change existing algorithms from CPU to GPU with a compile switch — useful also for our deep learning project.

Variational Inference for Gaussian Processes by Wu Lin (yorkerlin). Mentor: Heiko, Emtiyaz Khan
In our third GSoC project on GPs, Wu took a couple of state-of-the-art approximate variational inference methods developed by Emtiyaz, and put them into Shogun’s framework. The result of this very involved and technical project is that we now have large-scale classification using GPs. Emtiyaz also was a speaker at our workshop.
Notebook

Shogun missionary by Saurabh. Mentor: Heiko
The idea of this project was to showcase Shogun’s abilities, something we definitely need to work on. Saurabh wrote a couple of notebooks that are essentially ML tutorials using Shogun. If you want to know about ML basics, regression, classification, model-selection, SVMs, multiclass, or multiple kernel learning, this is for you. He also extended our web-demo framework, for example to include model-selection for GPs.

OpenCV integration by Kislay. Mentor: Kevin
Kislay, after writing a very cool notebook on PCA for his application, wrote data-structures to bridge between Shogun and OpenCV. The project was supervised by Kevin, who is also one of our former GSoC students. This makes it possible to use the two libraries together in a neat way.

Deep learning by Khaled Nasr. Mentor: Theofanis, Sergey
The hype is on! After NIPS, Facebook, and Google DeepMind, Shogun has now also joined 😉 Khaled did a very good job in coding up the standard models, and was involved in generalising Shogun’s linear algebra on the fly with lambday. This is a project that is likely to have a second part. Check his superb notebooks on deep belief networks, convolutional networks, autoencoders, and restricted Boltzmann machines.

SO Learning with Approximate Inference by Jiaolong. Mentor: hushell, Thoralf
This was another project that was (co-)mentored by a former GSoC student. With the help of his mentors, Jiaolong implemented various approximate inference methods for structured output (SO) models. Check out his notebook.

Large-Scale Multi-Label Classification by Abinash. Mentor: Thoralf
Another project involving our structured output expert Thoralf as mentor. Abinash implemented large-scale multilabel learning, beating scikit-learn’s implementation both in runtime and accuracy. This experiment is described in this notebook.

Finally, we sent two of our delegates (Thoralf and Fernando) to the 10th anniversary mentor summit in California in late October. Really cool: I got lucky and won Google’s lottery for some extra places, so I could also join. The summit once again was overly colourful, bursting with creative minds who have the most diverse set of opinions and approaches, but who are all united by their excitement about open-source. The beauty of this community to me really lies in the people who do work purely driven by their interest in *the thing itself*, independent of competitive and in particular commercial interests, sometimes almost to an extent that is beyond any form of compromise. A wonderful illustration of this came at the mentor summit, during the reception in the Tech Museum in San Jose: Google’s speaker and head of finance Patrick Pichette (disclaimer: not sure, don’t quote me on this), who is the boss of Chris DiBona, who himself organises the GSoC, sought to inspire the audience to “think BIG” and to “change the lives of GSoC students”. Ten minutes later, guest speaker Linus Torvalds mused that he could not be a GSoC mentor as he would scare people away, and that the best way to get involved in open-source is to “start small”, a sentence after which P.P. left the room. Funnily enough, in GSoC this community is then hugged by a super capitalistic American internet company, and gladly lets it happen: we all love GSoC, and Shogun certainly would not be where it is without it. I also want to mention the day Google rented a whole theme park for us nerds, which made Fernando try a roller-coaster for the first time after being pushed by MLPack maintainer Ryan and myself. After being horrified at first, he even started to talk about C++ the second or third time.

As you would expect from attending such geeky meetings, Thoralf, Fernando, and I also spent quite some time hacking Shogun, discussing ideas until late at night (of course getting emotional about them 🙂). I managed to take a picture of Fernando falling asleep while hacking Shogun’s modular interfaces. Some of those ideas are collected on our wiki.

  • Improving usability
  • Making Shogun more modular and slim
  • Improving Shogun’s efficiency

Some of those ideas are also part of our theme for our GSoC 2015 application and our planned Hackathon. We have come to a point where we seriously need to focus on application and stability rather than adding more and more cutting-edge algorithms: Shogun’s almost 15-year-old framework needs a face lift. GSoC students will see that this year’s project ideas will focus on cleaning up the toolbox and implementing ML applications.

Meet the Shogun/MLPack crew, as nerdy as it gets 😉

Shogun in NYC

In late August, I was invited to NYC to present Shogun at an open-source Machine Learning software workshop (link), organised by John Langford. Seeing Shogun being recognised as a major player along with big ML/stats libraries like Theano, Stan, Torch, LibLinear, VowpalWabbit, etc. really got me excited.

I talked to most of the other projects’ developers and a few very interesting possible collaborations came up. One example is Shogun’s unique way of automagically generating interfaces to most computing languages via SWIG. It has been in our pipeline of ideas for a long time to pull this functionality out of Shogun, and offer it to other projects in a modular way. Sergey has put together a simple prototype (link) and will continue to work on this soon.

Another thing is that we would like to integrate some (fixed) models from Stan into Shogun, for example to complement our collection of variational inference methods for Gaussian Processes with a full-blown MCMC-based approach.

While I was talking to Gunnar Raetsch, we had the idea to host a hackfest where we bring together all Shogun core-developers for a week, working on more sophisticated projects that are not suitable for Google Summer of Code. A generic framework for parallelising/distributing algorithms in Shogun would be a first idea, extending ideas from lambday’s GSoC 2013 project. This would again also be useful for other ML libraries, and I in fact talked to John Langford about using ideas from VowpalWabbit here. We made a list of ideas (link) for such a hack and are currently trying to get funding for it.

The meeting was video-taped and I will post links to the recordings soon.

Shogun workshop 2014

Another super nice event that happened in Berlin in July was the second Shogun workshop, which the Shogun team and I organised. We had a hands-on session and a main workshop day full of talks, cool people, and lots of Machine Learning. It is a great pleasure working with the Shogun team, and I am very happy that we were able to push this year’s workshop through.

I talked a bit about what Shogun is and what it tries to be, and gave some directions for future work. If you missed the event, we got full video coverage!

Workshop program

Videos

Shogun at EuroPython 2014

It’s already a while ago: in July, I presented Shogun at EuroPython in Berlin. This was a super nice conference full of interesting people and projects. My talk was recorded and can be found [here], slides are [here].

Another super interesting project was presented by Thomas Wiecki: Probabilistic programming with PyMC3 [video]. This is a very interesting project, a bit similar to Stan, which allows you to plug together probabilistic models and then do HMC for inference, powered by automatic differentiation (which I am currently super excited about). Unlike Stan, this all happens in Python, which makes things super comfortable (example notebook). As I am working on kernel-based samplers (Kameleon MCMC & soon more), this might be a good vehicle to make them public, exploiting the nice auto-diff tool (Theano) that PyMC uses.

I Like Intractable Likelihoods

Last week, I went to the i-like workshop at Oxford University. Pretty cool! All of Britain’s statisticians were there and I met many of them for the first time. Check out my two posters (Russian Roulette, Kernel Adaptive Metropolis Hastings). Talks were amazing – as at the last NIPS, the main trend is on estimating likelihoods (well, that’s the name of the program), either using some other random process such as importance sampling a latent model’s marginal likelihood (aka Pseudo-Marginal MCMC), or directly sub-sampling likelihoods or gradients.

These things are important in Machine Learning too, and it is very nice to see the fields growing together (even though there was a talk by a statistician spending lots of time on re-inventing belief propagation and junction tree ideas – always such a pity if this happens simply because communities do not talk to each other enough). Three talks that I really found interesting:

Remi Bardenet talked about sub-sampling approaches to speed up MCMC. This is quite related to the Austerity in MCMC land paper by Welling & Co, with the difference that his tests do not suffer from a small number of points in the hypothesis test that decides whether to accept or reject.

Chris Sherlock talked about optimal rates and scaling for Pseudo-Marginal MCMC. There finally are some nice heuristics for how to scale PM estimates in a way that the number of iid samples per computation time is optimal. Interestingly, the acceptance rate and the variance of the likelihood estimate can be tweaked separately.

Jim Griffin gave a very interesting talk on adaptive MCMC on discrete, in particular binary, state-spaces – he used them for feature selection (in ML language). His algorithm automatically learns global mutation rates for each of the positions. However, it doesn’t take any correlations between the features into account. This might be a very interesting application for our fancy Kameleon sampler (arxiv, code); I am thinking about this!
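
To make the pseudo-marginal idea from above a bit more concrete, here is a minimal sketch in plain NumPy. It is not taken from any of the talks and not from Shogun; the toy model (latent z_i ~ N(theta, 1), observation y_i ~ N(z_i, 1), prior theta ~ N(0, 10^2)) and all function names are assumptions chosen so that the exact posterior is available for comparison. The marginal likelihood is replaced by an unbiased importance sampling estimate, and the estimate of the current state is re-used rather than recomputed, which is what keeps the chain targeting the exact posterior.

```python
import numpy as np

def log_lik_estimate(theta, y, num_latent, rng):
    """Log of an unbiased importance sampling estimate of p(y | theta).

    Toy model (an assumption for illustration): z_i ~ N(theta, 1),
    y_i | z_i ~ N(z_i, 1). The latent prior is used as the proposal.
    """
    z = theta + rng.standard_normal((num_latent, len(y)))
    log_w = -0.5 * (y[None, :] - z) ** 2 - 0.5 * np.log(2 * np.pi)
    # log of the average weight per observation, computed stably
    m = log_w.max(axis=0)
    log_mean_w = m + np.log(np.mean(np.exp(log_w - m), axis=0))
    return log_mean_w.sum()

def log_prior(theta, prior_std=10.0):
    return -0.5 * (theta / prior_std) ** 2

def pseudo_marginal_mh(y, num_iter, step, num_latent, rng):
    theta = 0.0
    log_post = log_lik_estimate(theta, y, num_latent, rng) + log_prior(theta)
    chain = np.empty(num_iter)
    for t in range(num_iter):
        prop = theta + step * rng.standard_normal()
        log_post_prop = log_lik_estimate(prop, y, num_latent, rng) + log_prior(prop)
        # Key point: log_post of the current state is NOT re-estimated here.
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = prop, log_post_prop
        chain[t] = theta
    return chain

rng = np.random.default_rng(1)
theta_true, n = 1.0, 20
y = theta_true + rng.standard_normal(n) + rng.standard_normal(n)

chain = pseudo_marginal_mh(y, num_iter=5000, step=0.5, num_latent=50, rng=rng)
posterior_mean_estimate = chain[1000:].mean()

# Exact posterior mean for comparison: marginally y_i ~ N(theta, 2).
precision = n / 2.0 + 1.0 / 10.0 ** 2
posterior_mean_exact = (y.sum() / 2.0) / precision
print(posterior_mean_estimate, posterior_mean_exact)
```

The number of latent samples per observation (`num_latent`) controls the variance of the likelihood estimate, which is the quantity the scaling heuristics mentioned above are concerned with: too few samples and the chain gets stuck, too many and each iteration becomes needlessly expensive.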

Finally, I presented two posters: the one on Playing Russian Roulette with Intractable Likelihoods that I already presented in Reykjavik, and (with Dino) a new poster (link) on the Kernel Adaptive Metropolis Hastings sampler Kameleon that I mentioned above. The corresponding paper will hopefully be published very soon. Talking to other scientists about my own work is just great!

MLOSS workshop at NIPS 2013

Last week, I went to the Advances in Neural Information Processing Systems (NIPS) conference for the first time. That was a very nice experience due to the incredible density of people whose names I know from research papers. In fact, it was too much to take in, so I had to pick things that sounded interesting – still loads.

The main three buzzwords of the conference for me were: Deep Learning (even Mark Zuckerberg is interested in that these days), Mini-batch, and stochastic gradient descent (aka on-line whatever).

One very interesting workshop I attended on Tuesday was on Machine Learning Open-Source Software (MLOSS), organised by Cheng Soon Ong (who could not be there unfortunately) and Antti Honkela. I presented a short spotlight for Shogun (slide) and had a one hour demo, showing off with our cool IPython notebooks (link) and the cloud Shogun server (link). I got some very encouraging feedback for this, including from Fernando Perez.
I also met a few nice fellow open-source ML coders from scikit-learn.

During the workshop, there was quite a lively discussion about licensing issues, in particular whether to choose GPL or BSD. The Python universe for example seems to gain a lot from being BSD-style licensed.

Finally, NIPS was held close to Lake Tahoe, which is surrounded by incredibly beautiful mountains to hike in. One evening, I met the guy who left those traces … very exciting, slightly scary…

GSoC 2013 brings Shogun 3.0

Shogun’s third Google Summer of Code just ended with our participation in the mentor summit at Google’s headquarters in Mountain View and the release of Shogun 3.0 (link). What a great summer! But let’s start at the beginning…

Shogun is a toolbox that offers a unified framework for data-analysis, or in buzz words: machine learning, for a broad range of data types and analysis problems. Those not only include standard tools such as regression, classification, clustering, etc, but also cutting edge techniques from recent developments in research. One of Shogun’s most unique features is its interfaces to a wide range of mainstream computing languages.

In our third GSoC, we continued most of the directions taken in previous years, such as asking students to contribute code in the application process for them to be considered. For that, we created a list of smaller introductory tasks for each of the GSoC projects that would become useful later in the project. While allowing students to get used to our development process, and increasing the quality of the applications, this also pushed the projects forward a bit before GSoC even started. The number of applications did not suffer from that (57 proposals from 52 students) but even increased compared to the previous year (48 proposals from 38 students); this seems to be a trend.

This summer, we also had former GSoC students mentoring for the first time: Sergey Lisitsyn and me (mentoring two projects). Both of us joined in 2011. In addition, the former student Fernando Iglesias participated again and former student Viktor Gal stayed around to work on Shogun during GSoC (and did some massive infrastructure improvements). These are very nice long term effects of continuous GSoC participation. Thanks to GSoC, Shogun is growing constantly both in terms of code and developers.

As in 2012, we eventually could give away 8 slots to some very talented students. All of them did an awesome job on some highly involved projects covering a large number of topics. Two projects were extensions of previous ones:

Roman Votjakov extended last year’s project on the popular Gaussian Processes to handle classification problems, and Shell Hu implemented a collection of algorithms within last year’s structured output framework (for example for OCR).

Fernando Iglesias implemented a metric learning algorithm, which plays well together with existing methods in Shogun.

Another new algorithm came from Soumyajit De, who implemented an estimation method for log-determinants of large sparse matrices (needed for example for large-scale Gaussian distributions). Along the way, he also implemented a framework for linear operators and solvers, and the fundamentals of an upcoming framework for distributed computing (which is used by his algorithm).

Evangelos Anagnostopoulos worked on feature hashing and random kitchen sinks, two very cool tricks to speed up linear and kernel-based learning methods in Shogun. Kevin Hughes implemented methods for independent component analysis, which can be used to separate mixtures of signals (for example audio, heart-beats, or images) and are well known in the community.
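
For readers who have not seen random kitchen sinks before, here is a minimal sketch of the underlying trick (random Fourier features for the Gaussian kernel) in plain NumPy. This is not Shogun’s implementation or API; the function names and parameter values are assumptions for illustration. The point is that inner products of the random features approximate the kernel, so a linear method on the features can stand in for a kernel method.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def random_fourier_features(X, W, b):
    """Map X to features z(x) = sqrt(2 / D) * cos(W^T x + b)."""
    num_features = W.shape[1]
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, dim, num_features, sigma = 200, 5, 2000, 1.5

X = rng.standard_normal((n, dim))
# Frequencies are drawn from the kernel's spectral density, N(0, I / sigma^2),
# and phases uniformly from [0, 2 pi].
W = rng.standard_normal((dim, num_features)) / sigma
b = rng.uniform(0, 2 * np.pi, num_features)

Z = random_fourier_features(X, W, b)
K_exact = gaussian_kernel(X, X, sigma)
K_approx = Z @ Z.T  # linear kernel on the random features

max_err = np.max(np.abs(K_exact - K_approx))
print("largest entry-wise error:", max_err)
```

The approximation error shrinks roughly like one over the square root of the number of random features, so `num_features` trades accuracy against the cost of the subsequent linear learning step.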


Last but not least, Liu Zhengyang created a pretty web-framework for running Shogun demos from the web browser and added support for directly loading data from the mldata website. Evgeniy Andreev improved Shogun’s usability by integrating native support for various popular file formats such as CSV and protobuf.

You might have noticed the links in the above text (and images). Most of them are the final reports of the students in the form of IPython notebooks, an awesome new open-source tool that we started using for documentation. We are very proud of these. See http://shogun-toolbox.org/page/documentation/notebook/ for a list of all notebooks. Also check out the web-demo framework at http://www.shogun-toolbox.org/page/documentation/demo/ if you haven’t yet.

IPython also features Shogun in the cloud: Former student Viktor Gal set up http://cloud.shogun-toolbox.org, which is an IPython notebook server run by us. It allows you to play with Shogun-python from any web-browser without having to install it. You can try the existing notebooks or write your own. Give it a shot and let us know what you think!

This year’s GSoC also was the most productive one for us ever. We got more than 2000 commits changing almost 400,000 lines in more than 7000 files since our last release before GSoC.

Students! You all did a great job and we are more than amazed by what you all have achieved. Thank you very much and we hope some of you will stick around.

Besides all the above individual projects, we encouraged students to work together a bit more to enable synergistic effects. One way we tried to implement this was through a peer review where we paired students to check each other’s interface documentation and final notebooks. We held the usual meetings with both mentors and students every few weeks to monitor progress and happiness, and asked students to write weekly reports. Keeping our IRC channel active every day also helped a lot in keeping things going.

My personal experience with mentoring was very positive. It is very nice to give back to the community. I tried to give my students the same useful guidance that I received back then, and probably learned as much as they did on the way. Having participated in GSoC 2011 and 2012, the change of perspective as a mentor was interesting, in particular regarding the selection process. Time-wise, I think Google’s official statement of 5 hours per student per week is underestimating things quite a bit (if you want to get things done), and of course there is no upper bound on the time you can spend.

Our plan of pairing external mentors with internal developers worked smoothly. As most of our mentors are scientists who tend to be very busy, it is sometimes hard for them to review all code on their own. Combining big-picture guidance with the in-depth framework knowledge of the paired core developers allowed for more flexibility when allocating mentors for projects. Keep in mind that Shogun is still being organised by only five people (4 former students) plus a handful of occasional developers, which makes it challenging to supervise 8 projects.

Another change this year was that writing unit-tests was mandatory to get code merged, which made the number of unit tests grow from 50 to more than 600. In the past years, we had seen how difficult it is to write tests at the end of projects, or maintain untested code. Making students do this on-the-fly drastically increased the stability of their code. A challenging side-effect of this was that many bugs within Shogun were discovered (and eventually fixed), which kept students and developers busy.

As for Shogun itself, GSoC also boosts our community of users, which became so active this year that we decided to organise the first Shogun workshop in Berlin this summer. We had something over 30 participants from all over the world. The Shogun core team also met for the first time in real life, which was nice! We had a collection of talks, discussions, and hands-on sessions. Click here and here for videos and slides.

October brought the mentor summit, which I attended for the first time. This was such a cool event! There was a hotel with hot-tub, lots of goodies on the Google campus such as an on-site barista (!), a GSoC mentor with a robot-dog, and loads and loads of interesting people from interesting open-source projects. Some of these were new to me, some of them are projects that I have been checking out for more than 10 years now. I attended a few fruitful sessions, for example on open-source software for science. Sören hung out with the people he knew from previous years and the cool Debian guys (for which he is a developer too).

After the summit, the Shogun mentor team went hiking in the south Californian desert – I even climbed a rock.

What a great summer!

Shogun Workshop 2013

Last weekend, our Shogun workshop finally took place in Berlin. It was really cool to meet all those guys in person. We have been working together for quite some time now. The core-team and Shogun’s supporters are absolutely awesome. It is great to be part of that.

We had a nice afternoon at c-base (who were so friendly to host us) with some talks by all of our developers, followed by two days of hands-on workshop at the TU-Berlin.

I gave a little talk on two random things you can do with kernels (that are completely unrelated): Gaussian Processes and the kernel MMD. Slides are available (download). I also wrote some IPython notebooks for GP-regression (link), GP-probit-classification (link), and two-sample testing with the kernel MMD (link).
One of the results of our discussions was that we will start using those notebooks for Shogun’s documentation, as they allow us to combine code, plots, and maths in a web-based viewer.
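
For those who have not seen two-sample testing with the kernel MMD, here is a minimal sketch in plain NumPy. It is not the code from the notebook and does not use Shogun’s API; the function names, the fixed kernel width, and the choice of a permutation test for the null distribution are all assumptions made for illustration. The statistic is the (biased) squared MMD estimate, i.e. the squared distance between the two samples’ mean embeddings under a Gaussian kernel.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd2_biased(K, n):
    """Biased squared MMD from the joint kernel matrix of [X; Y], len(X) = n."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

def mmd_permutation_test(X, Y, sigma, num_permutations, rng):
    """Return the MMD statistic and a permutation-based p-value."""
    n = len(X)
    Z = np.vstack([X, Y])
    K = gaussian_kernel(Z, Z, sigma)  # computed once, re-indexed per permutation
    statistic = mmd2_biased(K, n)
    null_samples = np.empty(num_permutations)
    for i in range(num_permutations):
        perm = rng.permutation(len(Z))
        null_samples[i] = mmd2_biased(K[np.ix_(perm, perm)], n)
    p_value = (1 + np.sum(null_samples >= statistic)) / (1 + num_permutations)
    return statistic, p_value

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))        # sample from N(0, I)
Y = rng.standard_normal((100, 2)) + 1.0  # sample from N(1, I): shifted mean

statistic, p_value = mmd_permutation_test(X, Y, sigma=1.0,
                                          num_permutations=200, rng=rng)
print("MMD^2:", statistic, "p-value:", p_value)
```

A small p-value suggests that the two samples come from different distributions. Permuting the pooled sample is only one way to approximate the null distribution, and it costs one re-indexing of the kernel matrix per permutation, which is why cheaper approximations matter for large samples.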

Finally, here are some pictures of us (pretty nerdy).