# Herr Strathmann.

## I am interested in:

• Machine Learning and Computational Statistics, in particular
• Kernel Methods
• Bayesian Statistics and MCMC
• Bioinformatics (a little)
• Open-source software

## A list of my publications:

Marianne Ruhl, Patrick Chhatwal, Heiko Strathmann, Thomas Kuntzen, Dorothea Bankwitz, Kathrin Skibbe, Andreas Walker, Falko M Heinemann, Peter Horn, Todd M Allen, Daniel Hoffmann, Thomas Pietschmann and Jörg Timm. Escape from a dominant HLA-B*15-restricted CD8+ T cell response against hepatitis C virus requires compensatory mutations outside the epitope.. Journal of virology (November), 2011. URL BibTeX

@article{Ruhl2011,
author = {Ruhl, Marianne and Chhatwal, Patrick and Strathmann, Heiko and Kuntzen, Thomas and Bankwitz, Dorothea and Skibbe, Kathrin and Walker, Andreas and Heinemann, Falko M and Horn, Peter a and Allen, Todd M and Hoffmann, Daniel and Pietschmann, Thomas and Timm, J\"{o}rg},
doi = "10.1128/JVI.05603-11",
issn = "1098-5514",
journal = "Journal of virology",
month = "",
number = "November",
pmid = 22072759,
title = "{Escape from a dominant HLA-B*15-restricted CD8+ T cell response against hepatitis C virus requires compensatory mutations outside the epitope.}",
url = "http://www.ncbi.nlm.nih.gov/pubmed/22072759",
year = 2011
}


Arthur Gretton, Bharath Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil and Kenji Fukumizu. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems. 2012. BibTeX

@inproceedings{Gretton2012a,
author = "Gretton, Arthur and Sriperumbudur, Bharath and Sejdinovic, Dino and Strathmann, Heiko and Balakrishnan, Sivaraman and Pontil, Massimiliano and Fukumizu, Kenji",
booktitle = "Advances in Neural Information Processing Systems",
title = "{Optimal kernel choice for large-scale two-sample tests}",
year = 2012
}


Dino Sejdinovic, Heiko Strathmann, Maria Lomeli Garcia, Christophe Andrieu and Arthur Gretton. Kernel Adaptive Metropolis-Hastings. In International Conference Machine Learning. 2014. URL BibTeX

@inproceedings{Sejdinovic2014kernel,
author = "Sejdinovic, Dino and Strathmann, Heiko and Garcia, Maria Lomeli and Andrieu, Christophe and Gretton, Arthur",
booktitle = "International Conference Machine Learning",
url = "http://jmlr.org/proceedings/papers/v32/sejdinovic14.html",
year = 2014
}


## ICML paper: Kernel Adaptive Metropolis Hastings

$\DeclareMathOperator*{\argmin}{arg\,min}$ $\def\Mz{M_{\mathbf{z},y}}$

In this year's ICML in Bejing, Arthur Gretton, presented our paper on Kernel Adaptive Metropolis Hastings [link, poster]. His talk slides are based on Dino Sejdinovic ones [link]. Pitty that I did not have travel funding. The Kameleon is furious!

The idea of the paper is quite neat: It is basically doing Kernel-PCA to adapt a random walk Metropolis-Hastings sampler's proposal based on the Markov chain's history. Or, more fancy: Performing a random walk on an empirically learned non-linear manifold induced by an empirical Gaussian measure in a Reproducing Kernel Hilbert Space (RKHS). Sounds scary, but it's not: the proposal distribution ends up being a Gaussian aligned to the history of the Markov chain.

We consider the problem of sampling from an intractable highly-nonlinear posterior distribution using MCMC. In such cases, the thing to do (tm) usually is Hamiltonian Monte Carlo (HMC) (with all its super fancy extensions on Riemannian manifolds etc). The latter really is the only way to sample from complicated densities by using the geometry of the underlying space. However, in order to do so, one needs access to target likelihood and gradient.

We argue that there exists a class of posterior distributions for that HMC is not available, while the distribution itself is still highly non-linear. You get them as easy as attempting binary classification with a Gaussian Process. Machine Learning doesn't get more applied. Well, a bit: we sample the posterior distribution over the hyper-parameters using Pseudo-Marginal MCMC. The latter makes both likelihood and gradient intractable and the former in many cases has an interesting shape. See this plot which is the posterior of parameters of an Automatic Relevance Determination Kernel on the UCI Glass dataset. This is not an exotic problem at all, but it has all the properties we are after: non-linearity and higher order information unavailable. We cannot do HMC, so we are left with a random walk, which doesn't work nicely here due to the non-linear shape.

The idea of our algorithm is to do a random walk in an infinite dimensional space. At least almost.

• We run some method to get a rough sketch of the distribution we are after.
• We then embed these points into a RKHS; some infinite dimensional mean and covariance operators are going on here, but I will spare you the details here.
• In the RKHS, the chain samples are Gaussian, which corresponds to a manifold in the input space, which in some sense aligns with what we have seen of the target density yet.
• We draw a sample $f$ from that Gaussian - random walk in the RKHS
• This sample is in an infinite dimensional space, we need to map it back to the original space, just like in Kernel PCA
• It turns out that is hard (nice word for impossible). So let's just take a gradient step along some cost function that minimises distances in the way we want: $\argmin_{x\in\mathcal{X}}\left\Vert k\left(\cdot,x\right)-f\right\Vert _{\mathcal{H}}^{2}$
• Cool thing: This happens in input space and gives us a point whose embedding is in some sense close to the RKHS sample.
• Wait, everything is Gaussian! Let's be mathy and integrate the RKHS sample (and the gradient step) out.

Et voilà, we get a Gaussian proposal distribution in input space$q_{\mathbf{z}}(\cdot|x_{t})=\mathcal{N}(x_{t},\gamma^{2}I+\nu^{2}M_{\mathbf{z},x_{t}}HM_{\mathbf{z},x_{t}}^{\top}),$where $\Mz=2\eta\left[\nabla_{x}k(x,z_{1})|_{x=y},\ldots,\nabla_{x}k(x,z_{n})|_{x=y}\right]$ is based on kernel gradients (which are all easy to get). This proposal aligns with the target density. It is clear (to me) that using this for sampling is better than just an isotropic one. Here are some pictures of bananas and flowers.

The paper puts this idea in the form of an adaptive MCMC algorithm, which learns the manifold structure on the fly. One needs to be careful about certain details, but that's all in the paper. Compared to existing linear adaptive samplers, which adapt to the global covariance structure of the target, our version adapts to the local covariance structure. This can all also be mathematically formalised. For example, for the Gaussian kernel, the above proposal covariance becomes $\left[\text{cov}[q_{\mathbf{z}(\cdot|y)}]\right]_{ij} = \gamma^{2}\delta_{ij} + \frac{4\nu^{2}}{\sigma^{4}}\sum_{a=1}^{n}\left[k(y,z_{a})\right]^{2}(z_{a,i}-y_{i})(z_{a,j}-y_{j}) +\mathcal{O}(n^{-1}),$ where the previous points $z_a$ influence the covariance, weighted by their similarity $k(y,z_a)$ to current point $y$.

Pretty cool! We also have a few "we beat all competing methods" plots, but I find those retarded and will spare them here - though they are needed for publication ;)

Dino and me have also written a pretty Python implementation (link) of the above, where the GP Pseudo Marginal sampling is done with Shogun's ability to importance sample marginal likelihoods of non-conjugate GP models (link). Pretty cool!

## Russian Roulette for intractable Likelihoods

$\def\mm#1{\boldsymbol{#1}} \DeclareMathOperator{\tr}{tr}$

While I was working at UCL's Statistics Department in winter, I got involved into a very exciting project in the group of Mark Girolami. It is based around the Pseudo-Marginal Metropolis-Hastings algorithm. In 2003, a Genetics paper [1] described an approach to sample a distribution using the standard Metropolis-Hastings algorithm when the density function is not available by simply replacing it with an unbiased estimate.

For a standard Bayesian inference problem with likelihood $\pi(y|\theta)$, prior $\pi(\theta)$, and a proposal $Q$, rather than using the standard M-H ratio $\frac{\pi(y|\theta^{\text{new}})}{\pi(y|\theta)}\times\frac{\pi(\theta^{\text{new}})}{\pi(\theta)}\times \frac{Q(\theta|\theta^{\text{new}})}{Q(\theta^{\text{new}}|\theta)},$ the likelihood is replaced by an unbiased estimator as

$\frac{\hat{\pi}(y|\theta^{\text{new}})}{\hat{\pi}(y|\theta)}\times\frac{\pi(\theta^{\text{new}})}{\pi(\theta)}\times \frac{Q(\theta|\theta^{\text{new}})}{Q(\theta^{\text{new}}|\theta)}.$ Remarkably  the resulting Markov chain converges to the same posterior distribution as the exact algorithm. The approach was later formalised and popularised in [2].

In our project, we exploited this idea to perform inference over models whose likelihood functions are intractable. Example of such intractable likelihoods are for example Ising models or, even simpler, very large Gaussian models. Both of those models' normalising constants are very hard to compute. We came up with a way of producing unbiased estimators for the likelihoods, which are based on writing likelihoods as an infinite sum, and then truncating it stochastically.

Producing unbiased estimators for the Pseudo-Marginal approach is a very challenging task. Estimates have to be strictly positive. This can be achieved via pulling out the sign of the estimates in the final Monte-Carlo integral estimate and add a correction term (which increases the variance of the estimator). This problem is studied under the term Sign problem. The next step is to write the likelihood function as an infinite sum. In our paper, we do this for a geometrically titled correction of a biased estimator obtained by an approximation such as importance sampling estates, upper bounds, or deterministic approximations, and for likelihoods based on the exponential function.

I in particular worked on the exponential function estimate. We took a very nice example from spatial statistics: a worldwide grid of ozone measurements from a satellite that consists of a about 173,405 measurements. We fitted a simple Gaussian model whose covariance matrices are massive (and sparse). In such models of the form $\log (\mathcal{N}_x(\mu,\Sigma))=-\log(\det(\Sigma)) - (\mu-x)^T \Sigma^{-1}(\mu-x) + C,$ the normalising constant involves a log-determinant of such a large matrix. This is impossible using classical methods such as Cholesky factorisation $\Sigma=LL^T \Rightarrow \log(\det(\Sigma))=2\sum_i\log(L_{ii}),$ due to memory shortcomings: It is not possible to store the Cholesky factor $L$ since it is not in general sparse. We therefore constructed an unbiased estimator using a very neat method based on graph colourings and Krylov methods from [3].

This unbiased estimator of the log-likelihood is then turned into a (positive) unbiased estimator of the likelihood itself via writing the exponential function as an infinite series $\exp(\log(\det(\Sigma)))=1+\sum_{i=1}^\infty \frac{\log(\det(\Sigma))^i}{i!}.$

We then construct an unbiased estimator of this series by playing Russian Roulette: We evaluate the terms in the series and plug in a different estimator for $\log(\det(\Sigma))$ for every $i$; once those values are small, we start flipping a coin every whether we continue the series or not. If we do continue, we add some weights that ensure unbiasedness. We also ensure that it is less likely to continue in every iteration so that the procedure eventually stops. This basic idea (borrowed from Physics papers from some 20 years ago) and some technical details and computational tricks then give an unbiased estimator of the likelihood of the log-determinant of our Gaussian model and can therefore be plugged into Pseudo-Marginal M-H. This allows to perform Bayesian inference over models of sizes where it has been impossible before.

More details can be found on our project page (link, see ozone link), and in our paper draft on arXiv (link). One of my this year's Google summer of Code projects for the Shogun Machine-Learning toolbox is about producing a sophisticated implementation of log-determinant estimators (link). Pretty exciting!

[1]: Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164 1139–1160.
[2]: Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2), 697–725.
[3]: Aune, E., Simpson, D., & Eidsvik, J. (2012). Parameter Estimation in High Dimensional Gaussian Distributions.

## NIPS paper: Optimal kernel choice for large-scale two-sample tests

NIPS 2012 is already over. Unfortunately, I could not go due to the lack of travel funding. However, as mentioned a few weeks ago, I participated in one paper which is closely related to my Master's project with Arthur Gretton and Massi Pontil. Optimal kernel choice for large-scale two-sample tests. We recently set up a page for the paper where you can download my Matlab implementation of the paper's methods. Feel free to play around with that. I am currently finishing implementing most methods into the SHOGUN toolbox. We also have a poster which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN, it will be included in the next release. See the base class interface here. See an example to use it: single kernel (link) and combined kernels (link). All methods that are mentioned in the paper are included. I also updated the shogun tutorial (link).

At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel $k$ and terms
$h_k((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i}),$
the linear time MMD is their empirical mean,

$\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),$
which is a linear time estimate for the squared distance of the mean embeddings of the distributions where the $x_i, y_i$ come from. The quantity allows to perform a two-sample test, i.e. to tell whether the underlying distributions are different.

Given a finite family of kernels $\mathcal{K}$, we select the optimal kernel via maximising the ratio of the MMD statistic by a linear time estimate of the standard deviation of the terms
$k_*=\arg \sup_{k\in\mathcal{K}}\frac{ \hat\eta_k}{\hat \sigma_k},$
where $\hat\sigma_k^2$ is a linear time estimate of the variance $\sigma_k^2=\mathbb{E}[h_k^2] - (\mathbb{E}[h_k])^2$ which can also be computed in linear time and constant space. We give a linear time and constant space empirical estimate of this ratio. We establish consistency of this empirical estimate as
$\left| \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right|=O_P(m^{-\frac{1}{3}}).$

In addition, we describe a MKL style generalisation to selecting weights of convex combinations of a finite number of baseline kernels,
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\},$
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\},$
where $\hat{Q}$ is the positive definite empirical covariance matrix of the $h$ terms of all pairs of kernels.

We then describe three experiments to illustrate

• That our criterion outperforms existing methods on synthetic and real-life datasets which correspond to hard two-sample problems
• Why multiple kernels can be an advantage in two-sample testing

Supplementary page

## Master's dissertation: Adaptive Large-Scale Kernel Two-Sample Testing

I recently finished working on my Master project here at UCL. It was supervised by Arthur Gretton and Massimiliano Pontil and was about kernel selection for a linear version of the Maximum Mean Discrepancy, a kernel based two-sample test. Download report here.

Given sets of samples of size m from two probability distributions, a two-sample test decides whether the distributions are the same or different with some confidence level. The linear time MMD statistic is defined as

$\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),((x_{j},y_{j}))=k(x_{i},x_{i})+k(y_{i},y_{i})-k(x_{i},y_{j})-k(x_{j},y_{i})$
and $k$ is an RKHS reproducing kernel (I used the Gaussian kernel only).
A two sample test works simply as this: if this statistic (computed on sample data) is larger than a computed threshold (also on data), it is likely that the two sets of samples are from different distributions.

(Null and alternative distributions of MMD statistic, red area represents type I error, blue area represents type II error)

The linear time version is highly interesting for large-scale problems since one does not have to store data in order to compute it. Instead, it is possible to compute statistic and threshold in an on-line way.

The work contains three main contributions:

1. An on-line version for the already existing linear time two-sample test. More important, it was shown in experiments that in some situations, the linear time test is a better choice than the current quadratic time MMD state-of-the-art method. This for example happens when problems are so hard that the amount of data necessary to solve them does not fit into computer memory. On the blobs dataset described in the work,

a quadratic time test on the maximum processable amount of data reached a bad type II error while with the linear time version and much more data, almost zero type II error could be reached. Another case is when simply infinite data (but finite computation time) is available: the (adaptive) linear time test reaches lower type II error that its quadratic counterpart.

2. A description of a criterion that can be used for kernel selection for the linear time MMD. Null and alternative distribution of the statistic have appealing properties that can be exploited in order to select the optimal kernel in the sense that a test's type II error is minimised. The criterion is the ratio of MMD statistic and its standard deviation. This pulls null and alternative distribution apart while minimising their variance.

In experiments, this criterion performed better or at least equal than existing methods for kernel selection. This is especially true when the length scale at which probability distributions differ is different to the overall length scale, as for example in the above shown blobs dataset.

(Opt and MaxRat methods are based on the criterion and perform better than existing methods, X-Val-Type II is another newly described method, blobs dataset)

3. A MKL-style generalisation of two-sample testing on the base of finite linear combinations of baseline kernels of the form
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\}$
Optimal weights of these combinations can be computed via solving the convex program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where $\hat{Q}$ is a linear time estimate of the covariance of the MMD estimates and $\hat{\eta}$ is a linear time estimate of the MMD.

Whenever combined kernels may capture more information relevant distinguishing probability distributions than one kernel, this method leads to better results.

(Opt and $L2$ use combined kernels)
It also has an interesting feature selection interpretation in the sense that the two-sample test provides information on which kernels (and therefore domains) are useful for locating distinguishing characteristics of distributions.

All above plots and results can be found in the report (link). Many results are joint work and went to the article "Optimal kernel choice for large-scale two-sample tests", which was accepted at NIPS 2012 while my report was written. Many thanks to the other authors, in particular to Arthur Gretton, Dino Sejdinovic, Bharath Sriperumbudur, and Massimiliano Pontil for all the fruitful discussions and guidance.

## Bachelor's disseration: Adaptive Kernel Methods for Sequence Classification in Bioinformatics

I wrote my Bachelor thesis in 2009/2010 in Duisburg, supervised by Prof. Hoffmann, Bioinformatics, Essen, and Prof. Pauli, Intelligent systems, Duisburg.

The work is about classification of amino acid sequences using SVM and string kernels. In particular, I compared the Distant Segments kernel by Sébastian Boisvert to a standard Spectrum kernel on a HIV dataset. In addition, I described a bisection based method to search for the soft-margin parameter of the underlying SVM, which outperformed a standard grid-search.