I finally finished an important and very cool extension to my GSoC 2012 project: making the linear time MMD statistic work with streaming-based data. In particular, it now uses SHOGUN’s streaming framework.
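By design, the linear time MMD statistic, given as
$$\text{MMD}_l^2 = \frac{1}{m_2}\sum_{i=1}^{m_2} h\big((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})\big), \qquad m_2=\lfloor m/2 \rfloor,$$
with $h\big((x_i,y_i),(x_j,y_j)\big) = k(x_i,x_j)+k(y_i,y_j)-k(x_i,y_j)-k(x_j,y_i)$ for a kernel $k$ and $m$ samples from each distribution,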
is very well suited for streaming-based data, since only four examples have to be held in memory at once. Once the sum in the h-statistic is computed, the used data can be “forgotten”. As I described in my M.Sc. thesis (link), this makes it possible to process a practically unlimited amount of data, which can result in more accurate two-sample tests. This holds in particular when the amount of data needed to solve a problem is larger than the computer’s memory.
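To make the "only four examples in memory" point concrete, here is a minimal pure-Python sketch. It is not SHOGUN's code: the names `gaussian_kernel`, `h_statistic`, `streaming_linear_time_mmd` and `gaussian_stream` are made up for illustration, and the Gaussian kernel with width `sigma` is just an assumed choice of kernel.

```python
import math
import random


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two points given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def h_statistic(x1, y1, x2, y2, kernel=gaussian_kernel):
    """One summand of the linear time MMD, built from four examples."""
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def streaming_linear_time_mmd(stream_p, stream_q, num_terms, kernel=gaussian_kernel):
    """Average num_terms h-statistics, pulling examples from two iterables.

    Only the four current examples and a running sum are kept in memory.
    Both streams must provide at least 2 * num_terms examples.
    """
    if num_terms < 1:
        raise ValueError("num_terms must be at least 1")
    stream_p, stream_q = iter(stream_p), iter(stream_q)
    total = 0.0
    for _ in range(num_terms):
        x1, y1 = next(stream_p), next(stream_q)
        x2, y2 = next(stream_p), next(stream_q)
        total += h_statistic(x1, y1, x2, y2, kernel)
        # x1, x2, y1, y2 are overwritten in the next iteration ("forgotten")
    return total / num_terms


def gaussian_stream(mean, dim=1, seed=0):
    """Endless stream of unit-variance Gaussian examples (hypothetical data source)."""
    rng = random.Random(seed)
    while True:
        yield [rng.gauss(mean, 1.0) for _ in range(dim)]


if __name__ == "__main__":
    same = streaming_linear_time_mmd(gaussian_stream(0.0, seed=1), gaussian_stream(0.0, seed=2), 2000)
    diff = streaming_linear_time_mmd(gaussian_stream(0.0, seed=1), gaussian_stream(3.0, seed=2), 2000)
    print("same distribution:", same)
    print("shifted mean:", diff)
```

The estimate should stay close to zero when both streams come from the same distribution and move clearly away from zero when the means differ.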
During the GSoC, I implemented the linear time MMD on the basis of SHOGUN’s standard features interface, which made it necessary to hold all data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference) now accepts only streaming features (class reference). This makes it possible to process arbitrarily large amounts of data in a very comfortable way. To avoid the overhead of streaming examples one by one, a block size may be specified: this number of examples is processed at once, and it should be chosen as large as fits into memory.
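The block size idea can be sketched as follows. Again, this is an illustration under my own assumptions and not the SHOGUN implementation: `blockwise_linear_time_mmd` and its parameters are hypothetical names, the data are plain floats, and the kernel is an assumed Gaussian. Each iteration fetches up to `block_size` examples per distribution in one go and then sums the h-statistics of that block.

```python
import math
from itertools import islice


def kernel(a, b, sigma=1.0):
    """Gaussian kernel on scalars (assumed kernel choice for this sketch)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def blockwise_linear_time_mmd(stream_p, stream_q, num_examples, block_size):
    """Linear time MMD over num_examples per distribution, fetched in blocks.

    block_size is the number of examples per distribution held in memory at
    once; it must be even because each h-statistic consumes two examples
    from each stream.
    """
    if block_size < 2 or block_size % 2 != 0:
        raise ValueError("block_size must be an even number >= 2")
    stream_p, stream_q = iter(stream_p), iter(stream_q)
    total, num_terms, remaining = 0.0, 0, num_examples
    while remaining >= 2:
        want = min(block_size, remaining - remaining % 2)
        xs = list(islice(stream_p, want))  # one fetch per block, not per example
        ys = list(islice(stream_q, want))
        usable = min(len(xs), len(ys))
        usable -= usable % 2
        for i in range(0, usable, 2):
            total += (kernel(xs[i], xs[i + 1]) + kernel(ys[i], ys[i + 1])
                      - kernel(xs[i], ys[i + 1]) - kernel(xs[i + 1], ys[i]))
            num_terms += 1
        remaining -= want
        if usable < want:  # a stream ran dry early
            break
    if num_terms == 0:
        raise ValueError("need at least two examples from each stream")
    return total / num_terms


if __name__ == "__main__":
    xs = [0.1 * i for i in range(10)]
    ys = [0.1 * i + 1.0 for i in range(10)]
    print(blockwise_linear_time_mmd(xs, ys, 10, block_size=2))
    print(blockwise_linear_time_mmd(xs, ys, 10, block_size=4))
```

The block size changes only how many examples are fetched per call, not the value of the statistic, so both calls in the usage example give the same result.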
Recall that the linear time MMD is normally distributed when the number of samples is large enough, and that its variance can easily be estimated from the empirical variance of the individual h-statistics (the MMD itself is their mean). The new implementation in SHOGUN does this on the fly using D. Knuth’s online variance algorithm (implementation link). Therefore, a complete two-sample test is now possible in linear time and constant space.
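As a rough sketch of how such a test can run in constant space, the snippet below keeps a running mean and variance of the h-statistics with the one-pass update described in Knuth's TAOCP (usually credited to Welford) and turns them into a p-value using the normal approximation. The names `OnlineMeanVariance` and `linear_time_mmd_test` are hypothetical, and the p-value formula assumes that under the null hypothesis the statistic is approximately normal with mean zero and variance equal to the variance of the h-statistics divided by their number; it is not taken from SHOGUN's source.

```python
import math


class OnlineMeanVariance:
    """One-pass running mean and variance (Welford-style update)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance of the values seen so far."""
        if self.count < 2:
            raise ValueError("variance needs at least two values")
        return self._m2 / (self.count - 1)


def linear_time_mmd_test(h_values):
    """Return (statistic, p_value) from an iterable of h-statistics.

    The iterable is consumed once; nothing but three running numbers is stored.
    """
    acc = OnlineMeanVariance()
    for h in h_values:
        acc.update(h)
    std_err = math.sqrt(acc.variance / acc.count)
    if std_err == 0.0:
        raise ValueError("h-statistics have zero variance")
    z = acc.mean / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return acc.mean, p_value


if __name__ == "__main__":
    statistic, p_value = linear_time_mmd_test(iter([0.9, 1.1, 0.8, 1.2, 1.0, 0.95]))
    print(statistic, p_value)
```

Since `h_values` can be any iterable, it can be a generator that computes h-statistics from streamed data, so the whole test never needs the full data set in memory.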
A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A generator for artificial data that implements SHOGUN’s streaming interface is passed to the MMD class. It produces data from the underlying distributions on the fly, so the data never has to be stored.