by Dr Mikkel Lykkegaard
Updated 8 February 2023
DaFT: DerivAtiveFree Thinning
Data thinning, core subset selection, and data compression are critical but often overlooked techniques in data science, machine learning and datadriven science and engineering. When executed correctly, they can lead to massive speedups of your analytics pipeline.
In this article, we will unpack these topics, show a few clever tricks, and demonstrate how these ideas can be seamlessly integrated into your existing machinelearning workflow.
Introduction
Data thinning is a problem that arises in multiple different contexts in data science and uncertainty quantification. Perhaps you want to fit a particular model to some data, but your dataset is huge, and your model happens to scale poorly with the size of the dataset. A good example is when building a Gaussian Process surrogate model. Or, perhaps you have many thousands of posterior MCMC samples that you want to run through a (more expensive) model, as illustrated in the figure below.
Image sources: Papadimas and Dodwell (2021), The Alan Turing Institute
In such cases, we would like to select an appropriate subset of the data to proceed with, and we refer to this selection process as "thinning". However, the problem can be formulated in multiple other ways:
 Choosing a core subset of the dataset.
 Achieving optimal compression of the data.
The data science team at digiLab has developed a new, lightweight data thinning method (Papadimas et al., 2023) and implemented it in the opensource Python library DaFT. We present here a brief overview of the algorithm and the software library. You can also check out the video and the slides of a presentation that I recently gave at the Workshop on "Multilevel and multifidelity sampling methods in UQ for PDEs" at the ESI Institute at Universität Wien.
To express the problem a bit more formally, consider having a dataset $X=\lbrace x_i \rbrace_{i=1}^{N} \sim P$, distributed according to some distribution $P$. We refer to $X$ as our empirical distribution. How the data has been generated is not important in this context; $P$ could be a Bayesian posterior (when thinning MCMC samples) or the distribution of a physical datagenerating process, e.g. sensor output. The problem is then to choose a subset $Y = \lbrace y_i  y_i \in X\rbrace_{i=1}^{n} \sim Q$ with $n \ll N$ and $Q$ "close to" $P$. The first question is then, what do we mean by "closeness"?
Closeness
In this section, we explore the concept of distance between empirical distributions. Feel free to skip ahead if you are more interested in applications than the underpinning theory.
One way to describe the distance between two empirical distributions $X \sim P$ and $Y \sim Q$ is by Maximum Mean Discrepancy (MMD). Consider a feature map $\phi: \mathcal{X} \to \mathcal{H}$ where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS). Then the MMD can be defined as (see e.g. Gretton et al., 2012):
For example, if simply $\phi(x) = x$, MMD measures the distance between the means of $X$ and $Y$. If $\phi(x) = (x, x^2)$, we also manage to capture the distance between their second moments, that is, their variances, and so on. However, we can make the feature space much richer by using the kernel trick $k(x,y) = \langle \phi(x),\phi(y) \rangle_\mathcal{H}$, like so:
The kernel, however, can be very computationally costly to actually compute, particularly for large, highdimensional datasets. Following Rahimi and Recht (2007), the data $X \in \mathbb{R}^d$ can instead be pushed through a random feature map $z: \mathbb{R}^d \to \mathbb{R}^D$ to a Euclidean inner product space, where the inner product approximates the kernel:
Some popular shiftinvariant kernels can be approximated by
For example, for a Gaussian kernel we use $\omega \sim \mathcal{N}(0,1) \in \mathbb{R}^d$ and $b \sim \mathcal{U}(0, 2\pi)$.
We then have
which is cheap to compute. While this MMD is only exact for an infinitedimensional feature space $\displaystyle{\lim_{D \to \infty}} z(x)^T z(y) = k(x,y)$, we are not that interested in the absolute MMD as such, since essentially what we want is to do is find some subset $Y = \lbrace y_i  y_i \in X\rbrace_{i=1}^{n}$ that minimises it.
Minimising the MMD
We are now faced with the optimisation problem of finding a subset $Y^\star \sim Q$ of our dataset $X \sim P$ that minimises the MMD with respect to $X$:
This problem amounts to choosing $n$ indices of $X$ that best reproduce the empirical distribution of $X$. It is a discrete optimisation problem with $\frac{N!}{n!(Nn)!}$ possible states and possibly many nearoptimal solutions.
This is the type of optimisation problem that is very easily tackled by Genetic Algorithms (GA). In the DaFT software package, we use the excellent opensource Python library DEAP.
Broadly, at generation $k=0$ we initialise a population of candidates ("chromosomes") $\mathcal Y_0 = \lbrace Y_{0,i}\rbrace_{i=1}^M$ with size $M$. For each generation $k$, pairs of chromosomes will be chosen at random, and their genes will be randomly swapped ("mating"). Similarly, single chromosomes will be randomly chosen and randomly perturbed ("mutation"). The fitness of each chromosome is then evaluated using the fitness function $f(Y_{k, i}) = MMD_X(Y_{k, i})$. Finally, the fittest chromosomes will be moved to the next generation $\mathcal Y_{k+1}$ by way of some selection criterion. For more details on Genetic Algorithms, see e.g. Mitchell (1998).
Genetic Algorithm
Repeating this process for sufficiently many generations will yield an optimal subset $Y^\star \subset X$, which, given the heurestic nature of GA, may not be the most optimal subset. However, it is many orders of magnitude cheaper to find than the exact optimum, and for most intents and purposes it doesn't matter.
Examples using DaFT
Example 1: Banana
Consider the banana distribution, were the normally distributed random variables $\tilde x_1, \tilde x_2 \sim \mathcal N(0,1)$ are transformed through the following map:
Our complete dataset $X$ consists of $N = 10000$ samples from this dataset and we are interested in extracting a subset $Y$ of $n = 20$. We could achieve that by just drawing random indices uniformly:
n_sub = 20
Y_random = X[np.random.randint(X.shape[0], size=n_sub), :]
Using DaFT, we could instead employ optimal thinning of the dataset to produce a core set:
daft = DaFT(X, n_sub)
Y_best, _ = daft.run(n=500, ngen=200)
Plotting the complete dataset along with the two different subsets, we get the following:
plt.figure(figsize=(6,4))
plt.scatter(X[:,0], X[:,1], alpha=0.1, c='k')
plt.scatter(Y_random[:,0], Y_random[:,1], alpha=1, c='tomato')
plt.scatter(Y_best[:,0], Y_best[:,1], alpha=1, c='springgreen')
plt.show()
Banana Distribution
In this case, the simple random sampler has oversampled the right tail of the distribution and missed out on the left tail, while DaFT has covered the entire distribution.
Example 2: Gaussian Blobs
In this example, we consider nine Gaussian blobs with different means and $N = 10000$ samples for the complete dataset $X$. Note that the central blob has twice as many samples as the others. For our subset $Y$, we now select $n = 10$ samples from the complete dataset.
Using the exact same approach as outlined above, we get the following output:
Gaussian Blobs
In this case, the simple random sampler has completely missed some of the blobs, while DaFT has taken a single sample from the centre of each of the peripheral blobs and two samples from the central blob, which contains twice as many samples as each of the other blobs.
The Reading

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, A. Smola, A Kernel TwoSample Test, 2012, Journal of Machine Learning Research.

M. Mitchell, An Introduction to Genetic Algorithms, 1998, The MIT Press.

N. Papadimas, M. B. Lykkegaard, T. J. Dodwell, DerivativeFree Thinning (DaFT) with Stochastic Kernel Embeddings, 2023, Manuscript in Preparation.

A. Rahimi, B. Recht, Random Features for LargeScale Kernel Machines, 2007, Advances in Neural Information Processing Systems.
About digiLab
digiLab is an agile, deep tech company operating as a trusted partner to help industry leaders make actionable decisions through digital twin solutions. We invent firstofakind, worldleading machine learning solutions which empower senior decisionmakers to exploit the value locked in their data and models.
Featured Posts
If you found this post helpful, you might enjoy some of these other news updates.
Python In Excel, What Impact Will It Have?
Exploring the likely uses and limitations of Python in Excel
Richard Warburton
Large Scale Uncertainty Quantification
Large Scale Uncertainty Quantification: UMBridge makes it easy!
Dr Mikkel Lykkegaard
Expanding our AI Data Assistant to use Prompt Templates and Chains
Part 2  Using prompt templates, chains and tools to supercharge our assistant's capabilities
Dr Ana RojoEcheburúa