by Dr Mikkel Lykkegaard
Updated 8 February 2023
DaFT: DerivAtive-Free Thinning
Data thinning, core subset selection, and data compression are critical but often overlooked techniques in data science, machine learning and data-driven science and engineering. When executed correctly, they can lead to massive speed-ups of your analytics pipeline.
In this article, we will unpack these topics, show a few clever tricks, and demonstrate how these ideas can be seamlessly integrated into your existing machine-learning workflow.
Introduction
Data thinning is a problem that arises in multiple different contexts in data science and uncertainty quantification. Perhaps you want to fit a particular model to some data, but your dataset is huge, and your model happens to scale poorly with the size of the dataset. A good example is when building a Gaussian Process surrogate model. Or, perhaps you have many thousands of posterior MCMC samples that you want to run through a (more expensive) model, as illustrated in the figure below.
Image sources: Papadimas and Dodwell (2021), The Alan Turing Institute
In such cases, we would like to select an appropriate subset of the data to proceed with, and we refer to this selection process as "thinning". However, the problem can be formulated in multiple other ways:
- Choosing a core subset of the dataset.
- Achieving optimal compression of the data.
The data science team at digiLab has developed a new, lightweight data thinning method (Papadimas et al., 2023) and implemented it in the open-source Python library DaFT. We present here a brief overview of the algorithm and the software library. You can also check out the video and the slides of a presentation that I recently gave at the Workshop on "Multilevel and multifidelity sampling methods in UQ for PDEs" at the ESI Institute at Universität Wien.
To express the problem a bit more formally, consider having a dataset $X = \{x_i\}_{i=1}^N$, distributed according to some distribution $\pi$. We refer to $\hat\pi_X$ as our empirical distribution. How the data has been generated is not important in this context; $\pi$ could be a Bayesian posterior (when thinning MCMC samples) or the distribution of a physical data-generating process, e.g. sensor output. The problem is then to choose a subset $Y \subset X$ with $|Y| \ll |X|$ and $\hat\pi_Y$ "close to" $\hat\pi_X$. The first question is then, what do we mean by "closeness"?
Closeness
In this section, we explore the concept of distance between empirical distributions. Feel free to skip ahead if you are more interested in applications than the underpinning theory.
One way to describe the distance between two empirical distributions $\hat\pi_X$ and $\hat\pi_Y$ is by the Maximum Mean Discrepancy (MMD). Consider a feature map $\varphi: \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS). Then the MMD can be defined as (see e.g. Gretton et al., 2012):

$$\mathrm{MMD}(\hat\pi_X, \hat\pi_Y) = \big\| \, \mathbb{E}_{x \sim \hat\pi_X}[\varphi(x)] - \mathbb{E}_{y \sim \hat\pi_Y}[\varphi(y)] \, \big\|_{\mathcal{H}}.$$
For example, if simply $\varphi(x) = x$, the MMD measures the distance between the means of $\hat\pi_X$ and $\hat\pi_Y$. If $\varphi(x) = (x, x^2)$, we also manage to capture the distance between their second moments, that is, their variances, and so on. However, we can make the feature space much richer by using the kernel trick $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$, like so:

$$\mathrm{MMD}^2(\hat\pi_X, \hat\pi_Y) = \mathbb{E}_{x, x'}[k(x, x')] - 2\,\mathbb{E}_{x, y}[k(x, y)] + \mathbb{E}_{y, y'}[k(y, y')], \quad x, x' \sim \hat\pi_X, \;\; y, y' \sim \hat\pi_Y.$$
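To make this concrete, the squared MMD between two sample sets can be estimated directly from pairwise kernel evaluations. The following is a minimal NumPy sketch, not DaFT's code; the Gaussian kernel, the bandwidth and the function names are assumptions chosen for illustration:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kernel_mmd2(X, Y, sigma=1.0):
    # Biased estimate of MMD^2: mean k(X,X) - 2 mean k(X,Y) + mean k(Y,Y)
    return (gaussian_kernel(X, X, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())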
The kernel, however, can be very computationally costly to actually compute, particularly for large, high-dimensional datasets. Following Rahimi and Recht (2007), the data can instead be pushed through a random feature map $z: \mathcal{X} \to \mathbb{R}^D$ to a Euclidean inner product space, where the inner product approximates the kernel:

$$k(x, y) \approx z(x)^\top z(y).$$
Some popular shift-invariant kernels can be approximated by

$$z(x) = \sqrt{\tfrac{2}{D}} \, \big[ \cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D) \big]^\top, \quad \omega_i \sim p(\omega).$$
For example, for a Gaussian kernel $k(x, y) = \exp\!\big( -\|x - y\|^2 / (2\sigma^2) \big)$ we use $\omega_i \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_i \sim \mathcal{U}(0, 2\pi)$.
We then have

$$\mathrm{MMD}^2(\hat\pi_X, \hat\pi_Y) \approx \Big\| \, \tfrac{1}{N} \sum_{i=1}^{N} z(x_i) - \tfrac{1}{n} \sum_{j=1}^{n} z(y_j) \, \Big\|_2^2,$$
which is cheap to compute. While this MMD is only exact for an infinite-dimensional feature space $\mathcal{H}$, we are not that interested in the absolute MMD as such, since essentially what we want to do is find some subset that minimises it.
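The approximation reduces the MMD estimate to a difference of feature means, which only requires a single pass over each sample set. Here is a minimal NumPy sketch of that idea; it is illustrative only, not DaFT's implementation, and the function name, feature count and kernel bandwidth are assumptions:

import numpy as np

def rff_mmd(X, Y, n_features=200, sigma=1.0, seed=0):
    """Approximate MMD between two sample sets via random Fourier features
    for a Gaussian kernel (Rahimi and Recht, 2007). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random frequencies and phases, shared by both sample sets
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def z(A):
        # z(a) = sqrt(2/D) * cos(a @ W + b), one row of features per sample
        return np.sqrt(2.0 / n_features) * np.cos(A @ W + b)

    # MMD is approximately the distance between the mean feature vectors
    diff = z(X).mean(axis=0) - z(Y).mean(axis=0)
    return np.sqrt(diff @ diff)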
Minimising the MMD
We are now faced with the optimisation problem of finding a subset $Y \subset X$ of our dataset that minimises the MMD with respect to $\hat\pi_X$:

$$Y^* = \operatorname*{arg\,min}_{Y \subset X, \; |Y| = n} \mathrm{MMD}(\hat\pi_X, \hat\pi_Y).$$
This problem amounts to choosing the $n$ indices of $X$ that best reproduce the empirical distribution $\hat\pi_X$ of the full dataset. It is a discrete optimisation problem with $\binom{N}{n}$ possible states and possibly many near-optimal solutions. For instance, with $N = 1000$ and $n = 20$ there are already more than $10^{41}$ candidate subsets.
This is the type of optimisation problem that is very easily tackled by Genetic Algorithms (GA). In the DaFT software package, we use the excellent open-source Python library DEAP.
Broadly, at generation $t = 0$ we initialise a population of $n_{\mathrm{pop}}$ candidates ("chromosomes"), each encoding $n$ indices into the dataset. For each generation $t$, pairs of chromosomes will be chosen at random, and their genes will be randomly swapped ("mating"). Similarly, single chromosomes will be randomly chosen and randomly perturbed ("mutation"). The fitness of each chromosome is then evaluated using the fitness function, which in our case is based on the (approximate) MMD between the candidate subset and the full dataset. Finally, the fittest chromosomes will be moved to the next generation by way of some selection criterion. For more details on Genetic Algorithms, see e.g. Mitchell (1998).
Genetic Algorithm
Repeating this process for sufficiently many generations will yield a near-optimal subset $Y^*$, which, given the heuristic nature of GAs, may not be the exact optimum. However, it is many orders of magnitude cheaper to find than the exact optimum, and for most intents and purposes the difference doesn't matter.
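To make the generational loop concrete, here is a deliberately stripped-down, NumPy-only sketch of a genetic algorithm for subset selection. It is purely illustrative: DaFT delegates this machinery to DEAP, and the function name, operators and parameter values below are assumptions.

import numpy as np

def ga_thin(X, n_sub, fitness, n_pop=100, n_gen=200, p_mut=0.1, seed=0):
    """Toy genetic algorithm: evolve sets of row indices of X that maximise
    `fitness`. Illustrative only -- not the DEAP-based implementation in DaFT."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initial population: random index sets ("chromosomes")
    pop = [rng.choice(N, size=n_sub, replace=False) for _ in range(n_pop)]

    for _ in range(n_gen):
        children = []
        for _ in range(n_pop // 2):
            # Mating: two random parents swap a slice of their genes
            i, j = rng.choice(n_pop, size=2, replace=False)
            a, b = pop[i].copy(), pop[j].copy()
            cut = rng.integers(1, n_sub)
            a[:cut], b[:cut] = b[:cut].copy(), a[:cut].copy()
            for child in (a, b):
                # Mutation: replace a few genes with fresh random indices
                mask = rng.random(n_sub) < p_mut
                child[mask] = rng.integers(0, N, size=int(mask.sum()))
                children.append(child)
        # Selection: keep only the fittest chromosomes for the next generation
        pop = sorted(pop + children, key=lambda idx: fitness(X[idx]), reverse=True)[:n_pop]

    best = pop[0]
    return X[best], best

Calling something like ga_thin(X, 20, lambda Y: -rff_mmd(X, Y)) then mirrors, in spirit, what DaFT does internally when it searches for a subset with minimal MMD.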
Examples using DaFT
Example 1: Banana
Consider the banana distribution, where normally distributed random variables are transformed through a nonlinear map that bends the samples into the characteristic banana shape.
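If you want to reproduce something similar, a banana-shaped dataset can be generated along the following lines. The specific transform $(x_1, x_2) \mapsto (x_1, x_2 + x_1^2)$ and the sample size are assumptions for illustration, not necessarily those used to produce the figures below.

import numpy as np

rng = np.random.default_rng(0)
N = 1_000  # assumed size of the complete dataset

# Standard normal samples, bent into a banana shape by an assumed transform
x = rng.normal(size=(N, 2))
X = np.column_stack([x[:, 0], x[:, 1] + x[:, 0] ** 2])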
Our complete dataset $X$ consists of samples from this distribution, and we are interested in extracting a subset $Y$ of size $n = 20$. We could achieve that by just drawing random indices uniformly:
import numpy as np

n_sub = 20  # size of the subset
Y_random = X[np.random.randint(X.shape[0], size=n_sub), :]  # uniformly random indices
Using DaFT, we could instead employ optimal thinning of the dataset to produce a core set:
daft = DaFT(X, n_sub)                   # set up the thinning problem: choose n_sub points from X
Y_best, _ = daft.run(n=500, ngen=200)   # run the genetic algorithm; ngen is the number of generations
Plotting the complete dataset along with the two different subsets, we get the following:
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], alpha=0.1, c='k')                    # complete dataset
plt.scatter(Y_random[:, 0], Y_random[:, 1], alpha=1, c='tomato')   # uniform random subset
plt.scatter(Y_best[:, 0], Y_best[:, 1], alpha=1, c='springgreen')  # DaFT core set
plt.show()
Banana Distribution
In this case, the simple random sampler has oversampled the right tail of the distribution and missed out on the left tail, while DaFT has covered the entire distribution.
Example 2: Gaussian Blobs
In this example, we consider nine Gaussian blobs with different means as the complete dataset $X$. Note that the central blob has twice as many samples as the others. For our subset $Y$, we now select $n = 10$ samples from the complete dataset.
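A dataset like this can be put together in a few lines of NumPy; the blob centres, spread and per-blob sample counts below are assumptions chosen only to mimic the description above.

import numpy as np

rng = np.random.default_rng(0)

# Nine blob centres on a 3x3 grid (assumed layout)
centres = np.array([(i, j) for i in (-4, 0, 4) for j in (-4, 0, 4)], dtype=float)

# The central blob gets twice as many samples as the others (assumed counts)
counts = [200 if (c == 0).all() else 100 for c in centres]

X = np.vstack([c + rng.normal(scale=0.5, size=(m, 2))
               for c, m in zip(centres, counts)])

# X can then be thinned exactly as in the banana example, e.g. DaFT(X, 10).run(...)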
Using the exact same approach as outlined above, we get the following output:
Gaussian Blobs
In this case, the simple random sampler has completely missed some of the blobs, while DaFT has taken a single sample from the centre of each of the peripheral blobs and two samples from the central blob, which contains twice as many samples as each of the other blobs.
The Reading
- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, A. Smola, A Kernel Two-Sample Test, 2012, Journal of Machine Learning Research.
- M. Mitchell, An Introduction to Genetic Algorithms, 1998, The MIT Press.
- N. Papadimas, M. B. Lykkegaard, T. J. Dodwell, Derivative-Free Thinning (DaFT) with Stochastic Kernel Embeddings, 2023, Manuscript in Preparation.
- A. Rahimi, B. Recht, Random Features for Large-Scale Kernel Machines, 2007, Advances in Neural Information Processing Systems.