by Dr Ana Rojo-Echeburúa

Updated 28 March 2024

# Predict in twinLab

Download the resources for this post here.

This article provides a detailed guide on using the module **Predict**, a powerful tool designed to facilitate predictive modeling tasks within the **twinLab** ecosystem.

We'll start with a brief overview of Bayesian Inference, laying the foundation for the module **Predict**. We'll then introduce Gaussian Processes (GPs), which are not only one of the most important and common modeling methods in the Bayesian setting but also play a crucial role in twinLab's capabilities.

Throughout the article, we'll accompany theoretical concepts with practical implementations. You'll find a code example illustrating how to use the module **Predict** in **twinLab.**

✨ By the end of this article, you will be equipped with the knowledge and tools necessary to confidently use the module within the twinLab ecosystem. ✨

**But... why should you use twinLab for predictions?**

You might be wondering why bother with twinLab when you could make predictions with other available tools.

Well, twinLab simplifies the complex world of Bayesian Inference and Gaussian Processes, making it accessible even to those without an extensive background in data science or mathematics.

With twinLab, you don't need to worry about understanding every intricacy of these methods; *it handles the heavy lifting for you*. Moreover, twinLab doesn't just stop at providing algorithms; it optimises performance, using hardware resources efficiently to ensure speedy computations.

twinLab offers an intuitive solution that streamlines the process and maximises results.

## A brief introduction to Bayesian Inference

Before diving into Bayesian Inference, let's briefly revisit the two main approaches to probability theory, **frequentist** and **Bayesian**, and their respective definitions of probability:

- In the frequentist approach, probability is defined as the relative frequency of events occurring in repeated trials,
- while in the Bayesian approach, probability represents subjective belief based on prior knowledge.

In the context of **machine learning**, this translates to objectively observing data (frequentist) versus updating subjective knowledge as new data become available (Bayesian).

The idea underlying Bayesian thinking is **Bayes' Theorem**, which helps us update our beliefs about an event based on new evidence. In simpler terms, it allows us to adjust our initial assumptions or beliefs about something when we receive new information.

This theorem states that the posterior probability of an event given new evidence is proportional to the likelihood of the evidence given the event, multiplied by the prior probability of the event, and normalised by the marginal likelihood.

💬 **Let's explain this in simple terms:**

Let's say you have an initial belief about something, which we call a "prior belief." As you gather new evidence, Bayes' Theorem helps you combine this new evidence with your initial belief to form a "posterior belief," which is your updated understanding of the situation.

Think of it like updating your guess about the weather for tomorrow. You start with a guess based on what you know today (your prior belief). Then, as you check the weather forecast or look outside and see clouds forming (new evidence), you adjust your guess accordingly (your posterior belief).

Bayes' Theorem provides a structured way to make these adjustments by considering the likelihood of the new evidence given your initial belief, and then combining it with your prior belief to arrive at a more informed conclusion.

This is written as:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Where:

- $P(A | B)$: The probability of event A occurring given that event B has occurred. This is called the **posterior probability**.
- $P(B | A)$: The probability of event B occurring given that event A has occurred. This is called the **likelihood**.
- $P(A)$: The prior probability of event A, representing our initial belief in the probability of A before considering new evidence. This is called the **prior probability**.
- $P(B)$: The probability of event B occurring. This is called the **marginal likelihood** or **evidence**.
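To make the formula concrete, here is a minimal sketch of the weather example, where A is "it rains tomorrow" and B is "clouds are forming". The probabilities are made-up illustrative values, not data, and the function name `posterior` is just a label chosen for this sketch.

```python
def posterior(prior: float, likelihood: float, likelihood_if_not: float) -> float:
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Marginal likelihood P(B), via the law of total probability
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative (assumed) numbers
p_rain = 0.2               # prior P(A)
p_clouds_given_rain = 0.9  # likelihood P(B|A)
p_clouds_given_dry = 0.3   # P(B|not A)

p_rain_given_clouds = posterior(p_rain, p_clouds_given_rain, p_clouds_given_dry)
print(f"P(rain | clouds) = {p_rain_given_clouds:.3f}")
```

With these numbers, seeing clouds raises the belief in rain from 0.2 to roughly 0.43: the evidence is more likely under rain than under dry weather, so the posterior moves towards rain.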

**But... what is Bayesian Inference?**

Well, it is just a method in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

We begin with a prior belief about an event, expressed as a probability distribution. As new evidence (data) becomes available, we update our prior belief using the likelihood, resulting in the posterior belief. This iterative process allows us to reason about our beliefs in terms of probabilities, conditioning them on the available evidence.

### Linear regression in the Bayesian setting

Let's see how Bayesian Inference allows us to update our beliefs about the parameters of a linear regression model based on observed data, resulting in a more informed understanding of the relationship between the variables involved.

linear_regression

- **Prior Distribution (Left)**: Before observing any data, we start with the prior distribution. This represents our initial beliefs about the parameters of the linear regression model, such as the slope and intercept of the line. The prior distribution provides a range of possible values for these parameters and indicates how likely each value is before considering any data.
- **Noisy Data (Middle)**: We then observe some data points, which are represented by the scattered points in the middle image. These data points might not perfectly align with our initial beliefs due to factors like measurement error or randomness, hence the term "noisy data."
- **Update to Posterior Distribution (Right)**: Using Bayes' Theorem, we update our prior beliefs based on the observed data to obtain the posterior distribution. The posterior distribution represents our updated beliefs about the parameters of the linear regression model after taking the observed data into account. It combines the information from the prior distribution with the likelihood of observing the data given different parameter values.
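The prior-to-posterior update can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions (a zero-mean Gaussian prior on the intercept and slope, Gaussian noise with known standard deviation, and synthetic data from a made-up line); it is not how twinLab fits models internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy data from an assumed "true" line y = 0.5 + 2x
true_w = np.array([0.5, 2.0])
noise_std = 0.2
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x])  # design matrix with columns [1, x]
y = Phi @ true_w + rng.normal(0.0, noise_std, size=x.size)

# Prior belief about (intercept, slope): w ~ N(0, prior_var * I)
prior_var = 1.0
prior_precision = np.eye(2) / prior_var

# With a Gaussian prior and Gaussian noise, the posterior is Gaussian too,
# and its mean and covariance have a closed form
noise_precision = 1.0 / noise_std**2
post_cov = np.linalg.inv(prior_precision + noise_precision * Phi.T @ Phi)
post_mean = noise_precision * post_cov @ Phi.T @ y

print("Posterior mean (intercept, slope):", post_mean)
print("Posterior std  (intercept, slope):", np.sqrt(np.diag(post_cov)))
```

The posterior standard deviations are smaller than the prior standard deviation of 1, which is the numerical counterpart of the narrowing from the left panel to the right panel.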

Bayesian Inference excels in scenarios where not a lot of data is available, or where there is some level of uncertainty associated with the data. Data uncertainty can arise from many sources, for example sensor noise, precision limitations, or stochastic simulation methodology (a way of using randomness to create many possible outcomes of a situation, helping us understand how things might happen in real life when there's uncertainty involved).

By incorporating uncertainties into the prior distribution, we can propagate them forward to quantify uncertainty in the posterior result. This approach enables **explainable** and **trustworthy** inferences and predictions.

## Gaussian Processes

Gaussian Processes (GPs) are a mathematical tool that helps us understand and model relationships in data. They are closely related to Bayesian Inference because they are a Bayesian non-parametric method for regression and classification tasks.

Being non-parametric means that they don't assume a fixed number of parameters to describe the relationship between input and output variables. Instead, they model the relationship as a distribution over functions, allowing for flexibility in capturing complex patterns in the data without being constrained to a predefined functional form.

You can learn more about GPs in this article.

In Bayesian inference, we update our beliefs about the likelihood of different outcomes based on observed data, incorporating prior knowledge and uncertainty. Similarly, Gaussian Processes represent a distribution over functions, allowing us to model uncertainty in predictions and update our beliefs about the underlying relationship between inputs and outputs as we observe more data. In essence, Gaussian Processes provide a flexible framework within Bayesian inference for modeling complex relationships and making predictions while accounting for uncertainty.

GPs allow us to quantify uncertainty in our predictions by considering a range of possible functions that could describe our data.

**GPs serve as surrogate models (or emulators) within the twinLab ecosystem.**

A surrogate model is like a simplified copy of a complicated system or process that helps us make predictions and understand the original one without having to deal with all its complexities directly.

Similarly to how a Gaussian distribution is described by a mean and a standard deviation, a GP is completely described by a mean *function* $m(x)$ and a covariance *function* $k(x, x')$:

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$

Here:

- $f(x)$ represents a function drawn from the GP,
- $m(x)$ is the mean function, which provides the expected value of the function at a given input $x$,
- $k(x, x')$ is the kernel, determining how the function values at different inputs $x$ and $x'$ are correlated.

The mean function represents the average behaviour of the function, while the kernel captures how the function values vary with respect to each other across different inputs.

The choice of the kernel depends on the specific requirements of the GP model, and different kernels can be chosen based on the characteristics of the underlying data.

Some commonly used kernels are:

- **Radial Basis Function (RBF)**: Captures smoothness in the data.
- **Linear (LIN)**: Represents linear relationships between variables.
- **Periodic (PER)**: Models periodic patterns in the data.

kernels

*Sample functions drawn from the prior distribution of a GP using different kernel functions. Each kernel function imposes a unique structural bias on the distribution of functions.*

The figure above demonstrates 10 sample functions drawn from the 'bag of functions' defined by a GP without any data, shown for the three different kernel functions. In the absence of data, these samples can be considered to have been drawn from a *prior distribution*. Importantly, the form of a kernel function can be considered to be a kind of inductive bias: GPs built with different kernels would describe completely different distributions, even with the same data.
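A figure like this can be reproduced approximately with NumPy alone. The sketch below uses common textbook forms of the three kernels with arbitrarily chosen hyperparameters (lengthscales and period are assumptions), and draws 10 functions from a zero-mean GP prior for each; it does not use twinLab.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale=0.2):
    """Radial Basis Function kernel: favours smooth functions."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def linear_kernel(x1, x2):
    """Linear kernel: favours straight lines through the origin."""
    return x1[:, None] * x2[None, :]


def periodic_kernel(x1, x2, lengthscale=1.0, period=0.5):
    """Periodic kernel: favours functions that repeat with the given period."""
    dist = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale**2)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
jitter = 1e-6 * np.eye(x.size)  # small diagonal term for numerical stability

prior_samples = {}
for name, kernel in [("RBF", rbf_kernel), ("LIN", linear_kernel), ("PER", periodic_kernel)]:
    cov = kernel(x, x) + jitter
    # Each row is one function drawn from GP(0, k), evaluated on the grid x
    prior_samples[name] = rng.multivariate_normal(np.zeros(x.size), cov, size=10)
    print(name, prior_samples[name].shape)
```

Plotting each row of `prior_samples[name]` against `x` shows the structural bias of each kernel: smooth wiggles for RBF, straight lines for LIN, and repeating patterns for PER.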

posterior_predictive

*Sample functions drawn from the posterior distribution of a GP with the RBF kernel. Red points are (noisy) observation data.*

As shown in the figure above, GPs explicitly model the uncertainty of the predictive function. The model uncertainty is expressed in the form of the **covariance matrix** defined by the kernel function.

In simpler terms, the covariance matrix is like a table that tells us how related each pair of points in our dataset is. We calculate these relationships using the kernel function, which determines the covariance (or similarity) between any two points in the input space; for many kernels, it decreases as the distance between the points increases.

This matrix helps us understand the patterns in our data and how one point's value might relate to another's.

For instance, the Radial Basis Function (RBF) assigns higher similarity (or covariance) to points that are closer together and lower similarity to points that are farther apart. It looks at the distance between each pair of points and calculates how similar they are. If two points are close, they'll have a high similarity score; if they're far apart, the score will be lower.
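As a small illustration, the following sketch builds the RBF covariance matrix for four input points. The lengthscale of 0.2 and the point locations are arbitrary choices made for this example.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale=0.2):
    """RBF covariance between every pair of points in x1 and x2."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)


points = np.array([0.0, 0.1, 0.5, 1.0])
cov = rbf_kernel(points, points)

# Rows and columns follow the order of `points`
print(np.round(cov, 3))
```

The diagonal entries equal 1 (each point is perfectly correlated with itself), the entry for the nearby pair (0.0, 0.1) is about 0.88, and the entry for the distant pair (0.0, 1.0) is practically zero.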

In the absence of data, both the predictive mean and the predictive uncertainty tend towards those of the prior distribution, whereas data clamps the prediction to the observations.
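This behaviour can be checked with the standard GP regression equations. The sketch below assumes a zero prior mean, an RBF kernel with an arbitrary lengthscale, three made-up observations and a small noise variance; it is a plain NumPy illustration rather than twinLab's implementation.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale=0.15):
    """RBF covariance between every pair of points in x1 and x2."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)


# A few observations (assumed values) and a grid of test inputs
x_train = np.array([0.1, 0.4, 0.5])
y_train = np.array([0.3, -0.2, 0.1])
x_test = np.linspace(0.0, 1.0, 101)
noise_var = 1e-4  # assumed small observation noise

K = rbf_kernel(x_train, x_train) + noise_var * np.eye(x_train.size)
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# GP posterior mean and covariance at the test inputs
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

print("std at x = 0.4 (observed):      ", post_std[40])
print("std at x = 1.0 (far from data): ", post_std[100])
```

Near the observation at x = 0.4 the standard deviation collapses to roughly the noise level, while at x = 1.0, far from any data, the mean returns to the prior mean of 0 and the standard deviation returns to the prior value of about 1.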

This uncertainty quantification is crucial for applications where safety and reliability are paramount, making GPs an ideal choice for uncertainty-aware and explainable modeling in twinLab.

You can learn more about kernels in this article.

## How the module Predict works under the hood

The Predict module uses the power of GPs along with Bayesian Inference and uncertainty quantification techniques to provide reliable predictions and insights for various predictive modeling tasks.

- **Data input:** Users provide input data containing features (predictors) and the corresponding target variables (responses), which are the desired outputs.
- **Model training:** During training, the GP learns the underlying patterns and relationships in the data.
- **Prediction:** Once the GP model is trained, users can make predictions for new input data points.
- **Uncertainty quantification:** In addition to predicting the mean response, the GP also provides uncertainty estimates for each prediction. This uncertainty quantification is a key feature of GPs, allowing users to assess the reliability of the predictions.
- **Bayesian inference:** Bayesian inference techniques are used to update the model's beliefs about the underlying function based on observed data. This allows the GP to make informed predictions that incorporate both prior knowledge and new evidence.
- **Optimisation:** The Predict module also includes optimisation algorithms to optimise the Gaussian Process model.

## Hands-on example

**twinLab** empowers its users to use the **Predict** module seamlessly through a Python interface.

All the underlying technical details required to effectively fit a model or emulator to data are taken care of within the software itself, without compromising the user's ability to tune the software to their specific engineering problems.

### 1-D Scenario: Training, Prediction, and Uncertainty Quantification

In this example, we will demonstrate how to train an emulator on a dataset with one input variable and one output variable.

The goal is to make accurate predictions for the output variable based on new input values, while also understanding the uncertainty associated with each prediction.

By visualising the predictions and uncertainties, users can gain insights into the behavior of the underlying function and make informed decisions based on the model's outputs.

#### Set Up

First, we import the required libraries: `pprint`, `numpy`, `pandas`, `matplotlib` and `twinlab`.

```
# Standard imports
from pprint import pprint
# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Project imports
import twinlab as tl
```

You will need to set up your API key.

⚠️ Remember not to share your API key publicly in your code or any public repositories to maintain security.

```
api_key = "your_api_key_here" # Set your API key here
tl.set_api_key(api_key)
```
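One common way to avoid hardcoding the key is to read it from an environment variable. The sketch below assumes a variable named `TWINLAB_API_KEY` and a helper called `load_api_key`; both names are illustrative choices for this example and are not taken from twinLab itself.

```python
import os


def load_api_key(env_var: str = "TWINLAB_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return api_key


# Usage, with `tl.set_api_key` as in the snippet above:
# tl.set_api_key(load_api_key())
```

This keeps the key out of the source file, so the script can be shared or committed without exposing it.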

#### Create a Dataset and Upload it to twinLab

In twinLab, datasets must be either in the form of a `pandas.DataFrame` object or as filepaths pointing to CSV files that can be converted into a `pandas.DataFrame`.

⚠️ It's important that both formats have clearly labeled columns.

In this example, the input (predictor) variable is labeled as `x`, and the output variable as `y`. Data in twinLab is expected to follow a column-feature format, where each row represents a single data sample and each column represents a data feature.

We start by importing `random` for generating random numbers. Then, we define the length of the lists to be generated as `list_length = 10`. Subsequently, we generate a list `x` containing `list_length` random floating-point numbers between 0 and 1 using a list comprehension with `random.random()`, and another list `y` containing random floating-point numbers between -1 and 1. We create a DataFrame `df` using `pandas`, where the 'x' column corresponds to the values in list `x` and the 'y' column corresponds to the values in list `y`. Finally, we display the contents of the DataFrame `df` to visually inspect the generated dataset before further processing. The `display` function is available in Jupyter notebooks; in a plain Python script, use `print(df)` instead.

```
import random
# Define the length of the lists
list_length = 10
# Generate random numbers for list x
x = [random.random() for _ in range(list_length)]
# Generate random numbers for list y
y = [random.random() * 2 - 1 for _ in range(list_length)]
# Creating the dataframe using the above arrays
df = pd.DataFrame({"x": x, "y": y})
# View the dataset before uploading
display(df)
```
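Because the data above is unseeded, every run produces a different dataset. If reproducible results are wanted, a seeded generator can be used instead. The following is a minimal sketch; the seed value and the name `df_seeded` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (arbitrary choice) for reproducibility
list_length = 10

df_seeded = pd.DataFrame(
    {
        "x": rng.uniform(0.0, 1.0, size=list_length),   # inputs in [0, 1)
        "y": rng.uniform(-1.0, 1.0, size=list_length),  # outputs in [-1, 1)
    }
)
print(df_seeded)
```

The resulting DataFrame has the same column-feature layout as `df`, so it could be uploaded in exactly the same way.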

twinLab offers a `Dataset` class equipped with attributes and methods for processing, viewing, and summarising datasets. To access datasets in twinLab, they must be created with a `dataset_id`, which serves as the identifier for accessing them. Datasets can be uploaded into twinLab using the `upload` method.

We define the name of the dataset as "example_data" using the variable `dataset_id`. Next, we initialise a `Dataset` object named `dataset` with the specified dataset identifier. Then, we proceed to upload the dataset represented by the DataFrame `df` into the twinLab environment using the `upload` method of the `dataset` object, with the parameter `verbose=True` indicating that the upload process should provide detailed feedback.

```
# Define the name of the dataset
dataset_id = "example_data"
# Initialise a Dataset object
dataset = tl.Dataset(id=dataset_id)
# Upload the dataset
dataset.upload(df, verbose=True)
```

**Output:**
Dataframe is uploading.
Processing dataset
Dataset example_data was processed.

#### Train an emulator

The `Emulator` class is responsible for training and using surrogate models. Similar to datasets, we assign an identifier to the model, which is how it will be saved in the cloud.

We set the name of the emulator as "example_emulator" by assigning it to the variable `emulator_id`. Then, we create an instance of the `Emulator` class named `my_emulator` with the specified identifier. We use the `train` method of the `my_emulator` object, providing the dataset we want to train the emulator on, denoted as `dataset`, along with specifying the input variable "x" and the output variable "y" using the `inputs` and `outputs` arguments respectively.

```
# Define the name of the emulator
emulator_id = "example_emulator"
# Initialise the emulator
my_emulator = tl.Emulator(id=emulator_id)
# Use the train method
my_emulator.train(dataset=dataset,
                  inputs=["x"],
                  outputs=["y"])
```

#### Prediction Using the Trained Emulator

The surrogate model is now trained and saved to the cloud under the `emulator_id`. It can now be used to make predictions.

We define the evaluation inputs using NumPy's `linspace` function, creating an array `x_eval` containing 128 evenly spaced values between 0 and 1. Subsequently, we convert this array into a DataFrame `df_eval` with the column labeled as "x", and then display the DataFrame to inspect the generated input data. Next, we use the trained emulator `my_emulator` to predict the results for the input data contained in `df_eval`. The predictions are obtained using the `predict` method of `my_emulator`, and then stored in `predictions`. After that, we concatenate the predicted mean and standard deviation values into a single DataFrame `result_df` along the column axis. Finally, we extract the mean and standard deviation columns from `result_df` and convert them into NumPy arrays `df_mean` and `df_stdev` respectively, before printing the head of `result_df` to observe the first few rows of the predictions.

```
# Define the inputs for the dataset
x_eval = np.linspace(0, 1, 128)
# Convert to a dataframe
df_eval = pd.DataFrame({"x": x_eval})
display(df_eval)
# Predict the results
predictions = my_emulator.predict(df_eval)
result_df = pd.concat([predictions[0], predictions[1]], axis=1)
df_mean, df_stdev = result_df.iloc[:,0], result_df.iloc[:,1]
df_mean, df_stdev = df_mean.values, df_stdev.values
print(result_df.head())
```

The output of the `predict` method includes two main components:

- **Prediction mean**: This is the expected or average prediction for each data point.
- **Prediction standard deviation**: This indicates the level of uncertainty associated with each prediction.

Together, these components provide valuable insights into the model's predictions. The prediction mean gives you a central estimate, while the standard deviation offers a measure of the prediction's reliability. A higher standard deviation suggests greater uncertainty, while a lower standard deviation implies a more confident prediction.
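One way to use the two components together is to turn them into an approximate 95% interval around each prediction. The sketch below uses small stand-in arrays in place of `df_mean` and `df_stdev` (so it runs without a trained emulator) and assumes the predictive distribution at each point is Gaussian.

```python
import numpy as np

# Stand-in values; in the article these would be `df_mean` and `df_stdev`
mean = np.array([0.10, 0.25, 0.40])
stdev = np.array([0.05, 0.20, 0.02])

# For a Gaussian, about 95% of the probability mass lies within
# 1.96 standard deviations of the mean
lower = mean - 1.96 * stdev
upper = mean + 1.96 * stdev

for m, lo, hi in zip(mean, lower, upper):
    print(f"mean={m:.2f}  95% interval=[{lo:.3f}, {hi:.3f}]")
```

The second point has the widest interval because its standard deviation is the largest, which is exactly the "greater uncertainty" reading described above.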

#### Viewing the Predictions

`Emulator.predict` outputs mean values for each input and their standard deviation; this makes it possible to visualise the uncertainty in the results.

We use the trained emulator `my_emulator` to create a plot where the x-axis represents the input variable labeled as 'x' and the y-axis represents the output variable labeled as 'y'. The plot is labeled as "Emulator predictions" using the `label` parameter. Additionally, we set the x-axis limits to be between 0 and 1 using `x_lim=(0,1)`. Then, we overlay the training data points from the DataFrame `df`, with the input values from column 'x' and the output values from column 'y', represented as red scatter points. Finally, we add a legend to the plot to distinguish between the emulator predictions and the training data, and then display the plot.

```
plt = my_emulator.plot(x_axis='x',y_axis='y',label="Emulator predictions", x_lim=(0,1))
plt.scatter(df['x'], df['y'], color='r', label='Training data')
plt.legend()
plt.show()
```

output

#### Sampling from the emulator

The `Emulator.sample` function can be used to retrieve a number of results from your model. It requires the inputs for which you want the values and how many outputs to calculate for each.

We define a set of sample inputs using NumPy's `linspace` function, creating an array containing 128 evenly spaced values between 0 and 1, and then converting it into a DataFrame `sample_inputs` with the column labeled as "x". Next, we specify the number of samples to be calculated for each input as 100, stored in the variable `num_samples`. Using the trained emulator `my_emulator`, we calculate the samples for the provided input data `sample_inputs`, generating 100 samples for each input point. The results are stored in the `sample_result` variable.

```
# Define the sample inputs
sample_inputs = pd.DataFrame({"x": np.linspace(0, 1, 128)})
# Define number of samples to calculate for each input
num_samples = 100
# Calculate the samples
sample_result = my_emulator.sample(sample_inputs, num_samples)
```

We display the results of the sample calculations, represented as a DataFrame, using the `display` function. This DataFrame contains the sample outputs generated by the emulator for each input point specified in `sample_inputs`. The `sample_result` DataFrame provides insight into the variation and distribution of the emulator's predictions across the range of input values provided.

```
# View the results in the form of a dataframe
display(sample_result)
```
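Samples like these can also be summarised numerically. The sketch below uses a random stand-in array rather than the real `sample_result`, and assumes a layout with one row per input point and one column per sample; check the actual structure of `sample_result` before adapting it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sampled outputs: one row per input point, one column per sample.
# (This layout is an assumption, not the documented shape of `sample_result`.)
n_inputs, n_samples = 128, 100
samples = rng.normal(loc=0.0, scale=0.3, size=(n_inputs, n_samples))

# Empirical summary across samples, for each input point
sample_mean = samples.mean(axis=1)
sample_std = samples.std(axis=1, ddof=1)
lower, upper = np.percentile(samples, [2.5, 97.5], axis=1)

print(sample_mean.shape, sample_std.shape, lower.shape, upper.shape)
```

With enough samples, the empirical mean and spread at each input should approach the mean and standard deviation returned by `predict`.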

The results can be plotted, giving a nice visualisation of the sampled data together with the model's uncertainty.

We set up parameters for plotting: defining colors for curves and data points, setting the transparency level (`alpha_curve`), and specifying whether to plot training data (`plot_training_data`) and model bands (`plot_model_bands`). If `plot_training_data` is True, we plot the training data points from DataFrame `df` with 'x' values against 'y' values. Then, we plot the sample results generated by the emulator for the given input data `sample_inputs`. We limit the x-axis to the range between 0 and 1 using `plt.xlim((0.0, 1.0))`, and label the x-axis as "$X$" and the y-axis as "$y$". Finally, we add a legend to the plot and display it. This plot provides a visualisation of both the training data and the samples drawn from the model, allowing for visual assessment of the emulator's performance and the distribution of predicted values across the input range.

```
# Plot parameters
color_curve = "deepskyblue"
alpha_curve = 0.10
color_data = "red"
plot_training_data = True
plot_model_bands = False
# Plot samples drawn from the model
if plot_training_data:
    plt.plot(df["x"], df["y"], ".", color=color_data, label="Training data", markersize=10)
plt.plot(sample_inputs, sample_result["y"], color=color_curve, alpha=alpha_curve)
plt.xlim((0.0, 1.0))
plt.xlabel(r"$X$")
plt.ylabel(r"$y$")
plt.legend()
plt.show()
```

output2

#### Deleting emulators and datasets

With `my_emulator.delete()`, we remove the emulator that we previously created. This action ensures that any resources associated with this emulator, such as trained models or metadata, are cleared from memory or storage.

Using `dataset.delete()`, we remove the dataset that we created and used for training the emulator. By deleting this dataset, we free up resources and prevent it from being accessible for further analysis or training.

Deleting these objects is a cleanup step: it ensures that no unnecessary resources are retained after we've finished using the emulator and dataset, which helps optimise resource usage.

```
# Delete dataset
dataset.delete()
# Delete emulator
my_emulator.delete()
```

### Emulators in higher dimensions

In many real-world situations, there's not just one factor influencing an outcome; there are often several.

Imagine you're trying to predict something like temperature, but you know it's not just one thing that affects it: it could be humidity, time of day, and more.

**Emulating these relationships means capturing how all these factors together influence the outcome.**

twinLab recognises this complexity and makes it easy to handle. With just a small tweak to the code we've already seen, you can model the relationships between multiple input variables.

The only portion of the code that needs to be modified is the following:

```
# Use the train method
my_emulator.train(dataset=dataset,
                  inputs=["x0", "x1", ...],
                  outputs=["y"])
```

Let's break this down.

- `my_emulator.train()`: This is the function used to train the emulator, just like before.
- `dataset`: This is your dataset, which now includes multiple input variables (like humidity, time of day, etc.) as well as the output variable you're trying to predict (like temperature).
- `inputs`: Here, you specify all the input variables you want the emulator to consider when making predictions. You list them inside the square brackets, separating each variable's name with a comma.
- `outputs`: This is the output variable you're trying to predict, just like before. If you're trying to predict more than one thing, make sure to list all those outputs in the `outputs` parameter. So, if you're trying to predict temperature, humidity, and wind speed, for example, you'd list all three variables inside the `outputs` parameter.
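For a dataset to work with several inputs, each input needs its own labeled column. The sketch below builds such a DataFrame with two hypothetical inputs, `x0` (humidity) and `x1` (time of day), and one output `y` (temperature); the relationship between them is made up purely for illustration, and the name `df_multi` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 20

# Hypothetical inputs: humidity (x0) and time of day in hours (x1)
x0 = rng.uniform(0.0, 1.0, size=n_samples)
x1 = rng.uniform(0.0, 24.0, size=n_samples)
# Made-up relationship for the output: temperature (y)
y = 15.0 + 5.0 * np.sin(2 * np.pi * x1 / 24.0) - 3.0 * x0

# Column-feature format: one row per sample, one column per variable
df_multi = pd.DataFrame({"x0": x0, "x1": x1, "y": y})

inputs = ["x0", "x1"]
outputs = ["y"]
# Every name passed to `inputs` and `outputs` should be a column of the dataset
assert set(inputs + outputs) <= set(df_multi.columns)
print(df_multi.head())
```

A DataFrame like this could then be uploaded as before, with the `inputs` and `outputs` lists passed to the `train` method.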

GPs are known for their ability to smoothly interpolate between data points, meaning they can provide predictions for points in the input space even if they were not directly observed in the training data. This interpolation capability allows GPs to provide reliable predictions even in scenarios where data observations are sparse or noisy. The model can effectively fill in the gaps between observed data points, resulting in a comprehensive understanding of the underlying function's behavior.

## Resources

The complete code can be downloaded from the resource panel at the top of this article.

## Conclusion

The Predict module within twinLab is a powerful tool that gives the user a principled way to handle uncertainty while making accurate predictions.

By seamlessly integrating Gaussian Processes and Bayesian Inference, Predict empowers users to gain actionable insights while minimising the complexities traditionally associated with predictive modeling, without the need for extensive programming expertise.

Its intuitive interface and user-friendly design streamline the modeling process, allowing users to focus on deriving meaningful insights from their data.

Take the next step and experience the unparalleled capabilities of twinLab. Dive in and discover how Predict can revolutionise your approach to machine learning and decision-making.
