Predict in twinLab

📂 Resources

Download the resources for this post here.

This article provides a detailed guide on using the module Predict, a powerful tool designed to facilitate predictive modeling tasks within the twinLab ecosystem.

We'll start by providing a brief overview of Bayesian Inference, laying the foundation for the module Predict. Additionally, we'll introduce Gaussian Processes (GPs), which are not only one of the most important and common modeling methods in the Bayesian setting but also play a crucial role in twinLab's capabilities.

Throughout the article, we'll accompany theoretical concepts with practical implementations. You'll find a code example illustrating how to use the module Predict in twinLab.


✨By the end of this article, you will be equipped with the knowledge and tools necessary to confidently use the module within the twinLab ecosystem.✨

💭 But...why should you use twinLab for predictions?

You might be wondering why bother with twinLab when you could make predictions with other available tools.

Well, twinLab simplifies the complex world of Bayesian Inference and Gaussian Processes, making it accessible even to those without an extensive background in data science or mathematics.

With twinLab, you don't need to worry about understanding every intricacy of these methods; it handles the heavy lifting for you. Moreover, twinLab doesn't just stop at providing algorithms; it optimises performance, using hardware resources efficiently to ensure speedy computations.

twinLab offers an intuitive solution that streamlines the process and maximises results.

A brief introduction to Bayesian Inference

Before diving into Bayesian Inference, let's briefly revisit the two main approaches to probability theory, frequentist and Bayesian, and how each one defines probability:

In the frequentist approach, probability is defined as the relative frequency of events occurring in repeated trials,
while in the Bayesian approach, probability represents subjective belief based on prior knowledge.

In the context of machine learning, this translates to objectively observing data (frequentist) versus updating subjective knowledge as new data become available (Bayesian).

The idea underlying Bayesian thinking is Bayes' Theorem, a fundamental concept that helps us update our beliefs about an event based on new evidence. In simpler terms, it allows us to adjust our initial assumptions or beliefs about something when we receive new information.

This theorem states that the posterior probability of an event given new evidence is proportional to the likelihood of the evidence given the event, multiplied by the prior probability of the event, and normalised by the marginal likelihood.

💬 Let's explain this in simple terms:

Let's say you have an initial belief about something, which we call a "prior belief." As you gather new evidence, Bayes' Theorem helps you combine this new evidence with your initial belief to form a "posterior belief," which is your updated understanding of the situation.

Think of it like updating your guess about the weather for tomorrow. You start with a guess based on what you know today (your prior belief). Then, as you check the weather forecast or look outside and see clouds forming (new evidence), you adjust your guess accordingly (your posterior belief).

Bayes' Theorem provides a structured way to make these adjustments by considering the likelihood of the new evidence given your initial belief, and then combining it with your prior belief to arrive at a more informed conclusion.

This is written as:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

Where:

$P(A | B)$ : The probability of event A occurring given that event B has occurred. This is called the posterior probability.
$P(B | A)$ : The probability of event B occurring given that event A has occurred. This is called the likelihood.
$P(A)$ : The probability of event A before considering new evidence, representing our initial belief. This is called the prior probability.
$P(B)$ : The probability of event B occurring. This is called the marginal likelihood or evidence.
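To make the formula concrete, here is a small worked example in Python that reuses the weather analogy. The numbers are invented purely for illustration: A stands for "it rains tomorrow" and B for "clouds are forming this evening".

```python
# Hypothetical numbers, chosen only to illustrate Bayes' Theorem.
p_rain = 0.20                # P(A): prior belief that it rains tomorrow
p_clouds_given_rain = 0.90   # P(B|A): likelihood of clouds if rain is coming
p_clouds_given_dry = 0.30    # P(B|not A): clouds can also appear on dry days

# P(B): marginal likelihood, summing over both possibilities for A
p_clouds = p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)

# P(A|B): posterior belief after seeing the clouds
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds

print(f"Prior P(rain) = {p_rain:.2f}")
print(f"Posterior P(rain | clouds) = {p_rain_given_clouds:.2f}")
```

With these made-up inputs, seeing clouds raises the belief in rain from 0.20 to about 0.43.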

💭 But...what is Bayesian Inference?

Well, it is just a method in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

We begin with a prior belief about an event, expressed as a probability distribution. As new evidence (data) becomes available, we update our prior belief using the likelihood, resulting in the posterior belief. This iterative process allows us to reason about our beliefs in terms of probabilities, conditioning them on the available evidence.
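The iterative nature of this process can be sketched with a classic toy problem that is not specific to twinLab: estimating the probability that a coin lands heads. The sketch assumes a Beta prior, for which each new observation simply updates two counts. The observations are made up.

```python
# Beta-Bernoulli updating: belief about a coin's probability of heads.
# Prior Beta(a, b) with a = b = 1 is a uniform prior (an assumption for this sketch).
a, b = 1.0, 1.0
observations = [1, 0, 1, 1, 0, 1, 1, 1]  # 1 = heads, 0 = tails (made-up data)

for i, obs in enumerate(observations, start=1):
    a += obs        # count of heads seen so far (plus prior)
    b += 1 - obs    # count of tails seen so far (plus prior)
    posterior_mean = a / (a + b)
    print(f"After {i} observations: posterior mean = {posterior_mean:.3f}")
```

Each pass through the loop uses the previous posterior as the new prior, which is exactly the "update as evidence arrives" idea described above.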

Linear regression in the Bayesian setting

Let's see how Bayesian Inference allows us to update our beliefs about the parameters of a linear regression model based on observed data, resulting in a more informed understanding of the relationship between the variables involved.

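Figure: Bayesian linear regression, showing the prior distribution (left), the noisy data (middle), and the posterior distribution (right).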

Prior Distribution (Left): Before observing any data, we start with the prior distribution. This represents our initial beliefs about the parameters of the linear regression model, such as the slope and intercept of the line. The prior distribution provides a range of possible values for these parameters and indicates how likely each value is before considering any data.
Noisy Data (Middle): We then observe some data points, which are represented by the scattered points in the middle image. These data points might not perfectly align with our initial beliefs due to factors like measurement error or randomness, hence the term "noisy data."
Update to Posterior Distribution (Right): Using Bayes' Theorem, we update our prior beliefs based on the observed data to obtain the posterior distribution. The posterior distribution represents our updated beliefs about the parameters of the linear regression model after taking the observed data into account. It combines the information from the prior distribution with the likelihood of observing the data given different parameter values.
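The three steps above can be sketched numerically. This is a minimal illustration, not twinLab code: it assumes a Gaussian prior on the intercept and slope and a known noise level, in which case the posterior has a closed form. The "true" line and all values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy data from a hypothetical "true" line y = 0.5 + 2x
true_w = np.array([0.5, 2.0])   # [intercept, slope]
noise_std = 0.2                 # assumed known noise level
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
y = Phi @ true_w + rng.normal(0.0, noise_std, size=x.shape)

# Prior: w ~ N(0, prior_var * I)
prior_var = 1.0
prior_cov = prior_var * np.eye(2)

# Posterior via the standard conjugate Gaussian update
beta = 1.0 / noise_std**2       # noise precision
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + beta * Phi.T @ Phi)
post_mean = beta * post_cov @ Phi.T @ y

print("Posterior mean [intercept, slope]:", post_mean)
print("Posterior std  [intercept, slope]:", np.sqrt(np.diag(post_cov)))
```

The posterior standard deviations come out smaller than the prior ones, reflecting that the data have narrowed down the plausible lines.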

Bayesian Inference excels in scenarios where little data is available, or where there is some level of uncertainty associated with the data. Data uncertainty can arise from many sources, for example sensor noise, precision limitations, or stochastic simulation methodology (a way of using randomness to create many possible outcomes of a situation, helping us understand how things might happen in real life when there's uncertainty involved).

By incorporating uncertainties into the prior distribution, we can propagate them forward to quantify uncertainty in the posterior result. This approach enables interpretable, explainable, and trustworthy inferences and predictions.

Gaussian Processes

Gaussian Processes (GPs) are a mathematical tool that helps us understand and model relationships in data. They are closely related to Bayesian Inference because they are a Bayesian non-parametric method for regression and classification tasks.

Being non-parametric means that they don't assume a fixed number of parameters to describe the relationship between input and output variables. Instead, they model the relationship as a distribution over functions, allowing for flexibility in capturing complex patterns in the data without being constrained to a predefined functional form.

📝 You can learn more about GPs in this article.

In Bayesian inference, we update our beliefs about the likelihood of different outcomes based on observed data, incorporating prior knowledge and uncertainty. Similarly, Gaussian Processes represent a distribution over functions, allowing us to model uncertainty in predictions and update our beliefs about the underlying relationship between inputs and outputs as we observe more data. In essence, Gaussian Processes provide a flexible framework within Bayesian inference for modeling complex relationships and making predictions while accounting for uncertainty.

GPs allow us to quantify uncertainty in our predictions by considering a range of possible functions that could describe our data.

GPs serve as surrogate models (or emulators) within the twinLab ecosystem.

A surrogate model is like a simplified copy of a complicated system or process that helps us make predictions and understand the original one without having to deal with all its complexities directly.

Similarly to how a Gaussian distribution is described by a mean and a standard deviation, a GP is completely described by a mean function $m(x)$ and a covariance function $k(x, x')$ :

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$

Here:

$f(x)$ represents a function drawn from the GP,
$m(x)$ is the mean function, which provides the expected value of the function at a given input $x$ ,
$k(x, x')$ is the kernel, determining how the function values at different inputs $x$ and $x'$ are correlated.

The mean function represents the average behaviour of the function, while the kernel captures how the function values vary with respect to each other across different inputs.

The choice of the kernel depends on the specific requirements of the GP model, and different kernels can be chosen based on the characteristics of the underlying data.

Some commonly used kernels are:

Radial Basis Function (RBF): Captures smoothness in the data.

$k_{\textrm{RBF}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$

Linear (LIN): Represents linear relationships between variables.

$k_{\textrm{LIN}}(x, x') = \sigma_b^2 + \sigma_v^2(x - c)(x' - c)$

Periodic (PER): Models periodic patterns in the data.

$k_{\textrm{PER}}(x, x') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right)$
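For readers who prefer code to formulas, the three kernels can be written directly in NumPy. This is a sketch for one-dimensional inputs only; the function names and default hyperparameter values are arbitrary choices for illustration and are not part of twinLab.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.2):
    """k_RBF(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-diff**2 / (2.0 * lengthscale**2))

def linear_kernel(x1, x2, bias_var=0.5, slope_var=1.0, offset=0.0):
    """k_LIN(x, x') = bias_var + slope_var * (x - c)(x' - c)."""
    return bias_var + slope_var * np.outer(x1 - offset, x2 - offset)

def periodic_kernel(x1, x2, variance=1.0, lengthscale=1.0, period=0.5):
    """k_PER(x, x') = variance * exp(-2 sin^2(pi |x - x'| / p) / lengthscale^2)."""
    dist = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale**2)

# Covariance matrix of five evenly spaced inputs under the RBF kernel
x = np.linspace(0.0, 1.0, 5)
K = rbf_kernel(x, x)
print(np.round(K, 3))
```

Each function returns a matrix whose entry (i, j) is the kernel evaluated at the i-th point of `x1` and the j-th point of `x2`.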


Sample functions drawn from the prior distribution of a GP using different kernel functions. Each kernel function imposes a unique structural bias on the distribution of functions.

The figure above demonstrates 10 sample functions drawn from the 'bag of functions' defined by a GP without any data, shown for the three different kernel functions. In the absence of data, these samples can be considered to have been drawn from a prior distribution. Importantly, the form of a kernel function can be considered to be a kind of inductive bias: GPs built with different kernels would describe completely different distributions, even with the same data.
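Drawing such prior samples takes only a few lines. The sketch below uses the RBF kernel with a zero mean function and arbitrary hyperparameters; it illustrates the general technique rather than how twinLab generates its figures.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)

# RBF covariance matrix between all pairs of inputs (hyperparameters are arbitrary)
variance, lengthscale = 1.0, 0.2
K = variance * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * lengthscale**2))

# A small "jitter" on the diagonal keeps the Cholesky factorisation stable
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Zero mean function: each column of `samples` is one function drawn from the prior
n_samples = 10
samples = L @ rng.standard_normal((len(x), n_samples))

print(samples.shape)  # (50, 10)
```

Swapping in a different covariance matrix `K` changes the character of the sampled functions, which is the inductive bias mentioned above.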

Sample functions drawn from the posterior distribution of a GP with the RBF kernel. Red points are (noisy) observation data.

As shown in the figure above, GPs explicitly model the uncertainty of the predictive function. The model uncertainty is expressed in the form of the covariance matrix defined by the kernel function.

In simpler terms, the covariance matrix is like a table that tells us how related each pair of points in our dataset is. We calculate these relationships using the kernel function, which determines how the covariance (or similarity) between any two points in the input space decreases as the distance between them increases.

This matrix helps us understand the patterns in our data and how one point's value might relate to another's.

For instance, the Radial Basis Function (RBF) assigns higher similarity (or covariance) to points that are closer together and lower similarity to points that are farther apart. It looks at the distance between each pair of points and calculates how similar they are. If two points are close, they'll have a high similarity score; if they're far apart, the score will be lower.

In the absence of data, both the model mean and the prediction tend towards the prior distribution, whereas data clamps the prediction to the observations.
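This behaviour can be checked with the standard GP regression equations. The sketch below is plain NumPy, not twinLab internals, and uses three made-up training points with one test point close to the data and one far from it.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=0.15):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return variance * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * lengthscale**2))

# Hypothetical training data: three observations clustered on the left
x_train = np.array([0.1, 0.2, 0.3])
y_train = np.array([0.5, 0.9, 0.4])
noise_var = 1e-4  # assumed (small) observation noise

# One test point on top of the data, one far away from it
x_test = np.array([0.2, 2.0])

K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

# Standard GP regression equations with a zero prior mean
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

for xt, m, s in zip(x_test, post_mean, post_std):
    print(f"x = {xt:.1f}: mean = {m:+.3f}, std = {s:.3f}")
```

At the test point near the data, the mean sits close to the observed value with a small standard deviation. At the distant point, the mean falls back to the prior mean of zero and the standard deviation returns to the prior value of one.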

This uncertainty quantification is crucial for applications where safety and reliability are paramount, making GPs an ideal choice for uncertainty-aware and explainable modeling in twinLab.

📝 You can learn more about kernels in this article.

How the module Predict works under the hood

The Predict module uses the power of GPs along with Bayesian Inference and uncertainty quantification techniques to provide reliable predictions and insights for various predictive modeling tasks.

Data input: Users provide input data containing features (predictors) and corresponding target variables (responses), which is the desired output.
Model training: During training, the GP learns the underlying patterns and relationships in the data.
Prediction: Once the GP model is trained, users can make predictions for new input data points.
Uncertainty quantification: In addition to predicting the mean response, the GP also provides uncertainty estimates for each prediction. This uncertainty quantification is a key feature of GPs, allowing users to assess the reliability of the predictions.
Bayesian inference: Bayesian inference techniques are used to update the model's beliefs about the underlying function based on observed data. This allows the GP to make informed predictions that incorporate both prior knowledge and new evidence.
Optimisation: The Predict module also includes optimisation algorithms to optimise the Gaussian Process model.
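The article does not say which optimisation Predict performs, so the following is only a generic illustration of what "optimising a GP" commonly means: choosing kernel hyperparameters that maximise the log marginal likelihood of the training data. The data, the fixed noise level, and the grid of candidate lengthscales are all assumptions made for this sketch.

```python
import numpy as np

def log_marginal_likelihood(x, y, lengthscale, variance=1.0, noise_var=1e-2):
    """Log marginal likelihood of a zero-mean GP with an RBF kernel."""
    K = variance * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * lengthscale**2))
    K += noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))          # equals -0.5 * log|K|
            - 0.5 * len(x) * np.log(2.0 * np.pi))

# Synthetic data from a smooth function plus noise
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 25)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

# Simple grid search over candidate lengthscales
candidates = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])
scores = np.array([log_marginal_likelihood(x, y, ls) for ls in candidates])
best_lengthscale = candidates[np.argmax(scores)]

print("Best lengthscale on this grid:", best_lengthscale)
```

A grid search keeps the example short; practical implementations usually rely on gradient-based optimisers instead.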

Hands-on example

twinLab empowers its users to use the Predict module seamlessly through a Python interface.

All the underlying technical details required to effectively fit a model or emulator to data are taken care of within the software itself, without compromising the user's ability to tune the software to their specific engineering problems.

1-D Scenario: Training, Prediction, and Uncertainty Quantification

In this example, we will demonstrate how to train an emulator on a dataset with one input variable and one output variable.

The goal is to make accurate predictions for the output variable based on new input values, while also understanding the uncertainty associated with each prediction.

By visualising the predictions and uncertainties, users can gain insights into the behavior of the underlying function and make informed decisions based on the model's outputs.

Set Up

First, we import the required libraries: pprint, numpy, pandas, matplotlib and twinLab.

# Standard imports
from pprint import pprint

# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Project imports
import twinlab as tl

You will need to set up your API key.

⚠️ Remember not to share your API key publicly in your code or any public repositories to maintain security.

api_key = "your_api_key_here"  # Set your API key here
tl.set_api_key(api_key)

Create a Dataset and Upload it to twinLab

In twinLab, datasets must be either in the form of a pandas.DataFrame object or as filepaths pointing to CSV files that can be converted into a pandas.DataFrame.

⚠️ It's important that both formats have clearly labeled columns.

Specifically, the input (predictor) variable should be labeled as x, and the output variable as y. Data in twinLab is expected to follow a column-feature format, where each row represents a single data sample and each column represents a data feature.

→ We start by importing random for generating random numbers. Then, we define the length of the lists to be generated as list_length = 10. Subsequently, we generate a list x containing list_length random floating-point numbers between 0 and 1 using a list comprehension with random.random(), and another list y containing random floating-point numbers between -1 and 1. We create a DataFrame df using pandas, where the 'x' column corresponds to the values in list x and the 'y' column corresponds to the values in list y. Finally, we display the contents of the DataFrame df to visually inspect the generated dataset before further processing.

import random

# Define the length of the lists
list_length = 10

# Generate random numbers for list x
x = [random.random() for _ in range(list_length)]

# Generate random numbers for list y
y = [random.random() * 2 - 1 for _ in range(list_length)]

# Creating the dataframe using the above arrays
df = pd.DataFrame({"x": x, "y": y})

# View the dataset before uploading
display(df)

twinLab offers a Dataset class equipped with attributes and methods for processing, viewing, and summarising datasets. To access datasets in twinLab, they must be created with a dataset_id, which serves as the identifier for accessing them. Datasets can be uploaded into twinLab using the upload method.

→ We define the name of the dataset as "example_data" using the variable dataset_id. Next, we initialise a Dataset object named dataset with the specified dataset identifier. Then, we proceed to upload the dataset represented by the DataFrame df into the twinLab environment using the upload method of the dataset object, with the parameter verbose=True indicating that the upload process should provide detailed feedback.

# Define the name of the dataset
dataset_id = "example_data"

# Initialise a Dataset object
dataset = tl.Dataset(id=dataset_id)

# Upload the dataset
dataset.upload(df, verbose=True)

Output:

Dataframe is uploading.
Processing dataset
Dataset example_data was processed.

Train an emulator

The Emulator class is responsible for training and using surrogate models. Similar to datasets, we assign an identifier to the model, which is how it will be saved in the cloud.

→ We set the name of the emulator as "example_emulator" by assigning it to the variable emulator_id. Then, we create an instance of the Emulator class named my_emulator with the specified identifier. We use the train method of the my_emulator object, providing the dataset we want to train the emulator on, denoted as dataset, along with specifying the input variable "x" and the output variable "y" using the inputs and outputs arguments respectively.

# Define the name of the emulator
emulator_id = "example_emulator"

# Initialise the emulator
my_emulator = tl.Emulator(id=emulator_id)

# Use the train method
my_emulator.train(dataset=dataset, 
                  inputs=["x"], 
                  outputs=["y"])

Prediction Using the Trained Emulator

The surrogate model is now trained and saved to the cloud under the emulator_id. It can now be used to make predictions.

→ We define the inputs for the dataset using NumPy's linspace function, creating an array x_eval containing 128 evenly spaced values between 0 and 1. Subsequently, we convert this array into a DataFrame df_eval with the column labeled as "x", and then display the DataFrame to inspect the generated input data. Next, we use the trained emulator my_emulator to predict the results for the input data contained in df_eval. The predictions are obtained using the predict method of my_emulator, and then stored in predictions. After that, we concatenate the predicted mean and standard deviation values into a single DataFrame result_df along the column axis. Finally, we extract the mean and standard deviation columns from result_df and convert them into NumPy arrays df_mean and df_stdev respectively, before printing the head of result_df to observe the first few rows of the predictions.

# Define the inputs for the dataset
x_eval = np.linspace(0, 1, 128)

# Convert to a dataframe
df_eval = pd.DataFrame({"x": x_eval})
display(df_eval)

# Predict the results
predictions = my_emulator.predict(df_eval)
result_df = pd.concat([predictions[0], predictions[1]], axis=1)
df_mean, df_stdev = result_df.iloc[:,0], result_df.iloc[:,1]
df_mean, df_stdev = df_mean.values, df_stdev.values
print(result_df.head())

The output of the prediction step includes two main components:

Prediction mean: This is the expected or average prediction for each data point.
Prediction standard deviation: This indicates the level of uncertainty associated with each prediction.

Together, these components provide valuable insights into the model's predictions. The prediction mean gives you a central estimate, while the standard deviation offers a measure of the prediction's reliability. A higher standard deviation suggests greater uncertainty, while a lower standard deviation implies a more confident prediction.
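One common way to use the two components together is to turn them into an interval around each prediction. The sketch below uses small stand-in arrays in place of df_mean and df_stdev, and assumes the predictive distribution at each point is Gaussian, so that roughly 95% of the probability lies within 1.96 standard deviations of the mean.

```python
import numpy as np

# Stand-in arrays; in the example above these would be df_mean and df_stdev
mean = np.array([0.10, 0.25, 0.40])
stdev = np.array([0.05, 0.20, 0.02])

# Approximate 95% interval under a Gaussian predictive distribution
z = 1.96
lower = mean - z * stdev
upper = mean + z * stdev

for m, lo, hi in zip(mean, lower, upper):
    print(f"mean = {m:.2f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

The second point has the widest interval because it has the largest standard deviation, which makes it the least certain of the three predictions.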

Viewing the Predictions

Emulator.predict outputs mean values for each input and their standard deviation; this makes it possible to visualise the uncertainty in the results.

→ We use the trained emulator my_emulator to create a plot where the x-axis represents the input variable labeled as 'x' and the y-axis represents the output variable labeled as 'y'. The plot is labeled as "Emulator predictions" using the label parameter. Additionally, we set the x-axis limits to be between 0 and 1 using x_lim=(0,1). Then, we overlay the training data points from the DataFrame df, with the input values from column 'x' and the output values from column 'y', represented as red scatter points. Finally, we add a legend to the plot to distinguish between the emulator predictions and the training data, and then display the plot.

plt = my_emulator.plot(x_axis='x', y_axis='y', label="Emulator predictions", x_lim=(0, 1))
plt.scatter(df['x'], df['y'], color='r', label='Training data')
plt.legend()
plt.show()

output

Sampling from the emulator

The Emulator.sample function can be used to retrieve a number of results from your model. It requires the inputs for which you want the values and how many outputs to calculate for each.

→ We define a set of sample inputs using NumPy's linspace function, creating an array sample_inputs containing 128 evenly spaced values between 0 and 1, and then converting it into a DataFrame with the column labeled as "x". Next, we specify the number of samples to be calculated for each input as 100, stored in the variable num_samples. Using the trained emulator my_emulator, we calculate the samples for the provided input data sample_inputs, generating 100 samples for each input point. The results are stored in the sample_result variable.

# Define the sample inputs
sample_inputs = pd.DataFrame({"x": np.linspace(0, 1, 128)})

# Define number of samples to calculate for each input
num_samples = 100

# Calculate the samples
sample_result = my_emulator.sample(sample_inputs, num_samples)

→ We display the results of the sample calculations, represented as a DataFrame, using the display function. This DataFrame contains the sample outputs generated by the emulator for each input point specified in the sample_inputs. The sample_result DataFrame provides insight into the variation and distribution of the emulator's predictions across the range of input values provided.

# View the results in the form of a dataframe
display(sample_result)
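Samples like these are often summarised point by point. The sketch below does not assume anything about the exact layout of sample_result; it uses a stand-in NumPy array of random numbers in which rows are input points and columns are samples.

```python
import numpy as np

# Stand-in for sampled curves: rows are input points, columns are samples.
# (Random numbers are used here; real samples would come from the emulator.)
rng = np.random.default_rng(3)
n_inputs, n_samples = 128, 100
samples = rng.normal(loc=0.0, scale=0.5, size=(n_inputs, n_samples))

# Summarise the spread of the samples at each input point
sample_mean = samples.mean(axis=1)
sample_std = samples.std(axis=1, ddof=1)
q05, q95 = np.percentile(samples, [5, 95], axis=1)

print(sample_mean.shape, sample_std.shape, q05.shape)
```

The per-point mean and standard deviation computed this way are empirical estimates, so they become more stable as the number of samples grows.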

The results can be plotted, giving a nice visualisation of the sampled data along with the model's uncertainty.

→ We set up parameters for plotting: defining colors for curves and data points, setting the transparency level (alpha_curve), and specifying whether to plot training data (plot_training_data) and model bands (plot_model_bands). If plot_training_data is True, we plot the training data points from DataFrame df with 'x' values against 'y' values. Then, we plot the sample results generated by the emulator for the given input data sample_inputs. We limit the x-axis to the range between 0 and 1 using plt.xlim((0.0, 1.0)), and label the x-axis as " $X$ " and the y-axis as " $y$ ". Finally, we add a legend to the plot and display it. This plot provides a visualisation of both the training data and the samples drawn from the model, allowing for visual assessment of the emulator's performance and the distribution of predicted values across the input range.


# Plot parameters
color_curve = "deepskyblue"
alpha_curve = 0.10
color_data = "red"
plot_training_data = True
plot_model_bands = False

# Plot samples drawn from the model
if plot_training_data:
    plt.plot(df["x"], df["y"], ".", color=color_data, label="Training data", markersize=10)
plt.plot(sample_inputs, sample_result["y"], color=color_curve, alpha=alpha_curve)
plt.xlim((0.0, 1.0))
plt.xlabel(r"$X$")
plt.ylabel(r"$y$")
plt.legend()
plt.show()

output2

Deleting emulators and datasets

With my_emulator.delete(), we remove the emulator object that we previously created. This action ensures that any resources associated with this emulator, such as trained models or metadata, are cleared from memory or storage.

Using dataset.delete(), we remove the dataset object that we created and used for training the emulators. By deleting this dataset, we free up resources and prevent it from being accessible for further analysis or training.

Deleting these objects allows us to perform cleanup tasks to ensure that no unnecessary resources are retained in memory or storage after we've finished using the emulators and dataset. This practice helps optimise resource usage and prevents memory leaks.

# Delete dataset
dataset.delete()

# Delete emulator
my_emulator.delete()

Emulators in higher dimensions

In many real-world situations, there's not just one factor influencing an outcome – there are often several.

Imagine you're trying to predict something like temperature, but you know it's not just one thing that affects it – it could be humidity, time of day, and more.

Emulating these relationships means capturing how all these factors together influence the outcome.

twinLab recognises this complexity and makes it easy to handle. With just a small tweak to the code we've already seen, you can model the relationships between multiple input variables.

The only portion of the code that needs to be modified is the following:

# Use the train method
my_emulator.train(dataset=dataset, 
                  inputs=["x0", "x1", ...], 
                  outputs=["y"])

Let's break this down.

my_emulator.train(): This is the function used to train the emulator, just like before.
dataset: This is your dataset, which now includes multiple input variables (like humidity, time of day, etc.) as well as the output variable you're trying to predict (like temperature).
inputs: Here, you specify all the input variables you want the emulator to consider when making predictions. You list them inside the square brackets, separating each variable's name with a comma.
outputs: This is the output variable you're trying to predict, just like before. If you're trying to predict more than one thing, make sure to list all those outputs in the 'outputs' parameter. So, if you're trying to predict temperature, humidity, and wind speed, for example, you'd list all three variables inside the 'outputs' parameter.
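To picture what such a dataset looks like, here is a sketch that builds a DataFrame with two input columns and one output column. The column names and the relationship between them are hypothetical, chosen to match the temperature example; only the lists of names at the end correspond to the inputs and outputs arguments discussed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_rows = 50

# Hypothetical column names and a made-up relationship, for illustration only
humidity = rng.uniform(20.0, 90.0, size=n_rows)
hour = rng.uniform(0.0, 24.0, size=n_rows)
temperature = 15.0 + 8.0 * np.sin((hour - 9.0) * np.pi / 12.0) - 0.05 * humidity

# Column-feature format: one row per sample, one column per variable
df_multi = pd.DataFrame({"humidity": humidity, "hour": hour, "temperature": temperature})

# These lists are what would be passed as `inputs` and `outputs` to train()
inputs = ["humidity", "hour"]
outputs = ["temperature"]

print(df_multi[inputs + outputs].head())
```

Uploading this DataFrame and training on it would then follow the same steps as in the 1-D example, with the two lists passed to the train method.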

GPs are known for their ability to smoothly interpolate between data points, meaning they can provide predictions for points in the input space even if they were not directly observed in the training data. This interpolation capability allows GPs to provide reliable predictions even in scenarios where data observations are sparse or noisy. The model can effectively fill in the gaps between observed data points, resulting in a comprehensive understanding of the underlying function's behavior.

Resources

The complete code can be downloaded from the resource panel at the top of this article.

Conclusion

The Predict module within twinLab is a powerful tool that provides the user with a smart way to deal with uncertainty while making predictions accurately.

By seamlessly integrating Gaussian Processes and Bayesian Inference, Predict empowers users to gain actionable insights while minimising the complexities traditionally associated with predictive modeling, without the need for extensive programming expertise.

Its intuitive interface and user-friendly design streamline the modeling process, allowing users to focus on deriving meaningful insights from their data.

Take the next step and experience the unparalleled capabilities of twinLab. Dive in and discover how Predict can revolutionise your approach to machine learning and decision-making.
