
by Prof Tim Dodwell

Updated 23 September 2023

Understanding Transformers: One-Hot Encoders and First Order Sequence Models


In this series of blog posts we start to break down the powerful tool of Transformers, the underpinning technology which has catapulted Large Language Models (LLMs) from a machine learning research field into the public spotlight!

Before we delve into the individual components of the Transformer model, it's crucial to first establish a foundational understanding of some key concepts. These include one-hot encoders, sequence models and the concept of attention.

In this short explainer we fly through one-hot encoders and build our first language model using what is called a first-order sequence model!

What is One-Hot Encoding?


In many real-life situations, the data we deal with isn't inherently numeric. Unlike images (where we work directly with pixel intensities) or audio signals (which we can represent as oscillograms or spectrograms), text arrives as symbols rather than numbers. The initial step in our process, therefore, is transforming all the words into numerical values to facilitate mathematical operations. This is the process of one-hot encoding.

Ok, so how does it work? Let's look at an example and "one-hot encode" the following sentence:

"It was the best of times, it was the worst of times."

We start by choosing our vocabulary: the collection of words that we are going to consider in our model. First we need to "tokenise" the string. This is the process of breaking the continuous string of text down into words, in a way which knows how to identify punctuation. For this we use nltk, the Natural Language Toolkit in Python (super useful!). So let's do that:

import numpy as np
import nltk

# Define the text
text = "It was the best of times, it was the worst of times."

# Tokenize the text into words (lower-cased), keeping punctuation as separate tokens
nltk.download('punkt')
words = nltk.word_tokenize(text.lower())

Printing the result with print(words) gives us

>>> ['it', 'was', 'the', 'best', 'of', 'times', ',', 'it', 'was', 'the', 'worst', 'of', 'times', '.']

Each unique word (plus each punctuation token) then gets a unique integer, so we first extract the vocabulary with

unique_words = np.unique(words)

which gives us the vocabulary in sorted order (np.unique sorts its output):

>>> [',' '.' 'best' 'it' 'of' 'the' 'times' 'was' 'worst']

One-hot Encoding - Create a mapping from words to indices

In one-hot encoding, the unique integer assigned to each word is used as an index, giving a representation which in many applications is much more computer-friendly. Each symbol is represented by an array of zeros whose length equals the size of the vocabulary, with a single one placed at the position given by the word's index. For example, with a vocabulary of 9 symbols, a word with index 5 is represented by the vector [0, 0, 0, 0, 0, 1, 0, 0, 0].

So how do we do this on a computer? Let's first automate the process of mapping a word to an index.

word_to_index = {word: i for i, word in enumerate(unique_words)}

In Python, dictionaries are really nice for this string-to-integer indexing. They allow me to do things like word_to_index["it"] and get back the unique index identifier, in this case 3.

Ok, so let's look at that mapping using print(word_to_index).
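Since np.unique returns the vocabulary in sorted order, the mapping looks like this (formatting tidied up a little):

>>> {',': 0, '.': 1, 'best': 2, 'it': 3, 'of': 4, 'the': 5, 'times': 6, 'was': 7, 'worst': 8}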

So the sentence "it was the best of times, it was the worst of times." becomes a sequence of one-dimensional arrays which, when stacked together, form a two-dimensional array (or matrix).

Each column of this matrix represents one word of the sentence; the row containing the "1" identifies which word it is, using the index defined by word_to_index.
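As a minimal sketch (reusing the words, unique_words and word_to_index defined above), this one-hot matrix can be built like so:

# Build the one-hot matrix: one row per vocabulary entry, one column per word in the sentence
one_hot = np.zeros((len(unique_words), len(words)))
for j, word in enumerate(words):
    one_hot[word_to_index[word], j] = 1.0

print(one_hot.shape)  # (9, 14): vocabulary size by sentence length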

First Order Sequence Models for Language


Ok, we are going to dive in now and look at our first language model. It is called a first-order sequence model, and it is a member of a more general class of models called Markov chain models. So let us first draw a transition model for our data.

Figure 1. Graphical representation of a first-order sequence model for our limited text data example

This is a graphical representation of our data: given we are currently on a word, it shows the probability with which we transition to each other word in the vocabulary. The probabilities are calculated from the data; they are what we call "conditional probabilities". Given we have observed "word 1", what is the probability that the next word is "word 2"? We write conditional probabilities in statistics as follows

\pi(\text{was} \mid \text{It}) \quad \text{or} \quad \pi(\text{best} \mid \text{the}).

So in words this says: given the last word was "It", what is the probability that the next word is "was"; and given the last word was "the", what is the probability that the next word is "best". Now, given our limited data set, we see that the word "It" is always followed by "was", hence the transition probability is one, \pi(\text{was} \mid \text{It}) = 1. If we look at \pi(\text{best} \mid \text{the}), in the data "the" is followed either by "best" or by "worst" (each occurring once). Therefore, \pi(\text{best} \mid \text{the}) = 0.5.

We are now in a place where we can start to "generate" text by simulating this Markov chain. To do this we introduce a concept called a transition matrix. Let us write down how to compute it for this system, and then unpick what it means.

The transition matrix is an N by N matrix, where N is the size of the vocabulary. In general the transition matrix is a sparse matrix, as not all words naturally follow each other. This is how we can compute it.


def build_transition_matrix(words, unique_words, word_to_index):
    # Create a transition matrix filled with zeros
    num_states = len(unique_words)
    transition_matrix = np.zeros((num_states, num_states))

    # Iterate through the text to fill the transition matrix
    for i in range(len(words) - 1):
        current_word = words[i]
        next_word = words[i + 1]
        current_word_index = word_to_index[current_word]
        next_word_index = word_to_index[next_word]
        transition_matrix[current_word_index][next_word_index] += 1
    
    # Normalize the rows of the transition matrix to get probabilities
    row_sums = transition_matrix.sum(axis=1, keepdims=True)
    row_sums[np.where(row_sums == 0)] = 1

    return transition_matrix / row_sums
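As a quick check on our small example (a minimal sketch using the function above), we can build the matrix and print the row associated with "the":

transition_matrix = build_transition_matrix(words, unique_words, word_to_index)

# Row of next-word probabilities for "the" (index 5 in the sorted vocabulary)
print(transition_matrix[word_to_index["the"]])

which should give something like

>>> [0.  0.  0.5 0.  0.  0.  0.  0.  0.5]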

If we look at an individual row like this, each of the columns gives us the probability of the word which will follow. Here I have picked out the 6th row (index 5), as this is associated with the word "the".

So here we see that the probability of next getting "best" or "worst" is equal at 0.5, with all the others at 0.

Using this transition matrix, we can then "sample" the next word from these probabilities, using the following bit of code.

def generate_next_word(word, transition_matrix, word_to_index, unique_words):
    # Row of the transition matrix for the current word
    idx = word_to_index[word]

    # Columns (next words) with non-zero probability, and those probabilities
    cols = np.where(transition_matrix[idx] > 0.0)[0]
    probs = transition_matrix[idx][cols]

    # Record the candidate next words alongside their probabilities
    info = [(unique_words[c], probs[i]) for i, c in enumerate(cols)]

    # Sample the next word according to the transition probabilities
    return np.random.choice(unique_words[cols], p=probs), info

In this bit of code, np.random.choice(unique_words[cols], p=probs) samples from the vocabulary with the probabilities given by the transition matrix.

Ok let us do this for this simple example and see what we get (it won't be that exciting)!
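As a rough sketch (reusing the generate_next_word helper above), we can seed the chain with "it" and keep sampling until we draw a full stop. The exact sentence you get depends on the random draws:

sentence = ["it"]
while sentence[-1] != ".":
    # Sample the next word given the last word generated so far
    next_word, _ = generate_next_word(sentence[-1], transition_matrix, word_to_index, unique_words)
    sentence.append(next_word)

# Join the tokens back together (punctuation spacing tidied by hand)
print(" ".join(sentence))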

"it was the worst of times, it was the worst of times."

A very negative language model (by chance), but you get the idea. There are clearly going to be limitations when we extend this, so let us look at a bigger example and unpick these limitations.

A Bigger Example


In the previous example we demonstrated the approach, but clearly the vocabulary we consider is very limited, and the generated sentences are therefore limited too. We can of course process larger amounts of text data and see how we do. In this case we use the text from an early article written on our blog when we first set up digiLab; you can find the text here, and feel free to use it to try this text model out. We put the text in a file, and then follow the same process as above, just with more data.

import numpy as np
import nltk

# Specify the path to the text file
file_path = "sample_text.txt"

# Read the text from the file
with open(file_path, "r") as file:
    text = file.read()

# Tokenize the text into words, considering punctuation
nltk.download('punkt')
words = nltk.word_tokenize(text)
unique_words = np.unique(words)

# Create a mapping from words to indices
word_to_index = {word: i for i, word in enumerate(unique_words)}

# Build the transition matrix from the tokenised text
transition_matrix = build_transition_matrix(words, unique_words, word_to_index)

So first, let's take the word "is" (the word which follows my surname in the text) and look at which words could come next, together with their probabilities.
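Using the generate_next_word helper from before, this is roughly:

_, info = generate_next_word("is", transition_matrix, word_to_index, unique_words)
print(info)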

[(',', 0.05), ('Co-founder', 0.1), ('a', 0.15), ('already', 0.05), ('designed', 0.05), ('dosed', 0.05), ('expensive', 0.05), ('gold', 0.05), ('high', 0.05), ('how', 0.05), ("n't", 0.05), ('probabilistic', 0.05), ('really', 0.05), ('secured', 0.05), ('sparse', 0.05), ('take', 0.05), ('the', 0.05)]

Let's now generate 10 words starting from "Dodwell" using the following few lines of code.

sentence = ["Dodwell"]
num_words = 10
for i in range(num_words):
    next_word, info = generate_next_word(sentence[-1], transition_matrix, word_to_index, unique_words)
    sentence.append(next_word)

this is what I get

>>> ['Dodwell', 'is', 'already', 'optimising', 'approaches', '.', 'We', 'enable', 'them', 'precisely', 'where']

No Shakespeare, and quite abstract, but it sort of follows a possible sentence. Let's sample it again:

['Dodwell', 'is', 'probabilistic', 'forecasting', '.', 'What', 'we', 'are', 'nuclear', 'reactors', '.']

Ok . . . I do a fair bit of probabilistic forecasting in my job, but the second part is gibberish.

Typically we see poor performance from these first-order sequence models. Whilst they are fun to start playing with, it is clear that a model which generates text using only information from the previous word will not generate good text. Words like "is" follow many phrases in English, and if we use "is" to explain something, we need to connect the first part of a sentence to its end. The interactions between words can span much longer ranges than a single step: "She is . . ." starts a sentence which is clearly referring to someone mentioned previously.

This concept of context really drives modern machine learning models for language: we need to put our attention on the previous parts of the text which govern what should follow.

Round up and What next?


We have explored the idea of one-hot encoding, where we convert categorical data (in this case text) into numerical data. This is an essential step in many machine learning algorithms, since a numerical representation enables us to perform numerical operations. Furthermore, this integer representation enables us to manipulate lists of words in our chosen programming language.

Using this representation we then built a first-order sequence model by computing a transition matrix for the language. Whilst this is our first model for language, we have seen that the generated text is rather limited. We will use these ideas as the basis for addressing the limitations of a first-order sequence model, which takes us on a journey to transformers and therefore to "How does ChatGPT work?".

Prof Tim Dodwell
Co-Founder & CEO at digiLab | Turing AI Fellow | Prof of Machine Learning
Tim co-founded digiLab and leads with enthusiasm and integrity to prototype mathematical solutions at speed to solve industry challenges. Tim has an international reputation in machine learning. Between his digiLab and academic responsibilities, he still finds time to run, time trial on his bike and enjoy his delightful young family.
