Tiny Language Model

Understanding large language models by building a tiny one

Introduction

The aim of this series of articles is to understand how large language models (LLMs) work by building a very simple one. I've found many articles and videos explaining how neural nets work, or how LLMs work in theory, but I haven't found any that show exactly what an LLM is doing. That's probably because real LLMs are so complex that no one really understands exactly what they're doing. I'm hoping that if I keep things really simple, I can build a tiny language model that is fully understandable. In particular, I want to understand how attention mechanisms and transformers work, since these are the key innovations that make LLMs so powerful.

In this first article, we'll cover tokenisation, a simple neural network, and one-hot encoding.

Some simple sentences about sheep

For this exploration of language models, I'm going to imagine there's a group of primitive cave people trying to describe their world. As they learn to speak, we will create an AI that learns to use their language. The language they use will be a stripped-down version of English, with a very limited vocabulary and simple grammar.

In the beginning, our cave people look out at the world and see some sheep. They describe them with two sentences. These are the only possible sentences they can say.

  • Sheep are herbivores
  • Sheep are slow

You could argue that 'herbivores' is not a particularly simple word, but the actual words don't matter. The point is that there are only two possible sentences, and both have a simple Subject - Verb - Object structure. How can we make an AI that learns to generate these sentences?

Tokenisation

In order to generate a sentence, we will create a neural network that learns to predict the next word in a sentence. The first step is to split the sentences into tokens. For this simple model, a token is just a word. I've also added a special token to indicate the start or end of a sentence, which will be useful later. I'm using <BR> for this token, but it doesn't matter what it is as long as it won't appear in a normal sentence.

The sentences become lists of tokens:

  • <BR>, sheep, are, herbivores, <BR>
  • <BR>, sheep, are, slow, <BR>

In a real LLM, words get broken down into smaller tokens. For example, 'herbivores' might be split into two tokens 'herbivore' and 's', allowing the LLM to more easily understand that 'herbivores' is the plural of 'herbivore'.
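The word-level tokenisation above can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual code; the function and variable names are my own.

```python
# Minimal word-level tokeniser: one token per word, with <BR> marking
# the start and end of each sentence. (Illustrative sketch only.)

def tokenise(sentence):
    """Split a sentence into lowercase word tokens, wrapped in <BR>."""
    return ["<BR>"] + sentence.lower().split() + ["<BR>"]

sentences = ["Sheep are herbivores", "Sheep are slow"]
token_lists = [tokenise(s) for s in sentences]

# The vocabulary is the set of all distinct tokens, sorted for a
# stable ordering. <BR> sorts first because '<' precedes letters.
vocab = sorted({t for tokens in token_lists for t in tokens})
# vocab == ["<BR>", "are", "herbivores", "sheep", "slow"]
```

Note that this ordering of the vocabulary matches the row and column order of the weight matrix shown later.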

Neural network

The purpose of the neural network is to take in one word (token) at a time and try to predict the next word. So we have one input node for every token and one output node for every token. To feed in a token, we set its input node to 1 and all the other input nodes to 0; this is the one-hot encoding mentioned earlier. To start with, we'll create the simplest possible model, which is a single matrix mapping the inputs to the outputs.
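Here is a sketch of what the one-hot encoding and the single-matrix model look like in code. The names are illustrative, not taken from the article's code, and the matrix here is still untrained (all zeros).

```python
# One-hot encoding and the single-matrix model (illustrative sketch).

vocab = ["<BR>", "are", "herbivores", "sheep", "slow"]

def one_hot(token):
    """Encode a token as a vector with a 1 in its own position."""
    return [1.0 if t == token else 0.0 for t in vocab]

# The whole model: a 5x5 matrix, with W[i][j] connecting input token i
# to output token j. Untrained, so all weights start at zero.
W = [[0.0] * len(vocab) for _ in range(len(vocab))]

def predict_scores(token):
    """Multiply the one-hot input vector by the matrix. Because only
    one input element is 1, this simply selects one row of W."""
    x = one_hot(token)
    n = len(vocab)
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(n)]
```

Because the input is one-hot, multiplying by the matrix just picks out one row, which is why each row of the trained matrix can be read directly as "the scores for what follows this token".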

I won't explain how the neural network learns in detail here, but it involves adjusting the weights of the connections between nodes (values in the matrix) based on errors in its predictions. There are many good explanations online, such as 3Blue1Brown's Neural Networks series. My code for this example is available on Github.
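To make "adjusting the weights based on errors" a little more concrete, here is one way such a matrix could be trained: softmax over the output scores, then gradient descent on the cross-entropy loss. This is a hedged sketch of the standard technique, not necessarily how the article's Github code does it, and the learning rate and repetition count here are my own choices.

```python
import math

# Training sketch: softmax + cross-entropy gradient descent on a
# single weight matrix. (Assumed setup; names are illustrative.)

vocab = ["<BR>", "are", "herbivores", "sheep", "slow"]
sentences = [
    ["<BR>", "sheep", "are", "herbivores", "<BR>"],
    ["<BR>", "sheep", "are", "slow", "<BR>"],
]
# Training data: every (current token, next token) pair.
pairs = [(vocab.index(a), vocab.index(b))
         for s in sentences for a, b in zip(s, s[1:])]

W = [[0.0] * len(vocab) for _ in range(len(vocab))]
learning_rate = 0.1

for _ in range(2000):  # repeat over the training pairs
    for i, target in pairs:
        # The one-hot input selects row i of W as the output scores.
        logits = W[i]
        # Softmax turns scores into a probability distribution.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Cross-entropy gradient is (probability - one-hot target);
        # only row i receives an update, since the input is one-hot.
        for j in range(len(vocab)):
            y = 1.0 if j == target else 0.0
            W[i][j] -= learning_rate * (probs[j] - y)
```

After training, each row scores the token(s) that actually follow it highest: the 'sheep' row favours 'are', and the 'are' row splits its score between 'herbivores' and 'slow'.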

Result

After training on these two sentences for 10,000 repetitions, the neural network is a single matrix that looks like this. I've coloured positive values blue and negative values red.

  Input \ Output     <BR>       are  herbivores    sheep     slow
  <BR>            -4.0494   -3.7335     -3.0846   4.2455  -4.0111
  are             -4.2853   -4.8309      1.3934  -5.3217   1.4553
  herbivores       4.9230   -3.4498     -2.8003  -2.8773  -2.8303
  sheep           -5.1692    3.9837     -3.2771  -3.3501  -2.6211
  slow             5.1529   -2.4297     -2.5776  -2.4191  -2.9745

We can draw the neural network as a set of input nodes and output nodes with connections between them. If we just draw the positive weights, it looks like this:

[Diagram: the five input nodes (<BR>, are, herbivores, sheep, slow) on the left, connected to the same five output nodes on the right by the positive weights]

To see what the model predicts, we look at the input node for the current token, then follow one of its positive connections to an output node; that output token is the predicted next word.
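Using the trained weights from the table above, "follow a positive connection" can be sketched as a short generation loop. The function name and the random choice among positive connections are my own framing of the article's description.

```python
import random

vocab = ["<BR>", "are", "herbivores", "sheep", "slow"]
# Trained weights from the table above (rows = input token,
# columns = predicted next token).
W = [
    [-4.0494, -3.7335, -3.0846,  4.2455, -4.0111],  # <BR>
    [-4.2853, -4.8309,  1.3934, -5.3217,  1.4553],  # are
    [ 4.9230, -3.4498, -2.8003, -2.8773, -2.8303],  # herbivores
    [-5.1692,  3.9837, -3.2771, -3.3501, -2.6211],  # sheep
    [ 5.1529, -2.4297, -2.5776, -2.4191, -2.9745],  # slow
]

def generate():
    """Starting from <BR>, repeatedly pick a positive connection to
    the next token, stopping when we reach <BR> again."""
    token = "<BR>"
    words = []
    while True:
        row = W[vocab.index(token)]
        # Candidate next tokens are the outputs with positive weight.
        candidates = [i for i, w in enumerate(row) if w > 0]
        token = vocab[random.choice(candidates)]
        if token == "<BR>":
            return " ".join(words)
        words.append(token)
```

Every run produces one of the two training sentences: only 'sheep' is positive after <BR>, only 'are' after 'sheep', and 'are' leads to either 'herbivores' or 'slow'.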


Conclusion

We created a simple neural network that consisted of a single matrix. It was able to learn to generate two simple sentences about sheep. In the next article, we'll look at how to represent words with vectors.