Introduction
In the previous article, we built a simple neural network that had a layer of input nodes and a layer of output nodes. Deep learning models are called deep because they have many layers of nodes. In this article, we'll make our model slightly deeper by adding a hidden layer of nodes between the input and output layers. We'll then explore what this hidden layer is doing.
Adding a Hidden Layer
To add a hidden layer, we need to create two matrices instead of one. The first matrix connects the input layer to the hidden layer, and the second matrix connects the hidden layer to the output layer. Each matrix is trained in the same way as before, using backpropagation to adjust the weights based on the error in the output.
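The two-matrix setup can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's exact implementation: it assumes a vocabulary of 5 words, a hidden layer of size 2, one-hot inputs, a linear hidden layer, a sigmoid output, and squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 2                                # vocabulary size, hidden layer size
W1 = rng.normal(scale=0.1, size=(V, H))    # input -> hidden matrix
W2 = rng.normal(scale=0.1, size=(H, V))    # hidden -> output matrix

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Propagate a one-hot input through both weight matrices."""
    h = x @ W1           # hidden activations (kept linear for simplicity)
    y = sigmoid(h @ W2)  # output activations
    return h, y

def train_step(x, target, lr=0.5):
    """One step of backpropagation, adjusting both matrices."""
    global W1, W2
    h, y = forward(x)
    # Gradient of the squared error through the sigmoid output.
    delta_out = (y - target) * y * (1.0 - y)
    grad_W2 = np.outer(h, delta_out)
    # The output error flows backwards through W2 to the hidden layer.
    delta_hidden = W2 @ delta_out
    grad_W1 = np.outer(x, delta_hidden)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# Train on one input/target pair: word 0 should predict word 3.
x = np.eye(V)[0]
target = np.eye(V)[3]
before = forward(x)[1][3]
for _ in range(1000):
    train_step(x, target)
after = forward(x)[1][3]
print(before, after)  # the output for word 3 rises with training
```

The key difference from the single-matrix network is the `delta_hidden` line, where the output error is propagated backwards through the second matrix so the first matrix can be updated too.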
This is what the network looks like after 100 000 steps of training. Blue indicates a positive weight, and red indicates a negative weight. Click on an input word to see how the activations flow through the network.
Network size
While the network might look bigger with the extra layer, it actually has fewer weights than before. The previous network consisted of a single 5 x 5 matrix, giving 25 values. This network has a 5 x 2 matrix and a 2 x 5 matrix, giving a total of 20 values. (In practice, this setup has 7 bias values too, but we'll ignore those for simplicity.)
This might seem like a small difference, but as we increase the number of words in our vocabulary, the difference becomes more significant. The first network has V x V weights to learn, where V is the vocabulary size. The second network has 4 x V weights to learn. For a vocabulary of 10 000 words, that's 100 million weights versus 40 000 weights. This assumes we keep the hidden layer size fixed at 2, but even if we increase it to 100, that's still only 2 million weights to learn. We'll discuss the impact of changing the hidden layer's size below.
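The weight counts above can be verified with a quick calculation. This sketch assumes no bias terms, matching the simplification made earlier:

```python
def single_matrix_weights(V):
    """Weights in the original network: one V x V matrix."""
    return V * V

def hidden_layer_weights(V, H):
    """Weights with a hidden layer: a V x H matrix plus an H x V matrix."""
    return V * H + H * V

V = 10_000
print(single_matrix_weights(V))       # 100 000 000
print(hidden_layer_weights(V, 2))     # 40 000
print(hidden_layer_weights(V, 100))   # 2 000 000
```

Note that the hidden-layer count grows linearly in V rather than quadratically, which is why the gap widens as the vocabulary grows.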