Introduction to Neural Networks

Programming Lab 12: Recurrent Neural Networks


This week we explored recurrent neural networks. This class of networks has a number of potential architectures:


Fig 1. This figure is obtained from Andrej Karpathy's blog post and diagrams a variety of neural network architectures. Inputs are in red, hidden layer(s) in green, and outputs are in blue. Arrows between boxes denote transformations (usually matrix multiplication via weight matrices). The left-most diagram denotes standard networks that lack temporal structure. The remaining diagrams depict the various possible architectures of recurrent neural networks. In the one-to-many structure, a singular input produces a sequence of outputs, whereas the many-to-one takes numerous inputs and produces a singular output. Note an "RNN" that takes one input and produces one output (but has many steps between hidden states) is actually a traditional one-to-one network.

The great strength of recurrent neural networks (RNNs) is that they can model temporally-dependent processes, and they do so using knowledge of the hidden state at the previous timestep (\(t-1\)) to produce an output at the current timestep (\(t\)). For example, if an RNN is presented with sequential inputs (e.g. words in a sentence), it can learn temporal associations between inputs (i.e. which words are used together). This knowledge is passed through the hidden states (\(h\)). Such an architecture is demonstrated in Figure 2, where the network receives a word (input) at each timestep and must produce the next word in the sequence (output):


Fig 2. An example of an RNN receiving an input word and producing an output word at each timestep. In this example, the input sequence is "the quick brown fox jumped over...", and the target output sequence is "quick brown fox jumped over the...". Note that the target output sequence is one word ahead of the inputs, and thus this RNN is predicting the subsequent word based on the input word, and the history of previous words.

As another example, we can use RNNs to model how the brain might control behaviours that evolve over time, such as how neural activity in the motor cortex might produce muscular activity that moves the arm during a reaching movement (see this paper). Thus, RNNs extend the behaviours that can be modeled with neural networks beyond the classification tasks we have previously discussed.

Backpropagation through time

As in convolutional neural networks (CNNs), we use backpropagation to train the parameters (or weights) of RNNs. Because of the temporal/sequential aspect of RNNs, backpropagation in RNNs is often called backpropagation through time (BPTT). The main difference between BPTT and the backpropagation we used in one-to-one networks is that, at each timestep, we must account for the loss incurred at subsequent timesteps, because information is passed along to the future. Put another way, if you misinterpreted a word early in a sentence, your understanding of the following words would change; the error is driven by that early misunderstanding, not by a failure to understand each word individually. BPTT addresses these kinds of dependencies.
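
Concretely (this formula is not spelled out in the lab, but follows from the description above): if \(L_t\) is the loss at timestep \(t\) and \(h_t\) is the hidden state, the total loss over a sequence of length \(T\) is \(L = \sum_{t=1}^{T} L_t\), and BPTT computes the gradient with respect to the recurrent weights \(W\) by summing contributions that flow back through every earlier hidden state:

\[
\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W}
\]

The product of Jacobians \(\prod_{j} \partial h_j / \partial h_{j-1}\) is what can shrink towards zero (vanishing gradients) or grow without bound (exploding gradients) over long sequences, which is why we use gradient clipping and a modest sequence length later in this lab.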

Getting started

In this lab we will be training an RNN to learn associations between words, as in Figure 2. We will be using PyTorch to create a simple RNN that learns word-level associations. Because we will be using a relatively simple, single-layer RNN, it may or may not generate plausible phrases after training. More complex models (e.g. more hidden units, more layers, a different RNN architecture, etc.) would produce more realistic phrases. We encourage you to try implementing more complex models after walking through the lab (see LSTMs).

To start, we will import the necessary libraries:

import os
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from argparse import Namespace
# PyTorch things:
import torch
import torch.nn as nn

Finally, we will define a number of network parameters now, such that we can more easily adjust them to understand how changes in these parameters will influence network performance:

OPTS = Namespace(
    train_file = 'lyrics_PhilCollins.txt',	# text file we will get word data from
    num_epoch = 50,				# number of times the network will read through the full file during training
    seq_size = 20,				# number of words in each sequence (one full pass through the RNN)
    batch_size = 16,				# how many sequences we will run through before calculating loss and updating weights
    num_layers = 1,				# how many hidden layers
    embedding_size = 64, 			# dimensions in the embedding space (see class RNN for more info)
    hidden_size = 64,				# how many hidden units
    nonlinearity = 'tanh',			# activation function applied to hidden units
    learning_rate = 0.01,			# hard-coded learning rate for weight updates
    gradients_norm = 5, 			# gradient clipping threshold
    sampling_freq = 20,				# sample phrases from the network every x training iterations
    predict_top_k = 5, 				# when we sample from network during training, choose the next word from the top K matching words (k=1 will produce text very similar to training data, k>1 will explore associations)
)
What some of these options are used for might not be clear to you just yet (e.g. predict_top_k), and so we will discuss some of them later. Feel free to change these parameter options during this lab and test your changes.

Reading in a text file

We will be making an RNN that learns word-level associations by reading through a plain-text file. Much of the code is inspired by Trung Tran's excellent blog post, and the code to read in the text file and split it into batches is taken directly from his example.
# function to read in and parse text file
def readTextData(train_file, batch_size, seq_size):
    # read in the text file
    with open(train_file, 'r') as f:
        text = f.read()
    text = text.split()

    # parse words in the text data file
    word_counts = Counter(text)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create mappings between words and integer indices
    int_to_vocab = {k: w for k, w in enumerate(sorted_vocab)}
    vocab_to_int = {w: k for k, w in int_to_vocab.items()}
    # count the number of unique words
    n_vocab = len(int_to_vocab)
    print('Vocabulary size: {} unique words'.format(n_vocab))

    int_text = [vocab_to_int[w] for w in text]	# associate each word with a number
    num_batches = int(len(int_text) / (seq_size * batch_size)) # count the number of batches based on our sequence size and batch size
    in_text = int_text[:num_batches * batch_size * seq_size] # trim the data so it divides evenly into batches of seq_size-length sequences
    out_text = np.zeros_like(in_text)	# create the target output text vectors, which are "one ahead" of the input vectors
    out_text[:-1] = in_text[1:]
    out_text[-1] = in_text[0]
    in_text = np.reshape(in_text, (batch_size, -1))
    out_text = np.reshape(out_text, (batch_size, -1))
    return int_to_vocab, vocab_to_int, n_vocab, in_text, out_text

Next, we will load in our text data from a .txt file. Here are some plain text files you can use for training. Please change OPTS.train_file to be the path to one of these .txt files. We will load one of these files, and display one input and target output sequence from the file. Note that the target sequence is "one-ahead" of the input sequence. Again, this is because the output at each timestep will be trained to be the input at the next timestep.

# get text data
int_to_vocab, vocab_to_int, n_vocab, in_text, out_text = readTextData(OPTS.train_file, OPTS.batch_size, OPTS.seq_size)
# look at one input sequence, and the target sequence it will be tested against (check input is previous output):
print('{:15s} | {:15s}'.format('INPUT','TARGET OUTPUT'))
for ii in range(OPTS.seq_size):
	input_word = int_to_vocab[in_text[0][ii]]
	target_output = int_to_vocab[out_text[0][ii]]
	print('{:15s} | {:15s}'.format(input_word, target_output))

Note that the vocabulary size is different for different files, so training with some files might be faster than with others.

We next want a function that creates a batch generator, allowing us to easily loop through training iterations and apply weight updates. Each batch contains a number of sequences (set by OPTS.batch_size). For each batch, the network is presented with each sequence, and the outputs from each sequence are evaluated against their respective target sequences. We then compute the loss with respect to all weights, averaging across the sequences in the batch, and perform one set of weight updates. If our batch included every sequence, a single weight update would improve the output, on average, for all sequences. Increasing the batch size can thus make training faster (up to a point; see the week 11 video lectures, which discuss mini-batches).

In contrast, increasing the sequence size can make training harder as it will increase the probability of exploding or vanishing gradients. While we can avoid this by making the sequence length shorter, note that this will mean our network cannot learn long-term dependencies between words (i.e. ideas conveyed by the text may span multiple sequences).

# function to return batch generator from input text data
def get_batches(in_text, out_text, batch_size, seq_size):
    num_batches = np.prod(in_text.shape) // (seq_size * batch_size) # // is floor division operator
    for i in range(0, num_batches * seq_size, seq_size):
        yield in_text[:, i:i+seq_size], out_text[:, i:i+seq_size] # return generator (i.e. can only be looped through once before we need to create a new generator)
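
As a quick check (this snippet is not part of the original lab code), we can create a generator and peek at the first batch; both arrays should have shape (batch_size, seq_size):

# peek at the first batch to confirm its shape
batches = get_batches(in_text, out_text, OPTS.batch_size, OPTS.seq_size)
x, y = next(batches)
print(x.shape, y.shape)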

Make an RNN class for use with PyTorch

For PyTorch, we must define a class for our network. This class must contain two functions: an initialization function that produces the scaffolding of our network (i.e. layers, layer sizes, output layers); and a forward pass function that accepts our inputs and the hidden layer at the previous timestep, and produces the output of the network and the hidden layer at the current timestep.
class RNN(nn.Module):
	def __init__(self, n_vocab, seq_size, embedding_size, hidden_size):
		super(RNN, self).__init__()
		self.seq_size = seq_size
		self.hidden_size = hidden_size
		# We will include an embedding layer in our RNN (instead of 1-hot input vectors for words)
		self.embedding = nn.Embedding(n_vocab, embedding_size) # lookup table from indices to vectors (words will be represented by a vector of numbers)
		# Build a vanilla RNN, not using LSTM here (which would perform better)
		self.rnn = nn.RNN( embedding_size, hidden_size, batch_first=True, nonlinearity = OPTS.nonlinearity, num_layers = OPTS.num_layers)
		# convert hidden layer state into output via linear transform: output = A(h_t) + biases
		self.dense = nn.Linear(hidden_size, n_vocab, bias=True) # hidden_size is input, n_vocab is output (word classification)

	# for RNNs in pytorch, we also need to define the forward pass:
	def forward(self, x, prev_hidden_state):
		# receive input (x) and previous hidden state (prev_hidden_state)
		embed = self.embedding(x) # lookup and return vector of word in embedding space
		output, current_hidden_state = self.rnn(embed, prev_hidden_state) # not doing truncated BPTT, so no need to detach here
		output = self.dense(output) # linearly transform the hidden states into scores over the vocabulary
		return output, current_hidden_state

Using word embeddings instead of one-hot encoding

In the video lectures, we considered examples where letters were encoded as one-hot vectors. Although such an encoding is intuitive, it becomes a problem when the size of our space (i.e. the number of letters/words) is very large. Imagine a dataset with 10000 unique words: each word would be a 10000-dimensional vector with a single non-zero entry, and thus the encoding space would be incredibly sparse.

One-hot encoding also has the problem that all word vectors are orthogonal (and equidistant), so we cannot compare word similarities in a reasonable way. The network would therefore likely be restricted to learning only surface-level word associations. Such associations are fine if we simply want a text generator, but they are problematic if we want to explore how a network might learn more complex structure among words (this is an entire research field).

Therefore, we will use word embeddings, which encode words in an x-dimensional space (where x is set by OPTS.embedding_size). Word embeddings are a more efficient way to encode words, as they require fewer dimensions than the total number of words in our dataset and allow for a continuous, non-sparse encoding space. In addition, we will allow the network to update the embedding space such that more complex associations between words might be learned over the course of training.
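
For intuition, here is a minimal, standalone sketch (not part of the lab code) of what the embedding layer does: nn.Embedding is just a learnable lookup table from word index to vector.

# a toy embedding: 10 possible words, each mapped to a learnable 4-dimensional vector
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
word_indices = torch.tensor([[2, 5, 5]])     # one sequence of three word indices
print(toy_embedding(word_indices).shape)     # torch.Size([1, 3, 4])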

Although we do not do so in this lab, we can load the learned embedding space from the trained network and examine how words are clustered in this space. A common approach to do this is using t-SNE.
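
If you want to try this yourself, something like the following sketch could be used after training the network (net and vocab_to_int are defined below). This is an illustrative assumption rather than lab code, and it requires scikit-learn, which the lab does not otherwise use:

from sklearn.manifold import TSNE  # assumes scikit-learn is installed

# pull the learned embedding matrix out of the trained network: shape (n_vocab, embedding_size)
emb = net.embedding.weight.detach().cpu().numpy()
emb_2d = TSNE(n_components=2).fit_transform(emb)  # project every word vector into 2-D

# plot the 100 most frequent words (vocab_to_int is ordered from most to least frequent)
plt.figure()
for word, idx in list(vocab_to_int.items())[:100]:
    plt.scatter(emb_2d[idx, 0], emb_2d[idx, 1], s=5)
    plt.annotate(word, (emb_2d[idx, 0], emb_2d[idx, 1]), fontsize=8)
plt.show()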

Instantiate the RNN

Let's prep our network:
if torch.cuda.is_available():
	device = torch.device("cuda")
	print("Using GPU")
else:
	device = torch.device("cpu")
	print("Using CPU")

# seed for reproducibility
torch.manual_seed(222)

# instantiate the network
net = RNN(n_vocab, OPTS.seq_size, OPTS.embedding_size, OPTS.hidden_size)
net = net.to(device)
criterion = nn.CrossEntropyLoss() # compare predicted word scores with the target word indices
optimizer = torch.optim.Adam(net.parameters(), lr=OPTS.learning_rate)
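
As a quick sanity check (not in the original lab code), we can push a dummy batch of word indices through the untrained network and confirm the output shapes:

# random word indices standing in for a real batch
dummy_in = torch.randint(0, n_vocab, (OPTS.batch_size, OPTS.seq_size)).to(device)
# initial hidden state: (num_layers, batch_size, hidden_size)
dummy_h = torch.zeros(OPTS.num_layers, OPTS.batch_size, OPTS.hidden_size).to(device)
out, h = net(dummy_in, dummy_h)
print(out.shape)  # (batch_size, seq_size, n_vocab): one score per vocabulary word at every timestep
print(h.shape)    # (num_layers, batch_size, hidden_size)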

Sampling from the network

Before we start training, we will define one last function that allows us to sample from the network. In contrast to CNNs, where we could test our model by measuring its classification accuracy on never-before-seen stimuli (e.g. images), we will test our RNN by sampling from it, meaning that we will produce sentences; this is because our network is trained to produce phrases. Sampling from the network every now and then lets us see how well it is learning sequential associations between words, and how it improves with more training. We will call this function every few iterations and have it print the sampled sequence to the console. On each call, we will give the network a sequence of two input words and capture the output after the second word. We will then find the \(k\) nearest word matches and randomly choose one of these words as the input to the next timestep. This word-selection process then repeats on every subsequent timestep in our sampling sequence. A figure of this operation is below:
Fig 3. How we use the RNN to generate a sample of text. \(k\) is set via OPTS.predict_top_k.

Note that here, we don't have a target sequence to compare the outputs to. Once we sample, we want to print the words to the console so we can see how good (or bad) the network is.
# Create function to sample from network during training
def sampleFromRNN(device, net, n_vocab, vocab_to_int, int_to_vocab, top_k):
	net.eval() # turn network to eval mode (otherwise, remains in training mode)
	words = ['I', 'can'] # starting words for the sampled sequence (feel free to change these)

	# initial hidden state: zeros with shape (num_layers, batch=1, hidden_size), on the same device as the network
	hidden_state = torch.zeros(OPTS.num_layers, 1, OPTS.hidden_size).to(device)

	for jj in words: # loop through the initial words in the supplied sampling sequence
		ix = torch.tensor([[vocab_to_int[jj]]]).to(torch.int64).to(device) # word index as a 1x1 tensor on the network's device
		output, hidden_state = net(ix, hidden_state)

	# find the top k matching words
	_, top_ix = torch.topk(output[0], k=top_k)
	choices = top_ix.tolist()
	# randomly select one of these top matching words as the next word in the sequence
	choice = np.random.choice(choices[0])

	words.append(int_to_vocab[choice])

	# begin sampling from selected word above as input 1
	for _ in range(25):
		ix = torch.tensor([[choice]]).to(torch.int64).to(device)
		output, hidden_state = net(ix, hidden_state)

		# do random choice selection again
		_, top_ix = torch.topk(output[0], k=top_k)
		choices = top_ix.tolist()
		choice = np.random.choice(choices[0])
		words.append(int_to_vocab[choice]) # add chosen word to sequence we print

	print(' '.join(words))

As per the function above, we will start our sampling with the two words "I" and "can". Feel free to change these words to anything you like (you can also provide more starting words). On each sample, the hidden state in response to the input sequence "I can" is computed and then used (and updated) during sampling. We will then sample 25 more words.

Training the network

We will now train our network. The code is commented below. On each epoch (one pass through the full text file), we split the text into batches which are comprised of a handful of sequences (and their associated target outputs). Recall that each batch has a number of sequences (defined by OPTS.batch_size). We compute the loss and update the weights after the network is trained with all sequences within one batch.

For each sequence, we compute the output of the network at each timestep (i.e. the output in response to each word in the sequence). We do this for every sequence in the batch, recording the output at each timestep; this matters because the same weights are used for every sequence in the batch. We then calculate the loss, backpropagate it, clip the gradient (if necessary), and update the weights. Although we could make our batch size smaller (e.g. only one sequence), the gradient estimate would then bounce around the cost surface, so learning would not be as effective as with larger batch sizes.

Finally, we will sample text from the RNN on certain iterations during training. We can observe how the generated text improves over the course of learning. We will also plot the loss every few iterations to a figure to ensure that the network is indeed reducing the loss over training.
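
Since the full training loop is not reproduced in this handout, here is a minimal sketch of what it might look like, based on the steps described above. The exact variable names, the choice to reset the hidden state to zeros for every batch, and the loss-plotting details are assumptions, so treat this as an outline rather than the lab's official code:

loss_history = []
iteration = 0
for epoch in range(OPTS.num_epoch):
    # generators can only be looped through once, so create a fresh one each epoch
    batches = get_batches(in_text, out_text, OPTS.batch_size, OPTS.seq_size)

    for x, y in batches:
        iteration += 1
        net.train()               # sampling switches the network to eval mode, so switch back
        optimizer.zero_grad()     # clear gradients from the previous update

        # convert the numpy batches to integer tensors on the same device as the network
        x = torch.tensor(x, dtype=torch.int64).to(device)
        y = torch.tensor(y, dtype=torch.int64).to(device)

        # start each batch from a zero hidden state (assumption: no state carried across batches)
        hidden_state = torch.zeros(OPTS.num_layers, OPTS.batch_size, OPTS.hidden_size).to(device)

        # forward pass through every timestep of every sequence in the batch
        output, hidden_state = net(x, hidden_state)

        # CrossEntropyLoss expects (N, n_vocab) scores and (N,) integer targets
        loss = criterion(output.reshape(-1, n_vocab), y.reshape(-1))
        loss.backward()           # backpropagation through time

        # clip gradients to avoid exploding gradients, then update the weights
        nn.utils.clip_grad_norm_(net.parameters(), OPTS.gradients_norm)
        optimizer.step()

        # every OPTS.sampling_freq iterations, report the loss and sample a phrase
        if iteration % OPTS.sampling_freq == 0:
            print('Epoch {}, iteration {}, loss {:.3f}'.format(epoch, iteration, loss.item()))
            sampleFromRNN(device, net, n_vocab, vocab_to_int, int_to_vocab, OPTS.predict_top_k)
            loss_history.append(loss.item())

# plot the recorded loss over the course of training
plt.plot(loss_history)
plt.xlabel('Weight update (recorded every {} iterations)'.format(OPTS.sampling_freq))
plt.ylabel('Loss')
plt.show()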

Note that the cost (i.e. loss) plotted in the figure sometimes increases relative to the previous update. This occurs because each batch does not contain all possible input sequences. If every batch included all sequences from the text file, then each weight update would always reduce the cost. When a batch does not include all possible sequences, however, the weight update may not be ideal for sequences in other batches. On average, though, the cost decreases with more training.

Making even better RNNs

While this RNN can generate sensible sequences of words, it is not amazing. Indeed, vanilla RNNs are limited in their ability to link concepts/knowledge across large numbers of timesteps. To combat this, we can use a more complex class of RNNs called Long Short-Term Memory (LSTM) networks. These networks have both temporally local and distant memory storage, instead of the rather simple memory storage used in vanilla RNNs.
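
As a starting point for your own experiments, here is a hedged sketch of how the RNN class above might be modified to use an LSTM. This is one possible implementation, not the lab's official solution:

class LSTMNet(nn.Module):
    def __init__(self, n_vocab, seq_size, embedding_size, hidden_size):
        super(LSTMNet, self).__init__()
        self.seq_size = seq_size
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(n_vocab, embedding_size)
        # nn.LSTM has the same interface as nn.RNN, but maintains a hidden state AND a cell state
        # (note that nn.LSTM does not take a nonlinearity argument)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True,
                            num_layers=OPTS.num_layers)
        self.dense = nn.Linear(hidden_size, n_vocab, bias=True)

    def forward(self, x, prev_states):
        # prev_states is a tuple: (hidden state, cell state)
        embed = self.embedding(x)
        output, current_states = self.lstm(embed, prev_states)
        output = self.dense(output)
        return output, current_states

The only other change needed is that wherever the code above initializes a single zero hidden state, an LSTM needs a tuple of two zero tensors (the hidden state and the cell state), each of shape (OPTS.num_layers, batch size, OPTS.hidden_size).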

Additional resources