Fine-tuning an ALBERT encoder for classification tasks
In this example we want to train an intent classifier. The goal of the intent classifier is to take queries in the form of text and output the intent behind the query. We will do this in the context of queries related to bank accounts. The training dataset is taken from PolyAI-LDN and consists of queries like "Is there a way to know when my card will arrive?" and their related intents like "card_arrival". In total the dataset contains queries related to 77 different intents.
We will perform the classification task with a deep neural network built on the ALBERT encoder. Because this results in a very large network that is computationally expensive to train, we will use transfer learning to speed up the training process. Luckily, the ALBERT repository provides a model that has been pretrained on the Wikipedia text corpus. The input text is tokenized with the sentencepiece text tokenizer.
Please keep in mind that training the whole ALBERT model requires powerful hardware and can probably not be done on a laptop. For the case of limited hardware resources we provide an alternative in which only parts of the model are trained.
Import required libraries
To perform the preprocessing and the training we will need the following libraries. We will use the sentencepiece library for the text tokenization. The training will be done with the TensorFlow.js library. Since the ALBERT encoder isn't available as an official model yet, we also need to import the Albertmodel class. Additionally we need the papaparse library to parse .csv files into the JSON format.
Fetch the training data
As a next step we need to get the training and validation data. Both are available as .csv files. With the following commands we fetch the .csv files and save their text.
To make it easier to handle the data we use the papaparse library to convert the text in the csv format into a JSON object. Once we have done that, we can look at the entries individually. Each entry combines a query text and the corresponding intent.
In total there are 77 different intents. We can get a list of them from the given repository.
Now that we've collected all the data, it is time to put it in a form that the ALBERT encoder understands. We have to transform the queries as well as the intents into a numerical form. Let's start with the intents.
The task that we actually want to perform is called multiclass classification, where each query can only be labeled with one intent. For multiclass classification tasks it is common to represent the model output as a vector with the length of the number of intents. Each intent is then characterized by a vector with a "1" at its corresponding position and "0" everywhere else. This is called a one-hot encoded vector.
[ 1, 0, 0, 0, 0, 0, 0, ...] intent: "card_arrival"
[ 0, 1, 0, 0, 0, 0, 0, ...] intent: "card_linking"
[ 0, 0, 1, 0, 0, 0, 0, ...] intent: "exchange_rate"
In the next cell we create an object that maps each intent to its corresponding one-hot encoded vector.
With that we can convert every intent to its one-hot encoded representation.
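As a minimal sketch of these two steps, the mapping and the conversion could look as follows. The three intents are a small subset used for illustration, and the entry field names `text` and `category` are assumptions about the parsed data, not taken from the dataset files.

```javascript
// Illustrative subset of the 77 intents; the full list comes from the dataset repository.
const intents = ["card_arrival", "card_linking", "exchange_rate"];

// Build an object that maps each intent name to its one-hot encoded vector.
function buildOneHotMap(intentList) {
  const map = {};
  intentList.forEach((intent, index) => {
    const vector = new Array(intentList.length).fill(0);
    vector[index] = 1; // a single "1" at the position of this intent
    map[intent] = vector;
  });
  return map;
}

const oneHotByIntent = buildOneHotMap(intents);

// Entries with assumed field names; the second query is made up for this example.
const exampleEntries = [
  { text: "Is there a way to know when my card will arrive?", category: "card_arrival" },
  { text: "What is the exchange rate today?", category: "exchange_rate" },
];

// Convert the intent of every entry to its one-hot representation.
const labels = exampleEntries.map((entry) => oneHotByIntent[entry.category]);
console.log(labels); // [ [ 1, 0, 0 ], [ 0, 0, 1 ] ]
```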
Tokenize the training sentences
Now that we have the intents ready, we have to get the queries into a numerical representation. For that the input is split into tokens. These can be individual words like "the, cat" or parts of words like "ing, ed". These tokens are then translated into numbers using a vocabulary where every token is assigned a unique number.
We can convert the input text into token arrays using the sentencepiece library. The following command creates a sentencepiece preprocessor object from a given model file.
Let's see the preprocessor in action. We will first run the function cleanText() to clean up the text and convert it to lowercase. Converting the text "Hello to tensorflow!" yields the following indices:
To see whether it worked we can decode the ids and check the resulting text. Keep in mind that we converted the text to lowercase beforehand.
We are now ready to convert our input sentences to their corresponding ids.
In addition to the input ids the encoder requires certain "control tokens" that convey semantic meaning to the model. The textual representations of the control tokens are [CLS], [SEP], and [PAD]. Each control token serves a different function in the input sequence. The [CLS] token has to appear at the beginning of the input and specifies that we want to perform a classification task. The [CLS] token is then followed by one of the input sentences. The end of the input sentence is marked by a [SEP] token. In the case of sentence pair classification, a second sentence followed by another [SEP] token has to be appended. To achieve a fixed-length input, [PAD] tokens are added to the end until a predetermined maximum sequence length is reached.
Let's look at an example. The first input sentence has to look like this:
[CLS] I am still waiting on my card? [SEP] [PAD] [PAD] [PAD] ...
The indices of the control tokens are as follows:
To get the right format we need to prepend the [CLS] token "2" to the input sentence, then append the [SEP] token "3" and fill the rest of our input with [PAD] tokens "0". Let's look at the example from before.
Formatted input sentence:
[CLS] I am still waiting on my card? [SEP] [PAD] [PAD] [PAD] ...
Corresponding input indices:
[ 2, 31,589,174,1672,27,51,2056,60, 3, 0, 0, 0, 0 ... ]
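The formatting step can be sketched in a few lines. The helper name `formatInputIds` and the maximum length of 14 are chosen for illustration only. The control token ids are the ones given above.

```javascript
// Control token ids as given in the text: [PAD] = 0, [CLS] = 2, [SEP] = 3.
const PAD_ID = 0;
const CLS_ID = 2;
const SEP_ID = 3;

// Wrap a tokenized sentence in [CLS] ... [SEP] and pad it to maxSeqLength.
// This sketch does not truncate; it rejects inputs that do not fit.
function formatInputIds(tokenIds, maxSeqLength) {
  const ids = [CLS_ID, ...tokenIds, SEP_ID];
  if (ids.length > maxSeqLength) {
    throw new Error("Sentence is too long for the given maximum sequence length");
  }
  while (ids.length < maxSeqLength) {
    ids.push(PAD_ID);
  }
  return ids;
}

// Token ids of the example sentence from the text.
const sentenceIds = [31, 589, 174, 1672, 27, 51, 2056, 60];
const inputIds = formatInputIds(sentenceIds, 14);
console.log(inputIds);
// [ 2, 31, 589, 174, 1672, 27, 51, 2056, 60, 3, 0, 0, 0, 0 ]
```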
With the indices of our input sentences in the right format, we need two additional inputs for the ALBERT model: typeIds and inputMask.
The typeIds are needed for sentence pair classification. The typeIds tensor has the same length as the inputIds and has the value "0" at every index of the first sentence and the value "1" at every index of the second sentence. In this example, with only one sentence, it contains only "0"s.
The inputMask marks the non-padded region. It contains the value "1" where there is no padding and "0" where there is padding.
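For a single-sentence input both arrays can be derived directly from the padded input ids. The following sketch uses hypothetical helper names and assumes that the id "0" is used only for padding.

```javascript
// Assumption: id 0 is reserved for [PAD] and never denotes a real token.
const PAD_TOKEN_ID = 0;

// With only one sentence, every position belongs to the first segment.
function buildTypeIds(ids) {
  return ids.map(() => 0);
}

// 1 marks a real token (including [CLS] and [SEP]), 0 marks padding.
function buildInputMask(ids) {
  return ids.map((id) => (id === PAD_TOKEN_ID ? 0 : 1));
}

// Padded input ids of the example sentence from the text.
const paddedIds = [2, 31, 589, 174, 1672, 27, 51, 2056, 60, 3, 0, 0, 0, 0];
const typeIds = buildTypeIds(paddedIds);
const inputMask = buildInputMask(paddedIds);

console.log(typeIds);   // [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
console.log(inputMask); // [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0 ]
```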
In the next cell we prepare the input for the training process by performing all the steps that we just discussed. For this example we restrict the sequence length to 32 to reduce the computational effort required for training. This truncates some sentences and is typically not what you want to do. If you have the appropriate hardware available feel free to set the
We will do the same with the validation data.
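One way to handle sentences that exceed the sequence length is to cut off the sentence tokens before the control tokens are added, so that [CLS] and [SEP] still fit. This is a sketch of that idea under the stated assumption; the helper name is hypothetical and the preparation cell may truncate differently.

```javascript
// Keep at most maxSeqLength - 2 sentence tokens, leaving room for [CLS] and [SEP].
function truncateTokenIds(tokenIds, maxSeqLength) {
  return tokenIds.slice(0, maxSeqLength - 2);
}

// A made-up "sentence" of 40 token ids, longer than the limit of 32.
const longSentence = Array.from({ length: 40 }, (_, i) => 100 + i);
const truncated = truncateTokenIds(longSentence, 32);
console.log(truncated.length); // 30

// Short sentences are left unchanged.
const shortSentence = [31, 589, 174];
console.log(truncateTokenIds(shortSentence, 32)); // [ 31, 589, 174 ]
```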
Build the ALBERT classifier
With the training data in the correct shape, we can start building the deep neural network that will perform the classification task. The general architecture of our neural network is as follows:
- Input layer
- Albert layer
- Classification layer
The first layer is used as the input layer. It takes the three inputs inputIds, segmentIds, and attentionMask, each of integer type. Here segmentIds and attentionMask correspond to the typeIds and inputMask described above.
The next layer is the ALBERT layer, which itself consists of three internal layers: the embedding, the encoder, and the pooler layer. The embedding layer turns every token id into a vector representation of that token. The concept of representing text tokens as vectors can initially be hard to grasp. You can find out more about word embeddings in the TensorFlow documentation. The goal is to have a mathematical representation in which "similar" words have similar vector representations. So for instance the tokens "cat" and "dog" are likely more similar than "cat" and "rocket". The benefit of the vector representation is that you can "measure" the similarity of two tokens by computing the scalar product or the cosine similarity of their vectors.
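To make the idea of "measuring" similarity concrete, here is a small sketch of the cosine similarity. The three-dimensional vectors are made up for illustration and are not real ALBERT embeddings.

```javascript
// Scalar (dot) product of two vectors of equal length.
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine similarity: the dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Made-up toy embeddings, chosen so that "cat" and "dog" point in similar directions.
const cat = [0.9, 0.8, 0.1];
const dog = [0.8, 0.9, 0.2];
const rocket = [0.1, 0.2, 0.9];

console.log(cosineSimilarity(cat, dog));    // close to 1
console.log(cosineSimilarity(cat, rocket)); // noticeably smaller
```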
The second internal layer of the ALBERT layer is the encoder itself. It is a rather complex design that uses the transformer architecture. The ALBERT encoder is very similar to a BERT encoder layer, on which you can find more information here.
The last internal layer of the ALBERT layer is the pooler layer. It consists of a dense layer that performs a part of the classification.
For this example we will load the model architecture and the layer weights from the pretrained model provided by Google. We can load the base model with the following command:
Since we are only interested in the ALBERT layer of the base model, we create a variable that references that layer.
In order to reduce the computational effort of training the model, we set the weights of the internal embedding layer of the ALBERT layer to not be trainable. This reduces the number of trainable weights by almost 4 million and should not lead to a significant loss of accuracy.
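The figure of almost 4 million weights can be checked with a rough count. The sizes below are the published ALBERT-base configuration values and are an assumption here, since the text does not state which configuration is loaded or exactly which weights the embedding layer of this implementation contains.

```javascript
// Assumed ALBERT-base configuration values (not stated in the text).
const vocabSize = 30000;          // number of sentencepiece tokens
const embeddingSize = 128;        // length of each token embedding vector
const maxPositions = 512;         // number of position embeddings
const tokenTypes = 2;             // first and second sentence

const wordEmbeddings = vocabSize * embeddingSize;        // 3,840,000
const positionEmbeddings = maxPositions * embeddingSize; // 65,536
const tokenTypeEmbeddings = tokenTypes * embeddingSize;  // 256
const layerNorm = 2 * embeddingSize;                     // scale and offset: 256

const totalEmbeddingWeights =
  wordEmbeddings + positionEmbeddings + tokenTypeEmbeddings + layerNorm;
console.log(totalEmbeddingWeights); // 3906048
```

Under these assumptions the word embeddings alone account for 3.84 million weights, which is consistent with the number quoted above.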
If the model is still too large, the encoder layer can also be set non-trainable. This will depend on your hardware resources. However the accuracy impact might be larger.
Finally, we connect the input layers to our ALBERT layer. The first output of the ALBERT layer consists of the attention weights, which we don't need for our classification task.
After the ALBERT layer we will add another classification layer that transforms the output into the correct form for our specific classification task. For that we use a dense layer with 77 units and a softmax activation function. The softmax function assigns each class a value between 0 and 1 such that all values add up to 1.
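The behaviour of the softmax function can be illustrated with a plain JavaScript sketch. This is not the TensorFlow.js implementation, and the three input scores are made up. Subtracting the largest score before exponentiating is a common trick to avoid numerical overflow and does not change the result.

```javascript
// Turn arbitrary real-valued scores into values between 0 and 1 that sum to 1.
function softmax(scores) {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Made-up scores for three classes.
const probabilities = softmax([2.0, 1.0, 0.1]);
console.log(probabilities); // approximately [0.659, 0.242, 0.099]
console.log(probabilities.reduce((a, b) => a + b, 0)); // 1, up to rounding error
```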
To reduce overfitting we add a dropout layer before the dense layer.
Again, we need to connect the dropout layer and the dense layer to the previous ALBERT layer.
The model is then created by providing the inputs and outputs of the model.
You can have a look at the classification model in your browser console with the following command:
Train the model
With the model architecture ready, it is time to train the model. The training adjusts the weights of the network such that it minimizes a given loss function. For that we need to specify which loss function we want to minimize. Since we are performing a multiclass classification task we want to use the categorical cross-entropy loss function.
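For a one-hot encoded label the categorical cross-entropy reduces to the negative logarithm of the probability that the model assigns to the correct class. The following plain JavaScript sketch shows this for a single example. The clipping value `epsilon` is an assumption added to avoid taking the logarithm of zero, and the predictions are made up.

```javascript
// Categorical cross-entropy for one example: -sum(label_i * log(predicted_i)).
function categoricalCrossentropy(oneHotLabel, predicted, epsilon = 1e-7) {
  let loss = 0;
  for (let i = 0; i < oneHotLabel.length; i++) {
    // Clip the prediction to avoid log(0).
    loss -= oneHotLabel[i] * Math.log(Math.max(predicted[i], epsilon));
  }
  return loss;
}

const label = [0, 1, 0]; // the correct class is the second one

// A good prediction puts most of the probability on the correct class.
const goodLoss = categoricalCrossentropy(label, [0.1, 0.8, 0.1]);
// A bad prediction puts most of the probability on a wrong class.
const badLoss = categoricalCrossentropy(label, [0.7, 0.2, 0.1]);

console.log(goodLoss); // about 0.223, which is -ln(0.8)
console.log(badLoss);  // about 1.609, which is -ln(0.2)
```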
The weights are adjusted by a variant of the gradient descent algorithm, which is implemented by an optimizer. In this case we will use the Adam optimizer with a learning rate of 0.00001. This is a rather small learning rate, but since most of the weights in our model come from the pretrained model, we do not want them to change too quickly. They should already have near-optimal values.
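To give an impression of what the optimizer does, here is a sketch of a single Adam update step in plain JavaScript. It follows the standard Adam update rule and is not the TensorFlow.js implementation. The values for `beta1`, `beta2`, and `epsilon` are commonly used defaults and are assumptions; only the learning rate is taken from the text.

```javascript
// One Adam update. state holds the step counter t and the running averages m and v.
function adamStep(weights, grads, state, options = {}) {
  const { learningRate = 1e-5, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 } = options;
  state.t += 1;
  return weights.map((w, i) => {
    // Running averages of the gradient and of the squared gradient.
    state.m[i] = beta1 * state.m[i] + (1 - beta1) * grads[i];
    state.v[i] = beta2 * state.v[i] + (1 - beta2) * grads[i] * grads[i];
    // Bias correction compensates for the zero initialization of m and v.
    const mHat = state.m[i] / (1 - Math.pow(beta1, state.t));
    const vHat = state.v[i] / (1 - Math.pow(beta2, state.t));
    return w - (learningRate * mHat) / (Math.sqrt(vHat) + epsilon);
  });
}

// Made-up weights and gradients.
const weights = [0.5, -0.3];
const adamState = { t: 0, m: [0, 0], v: [0, 0] };
const updated = adamStep(weights, [0.2, -4.0], adamState);
console.log(updated); // each weight moves by roughly the learning rate, against the sign of its gradient
```

Note that in the first step both weights change by about 0.00001 even though their gradients differ by a factor of 20, which illustrates why a small learning rate keeps the pretrained weights from changing quickly.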
We specify the optimizer and loss function for our model with the following command:
Next, we create datasets for our training and validation data. With them we can specify the batch size for the training process. Batch sizes typically range from 4 to 64. Larger batch sizes lead to faster training but also require more powerful hardware. For this example you might need to reduce the batch size to 4, depending on your hardware.
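What batching means can be sketched without any library: the examples are split into consecutive groups of at most `batchSize` elements, and the weights are updated once per group. The helper below is a hypothetical illustration in plain JavaScript, not the dataset API used for training.

```javascript
// Split an array of examples into consecutive batches of at most batchSize elements.
function toBatches(examples, batchSize) {
  const batches = [];
  for (let start = 0; start < examples.length; start += batchSize) {
    batches.push(examples.slice(start, start + batchSize));
  }
  return batches;
}

// Ten made-up examples, represented by their indices.
const examples = Array.from({ length: 10 }, (_, i) => i);
const batches = toBatches(examples, 4);
console.log(batches.map((batch) => batch.length)); // [ 4, 4, 2 ]
```

A smaller batch size therefore means more update steps per pass over the data, while each step needs less memory.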
The following cell starts the training process. It will save the model to the browser's IndexedDB every 10 minutes. You can follow the loss function in your browser console.
In case you want to skip the training process, you can load an already trained model from the following url:
To see whether the training was successful we can now look at some predictions. The following cell converts an input sentence into the required input tensors for the classification model.
The following cell computes the predictions and outputs the intent with the highest probability. You can try different input sentences and see whether the classifications make sense.
Finally you can look at the confidence with which the model made the classification.
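Selecting the predicted intent and its confidence from the model output amounts to finding the largest probability and its position. A minimal sketch in plain JavaScript, with a hypothetical helper name, three illustrative intents, and made-up probabilities instead of real model output:

```javascript
// Return the intent with the highest probability together with that probability.
function predictIntent(probabilities, intentNames) {
  let best = 0;
  for (let i = 1; i < probabilities.length; i++) {
    if (probabilities[i] > probabilities[best]) {
      best = i;
    }
  }
  return { intent: intentNames[best], confidence: probabilities[best] };
}

// Illustrative subset of the intents and a made-up softmax output.
const intentNames = ["card_arrival", "card_linking", "exchange_rate"];
const prediction = predictIntent([0.91, 0.03, 0.06], intentNames);
console.log(prediction); // { intent: 'card_arrival', confidence: 0.91 }
```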