Collaborative filtering for movie recommendations
The aim of this tutorial is to train a neural network that gives movie recommendations to users based on their ratings of movies they have already watched. To do so, we will build a model that uses the collaborative filtering technique to make predictions based on the known preferences of other users. The training data comes from the MovieLens dataset. This example is adapted from the official Keras examples.
First off, we import all required packages for machine learning, decompression, plotting and data manipulation.
Preprocessing the data
Next, we need the data for training and validation. We can download the compressed dataset from the following URL:
Let us then decompress the dataset.
Next, we will read the data from the CSV files.
The data in the movies.csv file has the columns:
movieId, title, genres
The data in the ratings.csv file has the columns:
userId, movieId, rating, timestamp
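As a concrete illustration of the format, a single ratings.csv row can be parsed with plain JavaScript. This is a sketch for illustration only; the notebook itself uses a dataframe library, and the sample row below is made up. Note that this naive comma split would not work for movies.csv, whose title column can itself contain commas.

```javascript
// Parse one line of ratings.csv into an object.
// The column order follows the MovieLens format described above.
function parseRatingLine(line) {
  const [userId, movieId, rating, timestamp] = line.split(",");
  return {
    userId: Number(userId),
    movieId: Number(movieId),
    rating: Number(rating),
    timestamp: Number(timestamp),
  };
}

// Made-up sample row for illustration.
const example = parseRatingLine("1,31,2.5,1260759144");
```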
To make manipulating the data a bit easier, let's create a danfo.js dataframe for the ratings.
In order for the neural network to distinguish different users or movies, they are referred to by their indices. Unfortunately, the ids from the input dataset contain gaps and cannot be used directly. Consequently, we will create our own indices for both users and movies.
In the next cell we will collect the unique ids present in the dataset and then map them to our own representation.
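The mapping step can be sketched in plain JavaScript as follows. The function name and the sample ids are illustrative, assuming the raw ids are available as an array:

```javascript
// Build contiguous indices for the (possibly gappy) MovieLens ids.
// Each new id is assigned the next free index in order of first appearance.
function buildIndex(ids) {
  const id2index = new Map();
  for (const id of ids) {
    if (!id2index.has(id)) {
      id2index.set(id, id2index.size);
    }
  }
  return id2index;
}

// Made-up ids with gaps: 1 -> 0, 5 -> 1, 9 -> 2.
const user2index = buildIndex([1, 1, 5, 9, 5]);
```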
Additionally, we have to add our new indices to the dataframe.
Finally, let's collect some information from our dataset, such as the number of users, the number of movies, and the minimum and maximum rating.
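A plain-JavaScript sketch of collecting these statistics, assuming the ratings are available as an array of records carrying our new indices (all names here are illustrative):

```javascript
// Derive the dataset statistics used later: the user/movie counts size the
// embedding layers, and min/max rating are needed to normalize the labels.
function datasetStats(ratings) {
  let minRating = Infinity;
  let maxRating = -Infinity;
  const users = new Set();
  const movies = new Set();
  for (const r of ratings) {
    users.add(r.userIndex);
    movies.add(r.movieIndex);
    minRating = Math.min(minRating, r.rating);
    maxRating = Math.max(maxRating, r.rating);
  }
  return { numUsers: users.size, numMovies: movies.size, minRating, maxRating };
}
```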
Prepare training and validation data
Before we can pass the data to our training algorithm, it has to be put into the right form. Our network will be given a user and a movie and is supposed to predict the rating. Therefore our input consists of two tensors: one that contains the user indices and one that contains the movie indices. The output tensor has to contain the ratings for the movies.
Before the creation of the input and output tensors, the data has to be shuffled.
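Shuffling parallel arrays in lockstep can be sketched with a Fisher-Yates shuffle. This is an illustrative helper, assuming the user indices, movie indices and ratings live in three aligned arrays:

```javascript
// Fisher-Yates shuffle applied to parallel arrays, so that each
// (user, movie, rating) triple stays aligned across the three arrays.
function shuffleTogether(users, movies, ratings) {
  for (let i = users.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [users[i], users[j]] = [users[j], users[i]];
    [movies[i], movies[j]] = [movies[j], movies[i]];
    [ratings[i], ratings[j]] = [ratings[j], ratings[i]];
  }
}
```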
Now that the data is ready to be used, we can start building the model. With TensorFlow.js it is very convenient to use the Keras-style API that uses layers as an abstraction to build your model. So in the next step we will create the different layers that we will need later.
Firstly, we have to specify the inputs to our model. As mentioned before, we have one input for the user index and one input for the movie index. Since both are scalar values, we set the shape to [1]. For performance reasons the algorithm typically groups multiple inputs together to form a batch, so the input will automatically be extended by a batch dimension to have the shape [batchSize, 1].
The datatype of our inputs is 32-bit integers.
The next layers are embedding layers, which play a crucial role in the neural network. They turn an index, which has no mathematical structure, into a vector representation that our model can work with. The goal is to have a representation for users and movies that possesses certain mathematical properties, so that one can measure the "similarity" between movies. Movies that are rated similarly by users should end up with similar embedding representations.
For our example we choose the lengths of our embedding vectors to be
With that we create an embedding layer for our users and movies.
Additionally, let's create a bias layer for our users and movies.
With the main layers defined, we can start connecting them to define the model architecture. The architecture of a collaborative filtering network looks roughly as shown in the output of the next cell.
Let us start connecting the layers according to the diagram. Keep in mind that in diagrams of machine learning models the input is commonly at the bottom and the output at the top. Calling the apply method on a layer specifies the input for that layer.
First up, we connect the embedding layers for the users and movies to the input layers of the model.
Secondly, we compute a "match score" between the user and movie embeddings by using the dot product.
Let us then compute a per-user and per-movie bias, which we add to the computed "match score".
Finally, we pass the computed sum to a sigmoid activation function that squashes the output into the [0, 1] range.
With all layers connected, we can create the model by specifying its inputs and outputs.
For the training of the model we use a binary cross-entropy loss function and an Adam optimizer with a learning rate of
Before we train the model, let's create a plot where we can monitor the training progress.
The next cell starts the training process. During training it is very important to not only track the evolution of the loss function on the training samples but also to see how it evolves on samples that are not used for training. These samples are called validation samples. By specifying validationSplit: 0.1, we tell the algorithm to keep 10% of the samples aside to compute the validation loss.
Be aware that executing the next cell might take a couple of minutes.
Show top 10 movie recommendations
Great, we have trained our model. Let us now see if we can get some movie recommendations for some of the users. Of course, we only want to recommend movies that they haven't already watched.
To handle the movie data, we create a dataframe for the movies.
Then, we pick one of the users at random.
Let's see which movies they watched and how they rated those movies.
But they are probably more interested in movies they haven't watched yet, of which there are still quite a few.
Since we now know the indices of the unwatched movies, we have to put them into a form that we can pass to the model to make the prediction. So we convert the movie indices into a tensor and create a second tensor that repeats the user's index once per movie.
Let's get the predicted ratings of that user for all movies that they haven't watched.
Now we can see which movies they would probably rate highest.
And display their names.