Movie Recommendation

View Project here: Project Repo

Objective

This project trains a machine learning model to learn users' preferences from the MovieLens ratings dataset and then recommend movies to any user based on what it has learned. Because the project deals with a large dataset, big data tools such as the Spark framework have been used.

Dataset

The dataset used in this project is the MovieLens 20M Dataset.

The dataset describes ratings and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. The dataset was generated on October 17, 2016. Users were selected at random for inclusion, and all selected users had rated at least 20 movies.

User Based Collaborative Filtering

Training RMSE Test RMSE

User-to-user collaborative filtering is a machine learning technique that predicts the items a user might like on the basis of the ratings given to those items by other users whose tastes are similar to those of the target user.

In this project, users whose tastes are similar to those of user $i$ are referred to as the neighbors of user $i$, and each neighbor is denoted $i^{\prime}$. The similarity score, or weight, between user $i$ and a neighbor $i^{\prime}$ is calculated as follows:

$$w_{ii^{\prime}} = \frac{ \sum\limits_{j\in\Psi_{ii^{\prime}}} (r_{ij} - \bar{r}_{i}) (r_{i^{\prime}j} - \bar{r}_{i^{\prime}}) } { \sqrt{\sum\limits_{j\in\Psi_{ii^{\prime}}} (r_{ij} - \bar{r}_{i})^2} \sqrt{\sum\limits_{j\in\Psi_{ii^{\prime}}} (r_{i^{\prime}j} - \bar{r}_{i^{\prime}})^2} }$$

where,

  • $\Psi_{i}$: set of movies that user $i$ has rated
  • $\Psi_{i^{\prime}}$: set of movies that user $i^{\prime}$ has rated
  • $\Psi_{ii^{\prime}}$: set of movies that both users $i$ and $i^{\prime}$ have rated, $\Psi_{ii^{\prime}} = \Psi_{i} \cap \Psi_{i^{\prime}}$
  • $(r_{ij} - \bar{r}_{i})$: the deviation of user $i$'s rating of movie $j$ from his/her average rating $\bar{r}_{i}$. Each user may interpret the rating scale differently, so we work with how far each rating deviates from that user's average rather than with the raw rating.
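As a rough illustration of the weight formula above, here is a minimal pure-Python sketch. The dict-based rating layout, the function name `user_similarity`, and the sample users are assumptions made for illustration and are not taken from the project's Spark code. The sketch also assumes that each user's average is taken over all the movies that user rated, and that the weight is 0 when two users share no movies or one of them shows no variation on the shared movies.

```python
import math

def user_similarity(ratings_i, ratings_k):
    """Pearson-style weight between two users.

    ratings_i, ratings_k: dicts mapping movie id -> rating (hypothetical layout).
    """
    common = set(ratings_i) & set(ratings_k)  # movies both users rated
    if not common:
        return 0.0
    # Assumption: each mean is taken over everything the user rated.
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_k = sum(ratings_k.values()) / len(ratings_k)
    num = sum((ratings_i[j] - mean_i) * (ratings_k[j] - mean_k) for j in common)
    den_i = math.sqrt(sum((ratings_i[j] - mean_i) ** 2 for j in common))
    den_k = math.sqrt(sum((ratings_k[j] - mean_k) ** 2 for j in common))
    if den_i == 0 or den_k == 0:
        return 0.0  # no variation on the shared movies, weight undefined
    return num / (den_i * den_k)

# Made-up users for illustration only.
alice = {"m1": 5.0, "m2": 3.0, "m3": 1.0}
bob = {"m1": 4.0, "m2": 3.0, "m3": 2.0}    # same ordering of tastes as alice
carol = {"m1": 1.0, "m2": 3.0, "m3": 5.0}  # opposite ordering

print(user_similarity(alice, bob))
print(user_similarity(alice, carol))
```

With these toy ratings, the weight for alice and bob comes out close to 1 and the weight for alice and carol close to -1, which matches the intuition that the sign captures agreement or disagreement in taste.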

The predicted rating score for a user $i$ on a movie $j$ is calculated as follows:

$$S(i, j) = \bar{r}_{i} + \frac{\sum\limits_{i^{\prime}\in\Omega_j} w_{ii^{\prime}} (r_{i^{\prime}j} - \bar{r}_{i^{\prime}})} {\sum\limits_{i^{\prime}\in\Omega_j} |w_{ii^{\prime}}|}$$

This score estimates how user $i$ would have rated movie $j$, based on the ratings given to movie $j$ by his/her weighted neighbors $i^{\prime} \in \Omega_{j}$, where $\Omega_{j}$ is the set of users who rated movie $j$.
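The prediction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `predict_user_based`, the list of `(weight, neighbor_ratings)` pairs, and the weights 0.8 and -0.4 are made up for the example rather than computed from data or taken from the project's code. When no neighbor has rated the movie, the sketch falls back to the user's own average.

```python
def mean(ratings):
    """Average of a user's ratings, given as a dict movie id -> rating."""
    return sum(ratings.values()) / len(ratings)

def predict_user_based(user_ratings, movie, neighbors):
    """Predict a rating from weighted neighbor deviations.

    neighbors: list of (weight, neighbor_ratings) pairs (hypothetical layout).
    """
    num = 0.0
    den = 0.0
    for weight, neighbor_ratings in neighbors:
        if movie not in neighbor_ratings:
            continue  # only neighbors who rated the movie contribute
        num += weight * (neighbor_ratings[movie] - mean(neighbor_ratings))
        den += abs(weight)
    base = mean(user_ratings)
    return base if den == 0 else base + num / den

# Made-up data: u1 has not rated m3 yet.
u1 = {"m1": 4.0, "m2": 2.0}
u2 = {"m1": 5.0, "m2": 2.0, "m3": 5.0}
u3 = {"m1": 2.0, "m2": 5.0, "m3": 2.0}
neighbors = [(0.8, u2), (-0.4, u3)]  # illustrative weights, not computed

prediction = predict_user_based(u1, "m3", neighbors)
print(prediction)
```

Here the similar neighbor rated the movie above their own average and the dissimilar neighbor rated it below theirs, so both push the prediction above u1's average of 3.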

The weighted relationships between users, based on their similarity, are depicted in the following graph:

user-user-relationship-img

Item-Based Collaborative Filtering

Training RMSE Test RMSE

Item-item collaborative filtering is a recommendation method that predicts the items a user might like on the basis of the ratings the target user has given to similar items.

In this project, a movie that is similar to movie $j$ is denoted $j^{\prime}$. The similarity score between two movies is calculated as follows:

$$w_{jj^{\prime}} = \frac{ \sum\limits_{i\in\Omega_{jj^{\prime}}} (r_{ij} - \bar{r}_{j}) (r_{ij^{\prime}} - \bar{r}_{j^{\prime}}) } { \sqrt{\sum\limits_{i\in\Omega_{jj^{\prime}}} (r_{ij} - \bar{r}_{j})^2} \sqrt{\sum\limits_{i\in\Omega_{jj^{\prime}}} (r_{ij^{\prime}} - \bar{r}_{j^{\prime}})^2} }$$

where,

  • $\Omega_{j}$ - users who rated movie $j$
  • $\Omega_{jj^{\prime}}$ - users who rated both movie $j$ and movie $j^{\prime}$
  • $\bar{r}_{j}$ - average rating for movie $j$
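To make the sets of raters concrete, the sketch below first regroups user-keyed ratings by movie and then applies the item similarity formula. It is a plain-Python illustration only: the helper names `group_by_movie` and `item_similarity`, the dict layout, and the toy ratings are assumptions, and returning 0 for movies with no shared raters or no variation among them is a choice made for this sketch.

```python
import math
from collections import defaultdict

def group_by_movie(ratings_by_user):
    """Turn {user: {movie: rating}} into {movie: {user: rating}}."""
    by_movie = defaultdict(dict)
    for user, movies in ratings_by_user.items():
        for movie, rating in movies.items():
            by_movie[movie][user] = rating
    return dict(by_movie)

def item_similarity(by_movie, j, k):
    """Pearson-style weight between movies j and k over their shared raters."""
    raters_j, raters_k = by_movie[j], by_movie[k]
    common = set(raters_j) & set(raters_k)  # users who rated both movies
    if not common:
        return 0.0
    # Assumption: each movie's mean is taken over all of its raters.
    mean_j = sum(raters_j.values()) / len(raters_j)
    mean_k = sum(raters_k.values()) / len(raters_k)
    num = sum((raters_j[u] - mean_j) * (raters_k[u] - mean_k) for u in common)
    den_j = math.sqrt(sum((raters_j[u] - mean_j) ** 2 for u in common))
    den_k = math.sqrt(sum((raters_k[u] - mean_k) ** 2 for u in common))
    if den_j == 0 or den_k == 0:
        return 0.0
    return num / (den_j * den_k)

# Made-up ratings for illustration only.
ratings_by_user = {
    "u1": {"m1": 5.0, "m2": 4.0},
    "u2": {"m1": 3.0, "m2": 3.0},
    "u3": {"m1": 1.0, "m2": 2.0, "m3": 5.0},
}
by_movie = group_by_movie(ratings_by_user)
print(item_similarity(by_movie, "m1", "m2"))
print(item_similarity(by_movie, "m1", "m3"))
```

In this toy data m1 and m2 are rated in the same order by all three users, so their weight is close to 1, while m3 has a single rater and therefore gets a weight of 0 under the fallback chosen here.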

The predicted rating score for a user $i$ on a movie $j$ is calculated as follows:

$$S(i, j) = \bar{r}_{j} + \frac{\sum\limits_{j^{\prime}\in\Psi_i} w_{jj^{\prime}} (r_{ij^{\prime}} - \bar{r}_{j^{\prime}})} {\sum\limits_{j^{\prime}\in\Psi_i} |w_{jj^{\prime}}|}$$

where,

  • $\Psi_i$ - movies that user $i$ has rated

This score estimates how user $i$ would have rated movie $j$, based on his/her ratings of the movies $j^{\prime}$ that are similar to movie $j$.
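A small Python sketch of this item-based prediction follows. The function name `predict_item_based`, the dict arguments, and all the numbers (movie averages and weights) are hypothetical values chosen for illustration and are not results from the project. If none of the user's rated movies has a weight for the target movie, the sketch simply returns the target movie's average.

```python
def predict_item_based(user_ratings, movie_means, target, weights):
    """Predict one user's rating for `target` from movies they already rated.

    user_ratings: {movie: rating} for the user.
    movie_means:  {movie: average rating of that movie}.
    weights:      {movie: similarity weight to the target movie}.
    All three layouts are assumptions made for this sketch.
    """
    num = 0.0
    den = 0.0
    for other, rating in user_ratings.items():
        if other == target or other not in weights:
            continue
        num += weights[other] * (rating - movie_means[other])
        den += abs(weights[other])
    if den == 0:
        return movie_means[target]
    return movie_means[target] + num / den

# Made-up values for illustration only.
user_ratings = {"m1": 5.0, "m2": 2.0}
movie_means = {"m1": 4.0, "m2": 3.0, "m3": 3.5}
weights_to_m3 = {"m1": 0.5, "m2": -0.5}

prediction = predict_item_based(user_ratings, movie_means, "m3", weights_to_m3)
print(prediction)
```

The user liked the similar movie m1 more than average and liked the dissimilar movie m2 less than average, so both terms raise the estimate above the average rating of m3.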

The weighted relationships between movies, based on their similarity, are depicted in the following graph:

item-item-relationship-img

Matrix Factorization

Training MSE Test MSE