Abstract

Recommender systems are one of the most successful and widespread applications of machine learning in business. They are a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. One of the most famous success stories of recommender systems is the Netflix competition, launched in October 2006. In 2009, at the end of the challenge, Netflix awarded a one-million-dollar prize to a developer team for an algorithm that increased the accuracy of the company’s recommendation engine by 10%. Many well-known recommendation algorithms, such as latent factor models, were popularized by the Netflix contest, which has become notable for its numerous contributions to the data science community. Using the 10M version of the MovieLens data, and following some of the ML techniques that went into the winning solution for the Netflix competition, our findings suggest that a linear regression model with regularized movie and user effects, and a recommender engine using matrix factorization with stochastic gradient descent, lead to the smallest Root Mean Squared Errors (RMSE), 0.8648170 and 0.7826226 respectively.



I. Introduction

According to Aggarwal (2016), the recommendation problem may be formulated in various ways, the two main ones being as follows. The first, the “prediction version of the problem”, aims to predict the rating value for a user-item combination; it is also referred to as the “matrix completion problem”. The second, the “ranking version of the problem”, seeks to recommend the top-k items for a particular user, or to determine the top-k users to target for a particular item.

For the MovieLens project, we use the 10M version of the MovieLens dataset generated by the GroupLens research lab. We aim to create our own recommendation system using the “prediction version of the problem”. We train our algorithms on the inputs of the edx set to predict movie ratings in the validation set, and the RMSE is used to evaluate how close our predictions are to the true values in the validation set. The features are processed and analyzed with different machine learning techniques, focusing on regression models, ensemble methods (gradient boosting, random forest) and recommender engines (SlopeOne, matrix factorization with gradient descent), following the main steps of a data mining problem as described in Ricci et al. (2015).


II. Dataset and executive summary


1.Overview.


Loading libraries and data

library(tidyverse)
library(caret)
library(data.table)
library(kableExtra)
library(lubridate)
library(Matrix.utils)
library(DT)
library(wordcloud) 
library(RColorBrewer) 
library(ggthemes) 
library(irlba)
library(recommenderlab)
library(SlopeOne)
library(recosystem)
library(h2o)
load("EnvCapstone1.RData") #load output of Capstone_Movielens_data.R
edx set
class(edx)
 "data.frame"
glimpse(edx)
Observations: 9,000,055
Variables: 6
$ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ movieId   <dbl> 122, 185, 292, 316, 329, 355, 356, 362, 364, 370, 37...
$ rating    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ timestamp <int> 838985046, 838983525, 838983421, 838983392, 83898339...
$ title     <chr> "Boomerang (1992)", "Net, The (1995)", "Outbreak (19...
$ genres    <chr> "Comedy|Romance", "Action|Crime|Thriller", "Action|D...
validation set
class(validation)
 "data.frame"
glimpse(validation)
Observations: 999,999
Variables: 6
$ userId    <int> 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5...
$ movieId   <dbl> 231, 480, 586, 151, 858, 1544, 590, 4995, 34, 432, 4...
$ rating    <dbl> 5.0, 5.0, 5.0, 3.0, 2.0, 3.0, 3.5, 4.5, 5.0, 3.0, 3....
$ timestamp <int> 838983392, 838983653, 838984068, 868246450, 86824564...
$ title     <chr> "Dumb & Dumber (1994)", "Jurassic Park (1993)", "Hom...
$ genres    <chr> "Comedy", "Action|Adventure|Sci-Fi|Thriller", "Child...

After running the code provided in the project overview, we can see that the edx dataset is made of 6 features for a total of 9,000,055 observations. The validation set, which represents 10% of the 10M MovieLens dataset, contains the same features but with a total of 999,999 occurrences. We made sure that the userId and movieId values appearing in the validation set are also present in the edx set.
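A quick sanity check (a sketch, assuming the edx and validation objects are loaded as above) confirms this property:

# every user and movie in the validation set also appears in the edx set
all(validation$userId %in% edx$userId)    # expected: TRUE
all(validation$movieId %in% edx$movieId)  # expected: TRUE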

Each row represents a rating given by one user to one movie. The column “rating” is the outcome we want to predict, y. Taking into account both datasets, here are the features and their characteristics:

quantitative features

-userId: discrete, unique ID for the user.

-movieId: discrete, unique ID for the movie.

-timestamp: discrete, date and time the rating was given (seconds since January 1, 1970).

qualitative features

-title: nominal, movie title (not unique).

-genres: nominal, genres associated with the movie.

outcome, y

-rating: continuous, a rating between 0 and 5 for the movie.



2.Data exploration


summary(edx)
     userId         movieId          rating        timestamp        
 Min.   :    1   Min.   :    1   Min.   :0.500   Min.   :7.897e+08  
 1st Qu.:18124   1st Qu.:  648   1st Qu.:3.000   1st Qu.:9.468e+08  
 Median :35738   Median : 1834   Median :4.000   Median :1.035e+09  
 Mean   :35870   Mean   : 4122   Mean   :3.512   Mean   :1.033e+09  
 3rd Qu.:53607   3rd Qu.: 3626   3rd Qu.:4.000   3rd Qu.:1.127e+09  
 Max.   :71567   Max.   :65133   Max.   :5.000   Max.   :1.231e+09  
    title              genres         
 Length:9000055     Length:9000055    
 Class :character   Class :character  
 Mode  :character   Mode  :character  

outcome, y : rating

# I create a dataframe "explore_ratings" which contains half-star and whole-star ratings from the edx set:

group <- ifelse(edx$rating %in% c(1, 2, 3, 4, 5),
                "whole_star",
                "half_star")

explore_ratings <- data.frame(edx$rating, group)
# histogram of ratings

ggplot(explore_ratings, aes(x= edx.rating, fill = group)) +
  geom_histogram( binwidth = 0.2) +
  scale_x_continuous(breaks=seq(0, 5, by= 0.5)) +
  scale_fill_manual(values = c("half_star"="purple", "whole_star"="brown")) +
  labs(x="rating", y="number of ratings", caption = "source data: edx set") +
  ggtitle("histogram : number of ratings for each rating")

Exploring the ratings of the edx set, we notice the following facts:

+ no user gives 0 as a rating
+ the top 5 ratings from most to least common are: 4, 3, 5, 3.5 and 2.
+ the histogram shows that half-star ratings are less common than whole-star ratings.


qualitative features: genres, title

Now, we are going to explore the features “genres” and “title” of our edx set.

# I create a data frame top_genr which contains each single genre without combinations (e.g. Action and not "Action|Crime|Thriller", Comedy and not "Comedy|Romance"). In this way I can visualize which single genre has the most ratings. I plot a word cloud.

top_genr <- edx %>% separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

datatable(top_genr, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) ) %>%
  formatRound('count',digits=0, interval = 3, mark = ",")
layout(matrix(c(1,2), nrow =2) , heights = c(1,4))
par(mar=rep(0,4))
plot.new()
text(x=0.5,y=0.5, "top Genres by number of ratings")
wordcloud(words=top_genr$genres, freq=top_genr$count, min.freq=50,
          max.words=20, random.order=FALSE, random.color=FALSE,
          rot.per=0.35, colors=brewer.pal(8,"Dark2"), scale=c(5,.2),
          font=2)

We notice that the “Drama” genre has the highest number of ratings, followed by the “Comedy” and “Action” genres. There is a last category, where no genre is listed, which contains only 7 ratings.

# the data frame top_title contains the top 20 movies which count the major number of ratings

top_title <- edx %>%
  group_by(title) %>%
  summarize(count=n()) %>%
  top_n(20,count) %>%
  arrange(desc(count))

# with the head function I output the top 5

kable(head(edx %>%
     group_by(title,genres) %>%
     summarize(count=n()) %>%
     top_n(20,count) %>%
     arrange(desc(count)) ,
     5)) %>%
  kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
  column_spec(1,bold = T ) %>%
  column_spec(2,bold =T) %>%
  column_spec(3,bold=T)
title                              genres                             count
Pulp Fiction (1994)                Comedy|Crime|Drama                 31362
Forrest Gump (1994)                Comedy|Drama|Romance|War           31079
Silence of the Lambs, The (1991)   Crime|Horror|Thriller              30382
Jurassic Park (1993)               Action|Adventure|Sci-Fi|Thriller   29360
Shawshank Redemption, The (1994)   Drama                              28015
#bar chart of top_title

top_title %>% 
  ggplot(aes(x=reorder(title, count), y=count)) +
  geom_bar(stat='identity', fill="blue") + coord_flip(y=c(0, 40000)) +
  labs(x="", y="Number of ratings") +
  geom_text(aes(label= count), hjust=-0.1, size=3) +
  labs(title="Top 20 movies title based \n on number of ratings" , caption = "source data: edx set")

The inspection of the “title” attribute is consistent with the previous analysis. The movies with the highest number of ratings belong to the top genre categories: movies like Pulp Fiction (1994), Forrest Gump (1994) or Jurassic Park (1993), which are in the top 5 by number of ratings, are part of the Drama, Comedy or Action genres.

# I take the original column "genres" from the edx set, whatever combination appears in this column.
# I compute the average and standard error for each "genres" value, and plot these as error bar plots for genres with more than 100,000 ratings.

edx %>% group_by(genres) %>%
  summarize(n = n(), avg = mean(rating), se = sd(rating)/sqrt(n())) %>%
  filter(n >= 100000) %>% 
  mutate(genres = reorder(genres, avg)) %>%
  ggplot(aes(x = genres, y = avg, ymin = avg - 2*se, ymax = avg + 2*se)) + 
  geom_point() +
  geom_errorbar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "error bar plots by genres" , caption = "source data : edx set") +
  theme(
    panel.background = element_rect(fill = "lightblue",
                                    colour = "lightblue",
                                    size = 0.5, linetype = "solid"),
    panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                    colour = "white"), 
    panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                    colour = "white")
  )

We observe that the generated plot shows strong evidence of a genre effect.


quantitative features: userId, movieId, timestamp

edx %>%
  summarize(n_users = n_distinct(userId),
            n_movies = n_distinct(movieId))
  n_users n_movies
1   69878    10677

Even though each row represents a rating given by one user to one movie, the number of unique values is 69,878 for userId and 10,677 for movieId. Both userId and movieId, which are stored as integers, should presumably be treated as factors for some analysis purposes. This also means that there are fewer movies available for rating than users rating them. If we think in terms of a large matrix, with users on the rows and movies on the columns, a challenge we face is the sparsity of that matrix: it will contain many empty cells. Moreover, we face a curse of dimensionality problem. These issues will be treated in our further analysis.
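As a rough illustration of that sparsity, a one-line computation from the counts above gives the fraction of cells actually filled:

# share of the 69878 x 10677 user-movie matrix that holds a rating
9000055 / (69878 * 10677)  # ~0.012, i.e. only about 1.2% of cells are filled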

# histogram of number of ratings by movieId

edx %>% 
  count(movieId) %>% 
  ggplot(aes(n)) + 
  geom_histogram( bins=30, color = "red") +
  scale_x_log10() + 
  ggtitle("Movies") +
  labs(subtitle  ="number of ratings by movieId", 
       x="movieId" , 
       y="number of ratings", 
       caption ="source data : edx set") +
  theme(panel.border = element_rect(colour="black", fill=NA)) 

# histogram of number of ratings by userId


edx %>% 
  count(userId) %>% 
  ggplot(aes(n)) + 
  geom_histogram( bins=30, color = "gold") +
  scale_x_log10() + 
  ggtitle("Users") +
  labs(subtitle ="number of ratings by UserId", 
       x="userId" , 
       y="number of ratings") +
  theme(panel.border = element_rect(colour="black", fill=NA)) 

Visual exploration of the number of ratings by movieId on one hand, and of the number of ratings by userId on the other, shows the following relationships: some movies get rated more than others, and some users are more active than others at rating movies. This presumably explains the presence of movie effects and user effects.

# We know that the edx set contains the timestamp variable, which represents the date and time at which the rating was provided (the units are seconds since January 1, 1970). With the as_datetime function of the lubridate package, we can put each timestamp in the right format. I then use the point geom to create a scatterplot of y = average rating vs x = date, and the smooth geom to aid the eye in seeing patterns in the presence of overplotting.


edx %>% 
  mutate(date = round_date(as_datetime(timestamp), unit = "week")) %>%
  group_by(date) %>%
  summarize(rating = mean(rating)) %>%
  ggplot(aes(date, rating)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Timestamp, time unit : week")+
  labs(subtitle = "average ratings",
       caption = "source data : edx set")

Analyzing the trend of the average rating versus the date, we notice some evidence of a time effect in the plot, but it is not a strong effect.


III. Data Preprocessing

Real-life data typically needs to be preprocessed (e.g. cleansed, filtered, transformed) in order to be used by the machine learning techniques in the analysis step.

In this section, we will focus mainly on data preprocessing techniques that are of particular importance when designing a Recommender system. These techniques include similarity measures (such as Euclidean distance, Cosine distance,etc) , sampling methods , and dimensionality reduction (such as PCA or SVD) . We will use them when necessary.

First of all, we are going to create the rating matrix. In the Data exploration step, we already highlighted the sparsity problem that arises when considering a large matrix with users on the rows and movies on the columns. Let’s effectively build that matrix.


1.Data transformation

While trying to build our matrix, we found that, with the huge amount of data we have, the dcast and acast functions of the reshape2 and data.table packages are very time consuming and cannot allocate vectors larger than 2.8 GB. We therefore went further with the Matrix packages, Matrix and Matrix.utils, which contain the sparseMatrix function. The latter is less time consuming and deals more efficiently with the memory problem.

# As we said in the Data exploration step, userId and movieId should be treated as factors for some analysis purposes. To perform this transformation we make a copy of the edx set, since we want to keep our original training set (edx) unchanged.

edx.copy <- edx

edx.copy$userId <- as.factor(edx.copy$userId)
edx.copy$movieId <- as.factor(edx.copy$movieId)
# I use the sparseMatrix function; the output is a sparse matrix of class dgCMatrix.
# In order to use this function, I need to convert userId & movieId into numeric vectors.

edx.copy$userId <- as.numeric(edx.copy$userId)
edx.copy$movieId <- as.numeric(edx.copy$movieId)

sparse_ratings <- sparseMatrix(i = edx.copy$userId,
                         j = edx.copy$movieId ,
                         x = edx.copy$rating, 
                         dims = c(length(unique(edx.copy$userId)),
                                  length(unique(edx.copy$movieId))),  
                         dimnames = list(paste("u", 1:length(unique(edx.copy$userId)), sep = ""), 
                                        paste("m", 1:length(unique(edx.copy$movieId)), sep = "")))


# I can remove the copy created
rm(edx.copy)

# a look at the first 10 users and movies
sparse_ratings[1:10,1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
                         
u1  . .   . . . . . . . .
u2  . .   . . . . . . . .
u3  . .   . . . . . . . .
u4  . .   . . . . . . . .
u5  1 .   . . . . 3 . . .
u6  . .   . . . . . . . .
u7  . .   . . . . . . . .
u8  . 2.5 . . 3 4 . . . .
u9  . .   . . . . . . . .
u10 . .   . . . . 3 . . .
#Convert rating matrix into a recommenderlab sparse matrix
ratingMat <- new("realRatingMatrix", data = sparse_ratings)
ratingMat
69878 x 10677 rating matrix of class 'realRatingMatrix' with 9000055 ratings.


2.Similarity Measures

Most of the data mining techniques used for modelling the different recommender system approaches (collaborative filtering, content-based, hybrid methods) depend heavily on defining an appropriate similarity or distance measure.

To measure the similarity between users or between items, it is possible to use the following:

+ Minkowski Distance
+ Mahalanobis distance
+ Pearson correlation
+ Cosine similarity 

We will use the cosine similarity, a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. According to Ricci et al. (2015), it is defined as follows:

\[\cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}\]

where \(\cdot\) indicates vector dot product and ||x|| is the norm of vector x .
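As an illustration, here is a minimal sketch of this formula in R on two hypothetical rating vectors, restricted to co-rated items since most entries are missing:

# cosine similarity computed over the items both users rated
cosine_sim <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)   # keep co-rated items only
  sum(x[ok] * y[ok]) / (sqrt(sum(x[ok]^2)) * sqrt(sum(y[ok]^2)))
}
u1 <- c(5, 3, NA, 4)
u2 <- c(4, NA, 2, 5)
cosine_sim(u1, u2)  # (5*4 + 4*5) / (sqrt(41) * sqrt(41)) = 40/41 ~ 0.976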

The main advantages of using this similarity measure, as developed in Saranya et al. (2016), are:

  • It mitigates the problems of sparsity, scalability and cold start, and is more robust to noise.
  • It improves prediction accuracy and consistency.
  • It can still be calculated even though the matrix has many missing elements.
  • It still works well as the dimensional space becomes large.
  • Low computational complexity, especially for sparse vectors.


Because of the huge amount of data, we calculate the similarity on only the first 50 users for the visualization.

# I calculate the user similarity using the cosine similarity

similarity_users <- similarity(ratingMat[1:50,], 
                               method = "cosine", 
                               which = "users")

image(as.matrix(similarity_users), main = "User similarity")



#Using the same approach, I compute similarity between  movies.

similarity_movies <- similarity(ratingMat[,1:50], 
                               method = "cosine", 
                               which = "items")

image(as.matrix(similarity_movies), main = "Movies similarity")

In each matrix, every row and every column corresponds to a user (or to a movie, for the second matrix), and each cell corresponds to the similarity between a pair of them. The more red a cell is, the more similar the pair is. Note that the diagonal is red, since each user is compared with itself.

Observing the two similarity matrices suggests the following plausible analysis: since some pairs of users rate more similarly than others, and some pairs of movies are rated more similarly than others, there is evidence of group patterns among users and among movies.


3.Dimension reduction

The analysis of the previous similarity matrices suggests that there may exist users with similar rating patterns and movies with similar rating patterns. However, sparsity and the curse of dimensionality remain a recurrent problem, and we also have to deal with many NAs. Dimensionality reduction techniques such as PCA and SVD can help overcome these problems by transforming the original high-dimensional space into one of lower dimensionality.

To face the RAM memory problem, we use the irlba package, which is a fast and memory-efficient way to compute a partial SVD. The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest (or, optionally, smallest) singular values and corresponding singular vectors of a sparse or dense matrix, using a method of Baglama and Reichel.

According to Irizarry, R. 2018, Matrix factorization, GitHub page, accessed 15 January 2019, https://rafalab.github.io/dsbook/matrix-factorization.html, the SVD tells us that we can decompose an \(N \times P\) matrix \(Y\) with \(P < N\) as \[Y = UDV^T\] where

     + U : orthogonal matrix of dimensions N x m
     + D : diagonal matrix containing the singular values of the original matrix, m x m
     + V : orthogonal matrix of dimensions P x m (so that \(V^T\) is m x P)
set.seed(1)
Y <- irlba(sparse_ratings,tol=1e-4,verbose=TRUE,nv = 100, maxit = 1000)

# plot singular values

plot(Y$d, pch=20, col = "blue", cex = 1.5, xlab='Singular Value', ylab='Magnitude', 
     main = "Singular Values for User-Movie Matrix")

# calculate sum of squares of all singular values
all_sing_sq <- sum(Y$d^2)

# variability described by first 6, 12, and 20 singular values
first_six <- sum(Y$d[1:6]^2)
print(first_six/all_sing_sq)
 0.6187623
first_12 <- sum(Y$d[1:12]^2)
print(first_12/all_sing_sq)
 0.7052297
first_20 <- sum(Y$d[1:20]^2)
print(first_20/all_sing_sq)
 0.7646435
# cumulative share of the total sum of squared singular values
perc_vec <- cumsum(Y$d^2) / all_sing_sq

plot(perc_vec, pch=20, col = "blue", cex = 1.5, xlab='Singular Value', ylab='% of Sum of Squares of Singular Values', main = "Choosing k for Dimensionality Reduction")
lines(x = c(0,100), y = c(.90, .90))

The first six singular values explain more than half of the variability of the ratings matrix, with the first dozen explaining about 70% and the first twenty more than 75%. However, the goal is to identify the first k singular values whose squares sum to at least 90% of the total of the sums of squares of all the singular values. A plot of the running sum of squares shows that the 90% hurdle is reached somewhere between 50 and 60 singular values.

# To find the exact value of k, I count how many entries of the running sum of squares do not exceed 0.90.

k = length(perc_vec[perc_vec <= .90])
k
 55
#I get the decomposition of Y ; matrices U, D, and V accordingly:

U_k <- Y$u[, 1:k]
dim(U_k)
 69878    55
D_k <- Diagonal(x = Y$d[1:k])
dim(D_k)
 55 55
V_k <- t(Y$v)[1:k, ]
dim(V_k)
    55 10677

We notice that k = 55 retains 90% of the variability. As shown above, we now have a matrix \(D_k\) of size 55 x 55, a matrix \(U_k\) of size 69878 x 55, and a matrix \(V_k\) of size 55 x 10677. The total number of numeric values required to store these component matrices is \((69878 \times 55) + (55 \times 55) + (55 \times 10677) = 4{,}433{,}550\), an approximately 50.7% decrease in required storage relative to the original 9,000,055 ratings.
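As a sketch of how these factors reproduce the ratings, the rank-k approximation of any user's row can be formed directly from the three components computed above:

# rank-55 approximation of user u1's ratings row : U_k D_k V_k
approx_u1 <- U_k[1, , drop = FALSE] %*% D_k %*% V_k  # a 1 x 10677 row vector
dim(approx_u1)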

Even after reducing the dimensionality, the RAM memory problem persists. That’s why we needed to go further with another reduction technique: selecting the relevant data from the whole rating matrix.


4.Relevant Data

We know that some users saw more movies than others. So, instead of displaying some random users and movies, we should select the most relevant users and movies: we visualize only the users who have seen many movies and the movies that have been seen by many users. To identify and select them, we follow these steps:

  1. Determine the minimum number of movies per user.
  2. Determine the minimum number of users per movie.
  3. Select the users and movies matching these criteria.
#a.
min_n_movies <- quantile(rowCounts(ratingMat), 0.9)
print(min_n_movies)
90% 
301 
#b.
min_n_users <- quantile(colCounts(ratingMat), 0.9)
print(min_n_users)
   90% 
2150.2 
#c.
ratings_movies <- ratingMat[rowCounts(ratingMat) > min_n_movies,
                            colCounts(ratingMat) > min_n_users]
ratings_movies
6978 x 1068 rating matrix of class 'realRatingMatrix' with 2313148 ratings.

We can see that we now have a rating matrix of 6,978 distinct users (rows) x 1,068 distinct movies (columns), with 2,313,148 ratings.

The data preprocessing phase is usually not definitive: it requires a lot of attention and various subsequent explorations of the variables. It must be aimed at obtaining better predictive results, and in this sense the later model evaluation phases help us understand which preprocessing approaches are actually indispensable or useful for a specific model.


IV. Methods and Analysis

In this section, we explain the methodology behind the different machine learning algorithms we used, and present the metric for model performance evaluation.


1.Evaluated Algorithms

1.1 Regression Models

  • Modelling effects

As in Irizarry, R. 2018, Recommender systems, GitHub page, accessed 5 January 2019, https://rafalab.github.io/dsbook/recommendation-systems.html, we followed the same approach to build our linear regression models as the simplest possible recommendation systems. We started by assuming the same rating for all movies and users, with all the differences explained by random variation, \(Y_{u,i} = \mu + \varepsilon_{u,i}\), and then successively modelled the different effects.


movie effects: since we know that some movies are generally rated higher than others, we can augment our previous model by adding the term \(b_i\) to represent the average ranking for movie \(i\): \[Y_{u,i} = \mu + b_i + \varepsilon_{u,i} \quad (1)\] where:

° \(\mu\) is the “true” rating for all movies
° \(b_i\) is the movie-specific effect, or bias (“b” for bias)
° \(\varepsilon_{u,i}\) are independent errors sampled from the same distribution centered at 0


movie + user effects: we also know that some users are more active than others at rating movies. This implies that a further improvement to our model may be: \[Y_{u,i} = \mu + b_i + b_u + \varepsilon_{u,i} \quad (2)\] where:

° \(\mu\), \(b_i\), \(\varepsilon_{u,i}\) are defined as in \((1)\)
° \(b_u\) is the user-specific effect


movie + user + time effects: since the data exploration showed some evidence of a time effect, if we define \(d_{u,i}\) as the day of user \(u\)'s rating of movie \(i\), the new model is: \[Y_{u,i} = \mu + b_i + b_u + f(d_{u,i}) + \varepsilon_{u,i} \quad (3)\] with \(f\) a smooth function of \(d_{u,i}\). Further, detailed explanations of time effects are developed in Koren (2009). For convenience, in our study we just converted the timestamp to a datetime object and rounded it to the weekly unit.


movie + user + genre effects: the same approach with a genre effect leads to the following model: \[Y_{u,i} = \mu + b_i + b_u + \sum_{k=1}^K x^k_{u,i} \beta_k + \varepsilon_{u,i} \quad (4)\] where \(g_{u,i}\) is defined as the genre for user \(u\)'s rating of movie \(i\), and \(x^k_{u,i} = 1\) if \(g_{u,i}\) is genre \(k\). We just present this model here; we did not fit it in our analysis.
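For completeness, such a genre effect could be estimated in the same averaging style as the other effects. A hypothetical sketch (not run in our analysis), which simplifies model (4) by treating each genres combination as a single category, and reuses the mu, movie_avgs and user_avgs objects computed in the Results section:

# hypothetical genre effect b_g, averaged per genres combination
genre_avgs <- edx %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u))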

To fit these models, obtaining least squares estimates of \(\hat{b}_i\), \(\hat{b}_u\), \(\hat{b}_t\) with the lm() function of the stats package would be very slow, since there are more than 10,000 \(b_i\) effects and more than 60,000 \(b_u\) effects. For models (1), (2), (3) we instead compute the least squares estimates as follows:

° \(\hat{b}_i\) as the average of \(y_{u,i} - \hat{\mu}\) for each movie i and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i\) \((1')\)

° \(\hat{b}_u\) as the average of \(y_{u,i} - \hat{\mu} - \hat{b}_i\) and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i + \hat{b}_u\) \((2')\)

° \(\hat{b}_t\) as the average of \(y_{u,i} - \hat{\mu} - \hat{b}_i -\hat{b}_u\) and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i + \hat{b}_u + \hat{b}_t\) \((3')\)


  • Regularization

Regularization permits us to penalize large estimates that come from small sample sizes. It has commonalities with the Bayesian approach of shrinking predictions. The general idea is to add a penalty for large values of \(b_i\) and \(b_u\) to the sum of squares equation that we minimize, so that having many large \(b_i\) or \(b_u\) makes the objective harder to minimize.

A more accurate estimation of \(b_u\) and \(b_i\) treats them symmetrically, by solving the least squares problem \[\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2 + \lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right) \quad (5)\] where the first term, \(\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2\), strives to find \(b_u\)'s and \(b_i\)'s that fit the given ratings, while the regularizing term, \(\lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right)\), avoids overfitting by penalizing the magnitudes of the parameters. This least squares problem can be solved fairly efficiently by stochastic gradient descent, which is the subject of the matrix factorization recommender engine.

At this step, we used cross validation to pick the best \(\lambda\) (which corresponds to the minimum RMSE) and using calculus we can show that the values of \(b_i\) and \(b_u\) that minimize this equation are :

\[\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i} \sum_{u=1}^{n_i} \left(Y_{u,i} - \hat{\mu}\right)\]

\[\hat{b}_u(\lambda) = \frac{1}{\lambda + n_u} \sum_{i=1}^{n_u} \left(Y_{u,i} - \hat{\mu} - \hat{b}_i \right)\]

where \(n_i\) is the number of ratings made for movie \(i\), and \(n_u\) the number of ratings made by user \(u\).

When \(n_i\) is very large, which gives a stable estimate, \(\lambda\) is effectively ignored since \(n_i + \lambda \approx n_i\). However, when \(n_i\) is small, the estimate \(\hat{b}_i(\lambda)\) is shrunken towards 0. The larger \(\lambda\), the more we shrink.
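A toy computation (hypothetical numbers, \(\lambda = 5\), identical mean residual of 1) shows the shrinkage at work:

lambda <- 5
n <- c(2, 1000)            # a rarely rated vs a frequently rated movie
sum_resid <- n * 1.0       # sum of (Y_ui - mu_hat), mean residual of 1 in both cases
sum_resid / (n + lambda)   # 0.286 vs 0.995 : the small-sample estimate shrinks toward 0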


1.2 Recommender engines

  • Slope One

Slope One was introduced in a 2005 paper by Daniel Lemire and Anna Maclachlan, ‘Slope One Predictors for Online Rating-Based Collaborative Filtering’. It is one of the simplest ways to perform collaborative filtering based on item similarity, which makes it very easy to implement and use, while its accuracy is comparable to that of more complicated and resource-intensive algorithms. Slope One operates on the average differences of ratings between pairs of items and makes predictions based on their weighted values. More details are given in the paper.
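A toy example of the idea, with hypothetical data and base R only (not the SlopeOne package itself): the average deviation of item B over item A, computed on the users who rated both, is added to a new user's rating of A.

# users u1 and u2 rated both A and B; u3 rated only A
ratings <- data.frame(user = c("u1","u1","u2","u2","u3"),
                      item = c("A","B","A","B","A"),
                      rating = c(5, 3, 4, 2, 4))
both <- merge(subset(ratings, item == "A"), subset(ratings, item == "B"), by = "user")
dev_BA <- mean(both$rating.y - both$rating.x)               # ((3-5) + (2-4)) / 2 = -2
pred_u3_B <- subset(ratings, user == "u3")$rating + dev_BA  # 4 + (-2) = 2
pred_u3_B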

To perform the SlopeOne method, we need a specific pre-processing approach. We followed these steps:

  1. create copies of the edx and validation sets (edx.copy and valid.copy), retaining only the userId, movieId and rating columns.
  2. change the names and types of the variables (to characters) in the edx.copy and valid.copy datasets, to make them suitable for the SlopeOne package.
  3. the SlopeOne package works with data.table objects. It is pretty fast, but for a huge dataset it needs a lot of RAM. That's why we sample the edx.copy dataset: we create a data partition index (a random sampling without replacement, with p = 0.1) to produce a small training set.
  4. the remaining steps consist of normalizing the ratings, building the Slope One model, and finally making predictions on our validation set.


  • Matrix Factorization with Stochastic Gradient Descent

Matrix factorization is a popular technique for solving the recommender system problem. The main idea is to approximate the matrix \(R_{m\times n}\) by the product of two matrices of lower dimension, \(P_{k\times m}\) and \(Q_{k\times n}\), such that \[R\approx P'Q\] Let \(p_u\) be the u-th column of \(P\) and \(q_v\) the v-th column of \(Q\); then the rating given by user \(u\) on item \(v\) is predicted as \(p'_u q_v\). A typical solution for \(P\) and \(Q\) is given by the following regularization problem, as defined in Chin et al. (2015): \[\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]\] where \((u,v)\) are the locations of observed entries in \(R\), \(r_{u,v}\) is the observed rating, \(f\) is the loss function, and \(\mu_P\), \(\mu_Q\), \(\lambda_P\), \(\lambda_Q\) are penalty parameters to avoid overfitting.

Matrix P represents the latent factors of users: each k-element column of P represents one user, and each k-element column of Q represents one item. So, to predict the rating of item i by user u, we simply compute the product of two vectors: P[,u]’ x Q[,i]. Further descriptions of this technique and of the recosystem package are available in the package documentation.
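A tiny numeric illustration of the \(p'_u q_v\) prediction, with hypothetical 2-factor vectors:

k <- 2
P <- matrix(c(0.8, 1.1), nrow = k)  # latent factors of one user (one column per user)
Q <- matrix(c(1.0, 0.7), nrow = k)  # latent factors of one item (one column per item)
sum(P[, 1] * Q[, 1])                # predicted rating p'_u q_v : 0.8*1.0 + 1.1*0.7 = 1.57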

To perform our recommender system using parallel Matrix factorization with stochastic gradient descent, we followed the different steps:

  1. As in the SlopeOne method, we created identical copies of the edx and validation sets (edx.copy and valid.copy), selecting only the userId, movieId and rating columns. With the recosystem package, the training data file needs to be arranged in sparse matrix triplet form, i.e., each line of the file contains three numbers: “user_index”, “item_index”, “rating”.

  2. No RAM problem: unlike most other R packages for statistical modeling, which store the whole dataset and model object in memory, recosystem can significantly reduce memory use; for instance, the constructed model that contains the information for prediction can be stored on the hard disk, and the output can be written directly to a file rather than kept in memory. That's why we can simply use the whole edx.copy as training set (9,000,055 occurrences) and valid.copy as validation set (999,999 occurrences).

  3. Finally, we create a model object by calling Reco(), call the tune() method to select the best tuning parameters over a set of candidate values, train the model by calling the train() method, and use the predict() method to compute the predicted values.

In our study, before performing SlopeOne and matrix factorization, we also tried the recommender algorithms of the recommenderlab package (UBCF, IBCF, POPULAR). However, we will not develop them here, since they were not relevant for the evaluation through the root mean squared error (RMSE) on the validation set. Short descriptions of these methods are available in the recommenderlab package documentation.

1.3 Ensemble Methods

  • GBDT: Gradient Boosted Decision Trees

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

H2O’s Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):

Initialize \(f_{k0} = 0, k=1,2,…,K\)

For \(m=1\) to \(M\):

1.set \(p_{k}(x)=\frac{e^{f_{k}(x)}}{\sum_{l=1}^{K}e^{f_{l}(x)}},k=1,2,…,K\)

2.For \(k=1\) to \(K\):
a. compute \(r_{ikm}=y_{ik}-p_{k}(x_{i}),i=1,2,…,N\)
b. Fit a regression tree to the targets \(r_{ikm}, i=1,2,…,N\), giving terminal regions \(R_{jkm}, j=1,2,…,J_{m}\)
c. compute \(\gamma_{jkm}=\frac{K-1}{K} \frac{\sum_{x_{i} \in R_{jkm}} r_{ikm}}{\sum_{x_{i} \in R_{jkm}} |r_{ikm}|(1-|r_{ikm}|)}, j=1,2,…,J_m\)
d. Update \(f_{km}(x)=f_{k,m-1}(x)+\sum_{j=1}^{J_m}\gamma_{jkm} I(x\in R_{jkm})\)

Output \(\hat{f_{k}}(x)=f_{kM}(x),k=1,2,…,K\)


The h2o GBM function has many parameters. For example, nfolds, ntrees, max_depth and learn_rate, which control cross-validation, the maximum number of trees, the maximum tree depth and the learning rate, can help to overcome overfitting. GBM's parallel performance, in contrast, is strongly determined by the max_depth, nbins and nbins_cats parameters. In general, for datasets over 10 GB it makes sense to use 2 to 4 nodes; for datasets over 100 GB, over 10 nodes, and so on. Since GBDTs have a built-in ability to apply different methods to different slices of the data, we added some predictors that help the trees make useful clusterings: not only factor vectors of users and movies, but also the number of movies each user rated (n.movies_byUser) and the number of users that rated each movie (n.users_bymovie).

After creating a copy of the edx set with all features (and the same for the validation set), we added the n.movies_byUser and n.users_bymovie attributes with the mutate function, converted userId and movieId into factors, split the data into train and test sets, performed the GBM algorithm with some tuning of the parameters, and finally made predictions on the validation set.

The next steps consisted of building:

  • a random forest model on an H2OFrame (using the same training set as in the GBM method)
  • a super learner, or stacked ensemble, using the best of the previous GBM and random forest models.

For more details on the machine learning h2o algorithms (gradient boosting, random forest, etc.) and the h2o package, see the h2o algorithms documentation and the h2o package page, respectively.


2.Model performance evaluation

To assess our model performance , we seek to evaluate how close our predictions are to the true rating values in the validation set. For this, we take into account the Root Mean Square Error (RMSE).

To construct the RMSE, we first need to determine the residuals: the differences between the actual and the predicted values. We denote them by \(\hat{y}_{u,i} - y_{u,i}\), where \(y_{u,i}\) is the observed rating of user \(u\) for movie \(i\) and \(\hat{y}_{u,i}\) is the predicted value.

They can be positive or negative, as the predicted value under- or over-estimates the actual value. Squaring the residuals, averaging the squares, and taking the square root gives us the RMSE, which we use as a measure of the spread of the observed ratings around the predicted ones.

\[\mbox{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i}^{} \left( \hat{y}_{u,i} - y_{u,i} \right)^2 }\]

# I define the RMSE function as:
RMSE <- function(true_ratings, predicted_ratings){
    sqrt(mean((true_ratings - predicted_ratings)^2))
  }
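A quick sanity check on toy vectors:

RMSE(c(3, 4, 5), c(2.5, 4.0, 4.5))  # sqrt(mean(c(0.25, 0, 0.25))) ~ 0.408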


V. Results


1.Identifying the optimal model

1.1 Regression Models

#a.movie effect

# I calculate the average of all ratings of the edx set
mu <- mean(edx$rating)

# I calculate b_i on the training set
movie_avgs <- edx %>% 
  group_by(movieId) %>% 
  summarize(b_i = mean(rating - mu))

# predicted ratings
predicted_ratings_bi <- mu + validation %>% 
  left_join(movie_avgs, by='movieId') %>%
  .$b_i


#b.movie + user effect

# I calculate b_u using the training set
user_avgs <- edx %>%  
  left_join(movie_avgs, by='movieId') %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))

#predicted ratings
predicted_ratings_bu <- validation %>% 
  left_join(movie_avgs, by='movieId') %>%
  left_join(user_avgs, by='userId') %>%
  mutate(pred = mu + b_i + b_u) %>%
  .$pred


#c.movie + user + time effect

# I create a copy of the validation set, valid, and add the date feature: the timestamp converted to a datetime object and rounded by week.

valid <- validation
valid <- valid %>%
  mutate(date = round_date(as_datetime(timestamp), unit = "week")) 

# I calculate the time effects (b_t) using the training set
temp_avgs <- edx %>%
  left_join(movie_avgs, by='movieId') %>%
  left_join(user_avgs, by='userId') %>%
  mutate(date = round_date(as_datetime(timestamp), unit = "week")) %>%
  group_by(date) %>%
  summarize(b_t = mean(rating - mu - b_i - b_u))

# predicted ratings
  predicted_ratings_bt <- valid %>% 
  left_join(movie_avgs, by='movieId') %>%
  left_join(user_avgs, by='userId') %>%
  left_join(temp_avgs, by='date') %>%
  mutate(pred = mu + b_i + b_u + b_t) %>%
  .$pred

#d. I calculate the RMSE for the movie, user and time effects

rmse_model1 <- RMSE(validation$rating,predicted_ratings_bi)  
rmse_model1
 0.9439087
rmse_model2 <- RMSE(validation$rating,predicted_ratings_bu)
rmse_model2
 0.8653488
rmse_model3 <- RMSE(valid$rating,predicted_ratings_bt)
rmse_model3
 0.8652511

We notice that with the movie and user effects combined, the RMSE decreases by about 8.3% with respect to the movie effect alone. The further improvement from adding the time effect is not significant (a decrease of about 0.011%). We therefore perform regularization, as explained in the Methods and Analysis section, using only the movie and user effects.
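The relative improvements can be computed directly from the stored RMSE values:

# relative decrease in RMSE : movie+user vs movie-only, then adding the time effect
(rmse_model1 - rmse_model2) / rmse_model1 * 100  # ~8.3%
(rmse_model2 - rmse_model3) / rmse_model2 * 100  # ~0.011%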

# Before proceeding with regularization, I remove the copy of the validation set, "valid"
rm(valid)

#e. regularization 

# remembering (5), $\lambda$ is a tuning parameter. We can use cross-validation to choose it

lambdas <- seq(0, 10, 0.25)
  
  rmses <- sapply(lambdas, function(l){
    
    mu_reg <- mean(edx$rating)
    
    b_i_reg <- edx %>% 
      group_by(movieId) %>%
      summarize(b_i_reg = sum(rating - mu_reg)/(n()+l))
    
    b_u_reg <- edx %>% 
      left_join(b_i_reg, by="movieId") %>%
      group_by(userId) %>%
      summarize(b_u_reg = sum(rating - b_i_reg - mu_reg)/(n()+l))
    
    predicted_ratings_b_i_u <- 
      validation %>% 
      left_join(b_i_reg, by = "movieId") %>%
      left_join(b_u_reg, by = "userId") %>%
      mutate(pred = mu_reg + b_i_reg + b_u_reg) %>%
      .$pred
    
    return(RMSE(validation$rating,predicted_ratings_b_i_u))
  })
  
  
  qplot(lambdas, rmses)  

#For the full model, the optimal  λ is:
    
lambda <- lambdas[which.min(rmses)]
lambda
 5.25
rmse_model4 <- min(rmses)
rmse_model4
 0.864817
#summarize all the rmse on validation set for Linear regression models

rmse_results <- data.frame(methods=c("movie effect","movie + user effects","movie + user + time effects", "Regularized Movie + User Effect Model"),rmse = c(rmse_model1, rmse_model2,rmse_model3, rmse_model4))

kable(rmse_results) %>%
  kable_styling(bootstrap_options = "striped" , full_width = F , position = "center") %>%
  kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
  column_spec(1,bold = T ) %>%
  column_spec(2,bold =T ,color = "white" , background ="#D7261E")
methods                                      rmse
movie effect                            0.9439087
movie + user effects                    0.8653488
movie + user + time effects             0.8652511
Regularized Movie + User Effect Model   0.8648170

We observe that regularization brings the RMSE down to 0.8648170.


1.2 Recommender engines

# a. POPULAR , UBCF and IBCF algorithms of the recommenderlab package

model_pop <- Recommender(ratings_movies, method = "POPULAR", 
                      param=list(normalize = "center"))

#prediction example on the first 10 users
pred_pop <- predict(model_pop, ratings_movies[1:10], type="ratings")
as(pred_pop, "matrix")[,1:10]
           m1       m2       m3       m5       m6       m7       m9
u8   3.855564       NA 2.923282       NA       NA 3.092309 2.574400
u17        NA       NA 2.969625       NA 3.802000       NA 2.620742
u28  3.271469       NA       NA       NA 3.171562 2.508213       NA
u30        NA       NA       NA 2.792031       NA 3.109045 2.591136
u43  4.664153 3.756804 3.731871 3.583884 4.564246       NA 3.382989
u48        NA       NA       NA 3.448791 4.429153 3.765804 3.247895
u57        NA       NA 2.652900 2.504913 3.485275 2.821926 2.304018
u70  4.421321 3.513973 3.489039 3.341052 4.321415 3.658066 3.140157
u88        NA 3.038295 3.013362 2.865375 3.845737 3.182388 2.664480
u103       NA 2.777942 2.753008 2.605021       NA 2.922034 2.404126
          m10      m11      m14
u8   3.314650 3.459774 3.416521
u17        NA 3.506116 3.462863
u28  2.730555 2.875678 2.832426
u30        NA       NA 3.433257
u43  4.123238       NA 4.225109
u48        NA 4.133269 4.090016
u57  3.044268       NA 3.146139
u70  3.880407 4.025531 3.982278
u88  3.404729 3.549853 3.506600
u103 3.144375 3.289499 3.246247
#Calculation of rmse for popular method 
set.seed(1)
e <- evaluationScheme(ratings_movies, method="split", train=0.7, given=-5)
#5 ratings of 30% of users are excluded for testing

model_pop <- Recommender(getData(e, "train"), "POPULAR")

prediction_pop <- predict(model_pop, getData(e, "known"), type="ratings")

rmse_popular <- calcPredictionAccuracy(prediction_pop, getData(e, "unknown"))[1]
rmse_popular
     RMSE 
0.8482917 
#Estimating rmse for UBCF using Cosine similarity and selected n as 50 based on cross-validation
set.seed(1)
model <- Recommender(getData(e, "train"), method = "UBCF", 
                     param=list(normalize = "center", method="Cosine", nn=50))

prediction <- predict(model, getData(e, "known"), type="ratings")

rmse_ubcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ubcf
     RMSE 
0.8589153 
#Estimating rmse for IBCF using Cosine similarity and selected n as 350 based on cross-validation
set.seed(1)

model <- Recommender(getData(e, "train"), method = "IBCF", 
                     param=list(normalize = "center", method="Cosine", k=350))

prediction <- predict(model, getData(e, "known"), type="ratings")

rmse_ibcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ibcf
    RMSE 
0.963769 

The choices of nn in the UBCF method and k in the IBCF method were made by cross-validation. We observe that the POPULAR and user-based CF methods achieve a lower RMSE on the test set than item-based CF. However, these methods do not fit the scope of our study, since their RMSE evaluation is made on a test set obtained by partitioning ratings_movies: we want to predict ratings for all 999,999 rows of the validation set and then evaluate with the RMSE how close these predictions are to the true rating values. Moreover, the new data passed to the predict function must be of class "realRatingMatrix", and the validation set should not be modified.

We then moved to the SlopeOne and Matrix factorization methods.

#b. SlopeOne  

# I create copies of the training (edx) and validation sets where I retain only userId, movieId and rating
edx.copy <- edx %>%
            select(-c("genres","title","timestamp"))

valid.copy <- validation %>%
              select(-c("genres","title","timestamp"))

# I rename the columns and convert them to characters for the edx.copy and valid.copy sets: item_id corresponds to movieId

names(edx.copy) <- c("user_id", "item_id", "rating")

edx.copy <- data.table(edx.copy)

edx.copy[, user_id := as.character(user_id)]
edx.copy[, item_id := as.character(item_id)]


names(valid.copy) <- c("user_id", "item_id", "rating")

valid.copy <- data.table(valid.copy)

valid.copy[, user_id := as.character(user_id)]
valid.copy[, item_id := as.character(item_id)]


#setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference and is therefore very memory efficient.
setkey(edx.copy, user_id, item_id)
setkey(valid.copy, user_id, item_id)


# split the data to create a small training sample (to face the RAM memory issue)
set.seed(1)
idx <- createDataPartition(y = edx.copy$rating, times = 1, p = 0.1, list = FALSE)
edx.copy_train <- edx.copy[idx,]

#normalization
ratings_train_norm <- normalize_ratings(edx.copy_train)
# clear unused memory and increase the memory limit
invisible(gc())
invisible(memory.limit(size = 56000))

#Building Slope One model:
#knitr::knit_meta(class=NULL,clean=TRUE)
model <- build_slopeone(ratings_train_norm$ratings)
invisible(gc())

#Making predictions for validation set:
  
predictions <- predict_slopeone(model, 
                                  valid.copy[ , c(1, 2), with = FALSE], 
                                  ratings_train_norm$ratings)

unnormalized_predictions <- unnormalize_ratings(normalized = ratings_train_norm, 
                                                ratings = predictions)

rmse_slopeone <- RMSE( valid.copy$rating,unnormalized_predictions$predicted_rating) 

# I remove the created copies of the sets
rm(edx.copy, valid.copy, edx.copy_train)

rmse_slopeone
 1.029637
# Before performing the MF method, I clear unused memory
invisible(gc())

#c. Matrix Factorization with parallel stochastic gradient descent

# I create copies of the training (edx) and validation sets where I retain only the userId, movieId and rating features, and rename the three columns.

edx.copy <-  edx %>%
            select(-c("genres","title","timestamp"))

names(edx.copy) <- c("user", "item", "rating")


valid.copy <-  validation %>%
  select(-c("genres","title","timestamp"))

names(valid.copy) <- c("user", "item", "rating")


#as matrix
edx.copy <- as.matrix(edx.copy)
valid.copy <- as.matrix(valid.copy)


#write edx.copy and valid.copy tables on disk 
write.table(edx.copy , file = "trainset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
write.table(valid.copy, file = "validset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)


#  data_file(): specifies a data set from a file on the hard disk.
#  The files written above are read back from the working directory.

set.seed(123) # This is a randomized algorithm
train_set <- data_file("trainset.txt")
valid_set <- data_file("validset.txt")


#Next step is to build Recommender object
r = Reco()


# Matrix Factorization :  tuning training set

opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
                                     costp_l1 = 0, costq_l1 = 0,
                                     nthread = 1, niter = 10))
opts
## $min
## $min$dim
## [1] 30
## 
## $min$costp_l1
## [1] 0
## 
## $min$costp_l2
## [1] 0.01
## 
## $min$costq_l1
## [1] 0
## 
## $min$costq_l2
## [1] 0.1
## 
## $min$lrate
## [1] 0.1
## 
## $min$loss_fun
## [1] 0.797987
## 
## 
## $res
##    dim costp_l1 costp_l2 costq_l1 costq_l2 lrate  loss_fun
## 1   10        0     0.01        0     0.01   0.1 0.8255899
## 2   20        0     0.01        0     0.01   0.1 0.8070605
## 3   30        0     0.01        0     0.01   0.1 0.8157530
## 4   10        0     0.10        0     0.01   0.1 0.8279932
## 5   20        0     0.10        0     0.01   0.1 0.8019461
## 6   30        0     0.10        0     0.01   0.1 0.8016862
## 7   10        0     0.01        0     0.10   0.1 0.8284874
## 8   20        0     0.01        0     0.10   0.1 0.8018076
## 9   30        0     0.01        0     0.10   0.1 0.7979870
## 10  10        0     0.10        0     0.10   0.1 0.8372085
## 11  20        0     0.10        0     0.10   0.1 0.8315828
## 12  30        0     0.10        0     0.10   0.1 0.8274714
## 13  10        0     0.01        0     0.01   0.2 0.8162612
## 14  20        0     0.01        0     0.01   0.2 0.9162043
## 15  30        0     0.01        0     0.01   0.2 0.8809001
## 16  10        0     0.10        0     0.01   0.2 0.8173864
## 17  20        0     0.10        0     0.01   0.2 0.8599332
## 18  30        0     0.10        0     0.01   0.2 0.9107871
## 19  10        0     0.01        0     0.10   0.2 0.8222024
## 20  20        0     0.01        0     0.10   0.2 0.8004464
## 21  30        0     0.01        0     0.10   0.2 0.7995085
## 22  10        0     0.10        0     0.10   0.2 0.8410999
## 23  20        0     0.10        0     0.10   0.2 0.8264263
## 24  30        0     0.10        0     0.10   0.2 0.8259178
# Matrix Factorization : trains the recommender model

r$train(train_set, opts = c(opts$min, nthread = 1, niter = 20))
## iter      tr_rmse          obj
##    0       0.9737  1.1979e+007
##    1       0.8708  9.7956e+006
##    2       0.8382  9.0926e+006
##    3       0.8175  8.6875e+006
##    4       0.8021  8.4172e+006
##    5       0.7900  8.2171e+006
##    6       0.7801  8.0667e+006
##    7       0.7719  7.9424e+006
##    8       0.7650  7.8477e+006
##    9       0.7590  7.7642e+006
##   10       0.7539  7.7000e+006
##   11       0.7493  7.6431e+006
##   12       0.7451  7.5922e+006
##   13       0.7413  7.5483e+006
##   14       0.7379  7.5110e+006
##   15       0.7347  7.4773e+006
##   16       0.7318  7.4464e+006
##   17       0.7290  7.4151e+006
##   18       0.7266  7.3907e+006
##   19       0.7243  7.3692e+006
#Making prediction on validation set and calculating RMSE:

pred_file = tempfile()

r$predict(valid_set, out_file(pred_file))  
## prediction output generated at C:\Users\LD493E~1.MON\AppData\Local\Temp\Rtmp2nldyL\file18505d985386
#Matrix Factorization : show first 10 predicted values
print(scan(pred_file, n = 10))
##  [1] 4.26273 5.17404 4.73851 3.44047 4.45806 2.81567 4.10844 4.34959
##  [9] 4.25262 3.45201
#valid_set
scores_real <- read.table("validset.txt", header = FALSE, sep = " ")$V3
scores_pred <- scan(pred_file)

#remove copies of training and validation sets
rm(edx.copy, valid.copy)

rmse_mf <- RMSE(scores_real,scores_pred)
rmse_mf  
## [1] 0.7826226
#summarize all the rmse for recommender algorithms

rmse_results <- data.frame(methods=c("SlopeOne","Matrix factorization with GD"),rmse = c(rmse_slopeone, rmse_mf))

kable(rmse_results) %>%
  kable_styling(bootstrap_options = "striped" , full_width = F , position = "center") %>%
  kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
  column_spec(1,bold = T ) %>%
  column_spec(2,bold = T ,color = "white" , background ="#D7261E")
methods                             rmse
SlopeOne                       1.0296368
Matrix factorization with GD   0.7826226

Performing the matrix factorization with SGD recommender method, we achieve an RMSE well below 0.8. Computing the percentage decrease in RMSE between the two methods, the MF method shows a decrease of about 24% with respect to the SlopeOne method. Note that the high RMSE value for the SlopeOne method is also partly due to the small training set we used to face the RAM problem (see the Methods and Analysis section, SlopeOne method). With a larger training set, we would give the algorithm a better chance of learning the patterns in the data rather than memorizing specific examples, which reduces overfitting.
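Again, the percentage decrease follows directly from the stored values:

# relative decrease in RMSE from SlopeOne to matrix factorization
(rmse_slopeone - rmse_mf) / rmse_slopeone * 100  # ~24%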


1.3 Ensemble Methods

#knitr::knit_meta(class=NULL,clean=TRUE)
# clear unused memory
invisible(gc())


# I create a copy of the edx set where I retain all the features
edx.copy <- edx

# I create the new columns n.movies_byUser (number of movies each user rated) and n.users_bymovie (number of users that rated each movie), as defined in the Methods and Analysis section
edx.copy <- edx.copy %>%
            group_by(userId) %>%
            mutate(n.movies_byUser = n())

edx.copy <- edx.copy %>%
  group_by(movieId) %>%
  mutate(n.users_bymovie = n())

# factor vectors of users and movies
edx.copy$userId <-  as.factor(edx.copy$userId)
edx.copy$movieId <- as.factor(edx.copy$movieId)

# I do the same for the validation set
valida.copy <- validation  

valida.copy <- valida.copy %>%
  group_by(userId) %>%
  mutate(n.movies_byUser = n())

valida.copy <- valida.copy %>%
  group_by(movieId) %>%
  mutate(n.users_bymovie = n())

valida.copy$userId  <- as.factor(valida.copy$userId)
valida.copy$movieId <- as.factor(valida.copy$movieId)
# Attempt to start and/or connect to an H2O instance
h2o.init(
 nthreads=-1,           ## -1: use all available threads         
 max_mem_size = "10G")  ## specify the memory size for the H2O cloud
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 days 19 hours 
##     H2O cluster timezone:       Europe/Berlin 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    3 months and 16 days !!! 
##     H2O cluster name:           H2O_started_from_R_LD.MonoTsango_huk890 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   4.60 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.2 (2017-09-28)
#Remove all prior Clusters - remove in-memory objects to free up space
h2o.removeAll()
## [1] 0
#partitioning 
splits <- h2o.splitFrame(as.h2o(edx.copy), 
                         ratios = 0.7, 
                         seed = 1) 
train <- splits[[1]]
test <- splits[[2]]
# clear unused memory
invisible(gc())

#remove progress bar
h2o.no_progress()

# first gbm model: defaults ntrees = 50, max_depth = 5, learn_rate = 0.1, plus nfolds = 3
gbdt_1 <- h2o.gbm( x = c("movieId","userId","n.movies_byUser","n.users_bymovie") ,
           y = "rating" , 
           training_frame = train , 
           nfolds = 3)

summary(gbdt_1)
Model Details:
==============

H2ORegressionModel: gbm
Model Key:  GBM_model_R_1554985446584_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50               40499         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000         31         32    31.96000

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.993987
RMSE:  0.9969889
MAE:  0.7947012
RMSLE:  0.2649048
Mean Residual Deviance :  0.993987



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.9881614
RMSE:  0.9940631
MAE:  0.792184
RMSLE:  0.2641577
Mean Residual Deviance :  0.9881614


Cross-Validation Metrics Summary: 
                             mean           sd cv_1_valid  cv_2_valid
mae                    0.79218405 2.8220305E-4 0.79227304  0.79262227
mean_residual_deviance  0.9881618  8.733483E-4  0.9878371  0.98981047
mse                     0.9881618  8.733483E-4  0.9878371  0.98981047
r2                     0.12153444 4.4647796E-4 0.12158577 0.120736726
residual_deviance       0.9881618  8.733483E-4  0.9878371  0.98981047
rmse                    0.9940631 4.3923056E-4 0.99389994   0.9948922
rmsle                   0.2641576 1.9758147E-4 0.26409227  0.26452777
                        cv_3_valid
mae                      0.7916569
mean_residual_deviance   0.9868378
mse                      0.9868378
r2                     0.122280814
residual_deviance        0.9868378
rmse                     0.9933971
rmsle                   0.26385275

Scoring History: 
             timestamp          duration number_of_trees training_rmse
1  2019-04-14 09:54:37  6 min 34.245 sec               0       1.06060
2  2019-04-14 09:54:39  6 min 36.048 sec               1       1.05347
3  2019-04-14 09:54:40  6 min 37.800 sec               2       1.04763
4  2019-04-14 09:54:45  6 min 41.968 sec               5       1.03557
5  2019-04-14 09:54:49  6 min 45.983 sec               8       1.02759
6  2019-04-14 09:54:53  6 min 50.112 sec              11       1.02167
7  2019-04-14 09:54:57  6 min 54.125 sec              14       1.01835
8  2019-04-14 09:55:02  6 min 59.305 sec              18       1.01422
9  2019-04-14 09:55:06  7 min  3.396 sec              21       1.01137
10 2019-04-14 09:55:10  7 min  7.479 sec              24       1.00976
11 2019-04-14 09:55:14  7 min 11.499 sec              27       1.00751
12 2019-04-14 09:55:18  7 min 15.503 sec              30       1.00611
13 2019-04-14 09:55:22  7 min 19.577 sec              33       1.00428
14 2019-04-14 09:55:27  7 min 24.751 sec              37       1.00208
15 2019-04-14 09:55:33  7 min 29.953 sec              41       1.00027
16 2019-04-14 09:55:38  7 min 35.106 sec              45       0.99838
17 2019-04-14 09:55:43  7 min 40.216 sec              49       0.99729
18 2019-04-14 09:55:44  7 min 41.798 sec              50       0.99699
   training_mae training_deviance
1       0.85569           1.12487
2       0.85010           1.10979
3       0.84584           1.09752
4       0.83570           1.07241
5       0.82751           1.05595
6       0.82074           1.04380
7       0.81653           1.03703
8       0.81179           1.02865
9       0.80872           1.02287
10      0.80711           1.01962
11      0.80496           1.01509
12      0.80364           1.01225
13      0.80189           1.00858
14      0.79983           1.00416
15      0.79803           1.00054
16      0.79612           0.99675
17      0.79498           0.99458
18      0.79470           0.99399

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
         variable relative_importance scaled_importance percentage
1         movieId      2432895.250000          1.000000   0.560556
2 n.users_bymovie      1524796.250000          0.626741   0.351323
3 n.movies_byUser       342112.750000          0.140620   0.078825
4          userId        40345.531250          0.016583   0.009296
#second gbm model: ntrees = 100, max_depth = 5, learn_rate = 0.1, nfolds = 5
# gbdt_2 <- h2o.gbm( x = c("movieId","userId") ,
#            y = "rating" , 
#            training_frame = train , 
#            ntrees = 100,
#            nfolds = 5) 
# 
# summary(gbdt_2)

#third gbm model: ntrees = 50, max_depth = 5, learn_rate = 0.1 (defaults), nfolds = 3
gbdt_3 <- h2o.gbm( x = c("movieId","userId") ,
           y = "rating" , 
           training_frame = train , 
           nfolds = 3,
           seed=1,
           keep_cross_validation_predictions = TRUE,
           fold_assignment = "Random") 


summary(gbdt_3)
Model Details:
==============

H2ORegressionModel: gbm
Model Key:  GBM_model_R_1554985446584_6 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50               75964         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000         32         32    32.00000

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.9663329
RMSE:  0.9830223
MAE:  0.7816442
RMSLE:  0.2616349
Mean Residual Deviance :  0.9663329



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.9742736
RMSE:  0.987053
MAE:  0.7853317
RMSLE:  0.2625142
Mean Residual Deviance :  0.9742736


Cross-Validation Metrics Summary: 
                             mean           sd cv_1_valid cv_2_valid
mae                     0.7853309 9.0927485E-4 0.78488976 0.78707933
mean_residual_deviance  0.9742714 0.0023254724  0.9733581  0.9786775
mse                     0.9742714 0.0023254724  0.9733581  0.9786775
r2                     0.13388452  0.001791772 0.13440077 0.13055533
residual_deviance       0.9742714 0.0023254724  0.9733581  0.9786775
rmse                    0.9870505 0.0011775987 0.98658913  0.9892813
rmsle                  0.26251355  3.044451E-4  0.2623934 0.26309055
                       cv_3_valid
mae                     0.7840236
mean_residual_deviance  0.9707787
mse                     0.9707787
r2                     0.13669747
residual_deviance       0.9707787
rmse                     0.985281
rmsle                  0.26205668

Scoring History: 
             timestamp          duration number_of_trees training_rmse
1  2019-04-14 10:01:40  4 min 50.030 sec               0       1.06060
2  2019-04-14 10:01:42  4 min 51.451 sec               1       1.05580
3  2019-04-14 10:01:43  4 min 52.718 sec               2       1.05189
4  2019-04-14 10:01:49  4 min 58.693 sec               3       1.04845
5  2019-04-14 10:01:54  5 min  3.532 sec               8       1.03584
6  2019-04-14 10:01:59  5 min  8.351 sec              13       1.02770
7  2019-04-14 10:02:03  5 min 12.484 sec              16       1.02267
8  2019-04-14 10:02:08  5 min 17.282 sec              21       1.01622
9  2019-04-14 10:03:58  7 min  7.633 sec              24       1.01206
10 2019-04-14 10:04:14  7 min 24.026 sec              36       0.99583
11 2019-04-14 10:04:19  7 min 29.111 sec              41       0.99139
12 2019-04-14 10:04:24  7 min 34.045 sec              45       0.98676
13 2019-04-14 10:04:29  7 min 38.826 sec              49       0.98375
14 2019-04-14 10:04:30  7 min 40.141 sec              50       0.98302
   training_mae training_deviance
1       0.85569           1.12487
2       0.85150           1.11471
3       0.84810           1.10647
4       0.84494           1.09925
5       0.83230           1.07297
6       0.82359           1.05616
7       0.81876           1.04585
8       0.81289           1.03270
9       0.80898           1.02426
10      0.79409           0.99168
11      0.78988           0.98286
12      0.78531           0.97370
13      0.78228           0.96777
14      0.78164           0.96633

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
  variable relative_importance scaled_importance percentage
1  movieId      5096206.500000          1.000000   0.969387
2   userId       160936.781250          0.031580   0.030613
# Since the model gbdt_3 has the lowest RMSE on the training set,
# evaluate its performance on the test set
h2o.performance(gbdt_3, test)
H2ORegressionMetrics: gbm

MSE:  0.9659998
RMSE:  0.9828529
MAE:  0.7816576
RMSLE:  0.261309
Mean Residual Deviance :  0.9659998
# Predict ratings on the validation set and evaluate the RMSE
pred.ratings.gbdt_3 <- h2o.predict(gbdt_3,as.h2o(valida.copy))

rmse_gbdt <- RMSE(pred.ratings.gbdt_3, as.h2o(valida.copy$rating))
rmse_gbdt
 0.9839035
#clear unused memory
invisible(gc())
#remove progress bar
h2o.no_progress()

# first rf model: ntrees = 50, max_depth = 20
rf1 <- h2o.randomForest(        
  training_frame = train,       
  x= c("movieId" ,"userId" ,"timestamp", "n.movies_byUser","n.users_bymovie"),                  
  y= "rating",                         
  ntrees = 50,                 
  max_depth = 20
)

summary(rf1)
Model Details:
==============

H2ORegressionModel: drf
Model Key:  DRF_model_R_1554985446584_7 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50           106340480        20
  max_depth mean_depth min_leaves max_leaves  mean_leaves
1        20   20.00000      72728     140247 105495.26000

H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.9361753
RMSE:  0.9675615
MAE:  0.7676068
RMSLE:  0.2578431
Mean Residual Deviance :  0.9361753





Scoring History: 
            timestamp          duration number_of_trees training_rmse
1 2019-04-14 10:13:07         0.009 sec               0            NA
2 2019-04-14 10:13:32        24.497 sec               1       1.02082
3 2019-04-14 10:13:43        35.832 sec               2       1.01305
4 2019-04-14 10:14:06        58.883 sec               3       1.00826
5 2019-04-14 10:14:15  1 min  7.689 sec               4       1.00373
  training_mae training_deviance
1           NA                NA
2      0.80340           1.04206
3      0.79847           1.02626
4      0.79546           1.01659
5      0.79269           1.00748

---
             timestamp          duration number_of_trees training_rmse
46 2019-04-14 10:20:55  7 min 48.304 sec              45       0.96815
47 2019-04-14 10:21:03  7 min 56.113 sec              46       0.96768
48 2019-04-14 10:21:11  8 min  3.654 sec              47       0.96775
49 2019-04-14 10:21:19  8 min 11.956 sec              48       0.96770
50 2019-04-14 10:21:27  8 min 20.243 sec              49       0.96763
51 2019-04-14 10:21:36  8 min 28.563 sec              50       0.96756
   training_mae training_deviance
46      0.76811           0.93731
47      0.76770           0.93640
48      0.76777           0.93654
49      0.76773           0.93645
50      0.76767           0.93630
51      0.76761           0.93618

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
         variable relative_importance scaled_importance percentage
1         movieId     16745293.000000          1.000000   0.381465
2 n.users_bymovie     11316529.000000          0.675804   0.257795
3 n.movies_byUser      6366762.500000          0.380212   0.145038
4          userId      4925214.000000          0.294125   0.112199
5       timestamp      4543522.500000          0.271331   0.103503
# second rf model: ntrees = 50, max_depth = 20, nfolds = 5
# rf2 <- h2o.randomForest(        
#   training_frame = train,       
#   x= c("movieId" ,"userId", "n.movies_byUser","n.users_bymovie"),                      
#   y= "rating",                         
#   ntrees = 50,                 
#   max_depth = 20,
#   nfolds = 5
#   )
# 
# summary(rf2)

# third rf model: ntrees = 50, max_depth = 20 (defaults), nfolds = 3
rf_3 <- h2o.randomForest(        
  training_frame = train,       
  x= c("movieId" ,"userId"),                      
  y= "rating", 
  nfolds=3,
  seed=1,
  keep_cross_validation_predictions = TRUE,
  fold_assignment = "Random"
)

summary(rf_3)
Model Details:
==============

H2ORegressionModel: drf
Model Key:  DRF_model_R_1554985446584_8 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50           182067312        20
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        20   20.00000      40110      75349 58403.06000

H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.9120648
RMSE:  0.9550208
MAE:  0.7518812
RMSLE:  0.2540035
Mean Residual Deviance :  0.9120648



H2ORegressionMetrics: drf
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.9133628
RMSE:  0.9557002
MAE:  0.7524639
RMSLE:  0.2542855
Mean Residual Deviance :  0.9133628


Cross-Validation Metrics Summary: 
                             mean           sd cv_1_valid cv_2_valid
mae                      0.752463 5.8236887E-4  0.7531027  0.7529861
mean_residual_deviance 0.91336036 0.0016835167  0.9150332 0.91505456
mse                    0.91336036 0.0016835167  0.9150332 0.91505456
r2                     0.18803298 0.0013799736 0.18626863 0.18707715
residual_deviance      0.91336036 0.0016835167  0.9150332 0.91505456
rmse                    0.9556981 8.8118477E-4 0.95657367  0.9565849
rmsle                  0.25428495  2.328422E-4 0.25444627  0.2545826
                       cv_3_valid
mae                     0.7513002
mean_residual_deviance 0.90999335
mse                    0.90999335
r2                     0.19075316
residual_deviance      0.90999335
rmse                   0.95393574
rmsle                  0.25382596

Scoring History: 
            timestamp          duration number_of_trees training_rmse
1 2019-04-14 10:47:28 24 min 43.072 sec               0            NA
2 2019-04-14 10:47:42 24 min 57.091 sec               1       1.00942
3 2019-04-14 10:47:49 25 min  3.508 sec               2       1.00699
4 2019-04-14 10:47:55 25 min 10.047 sec               3       1.00384
5 2019-04-14 10:48:02 25 min 16.348 sec               4       0.99281
  training_mae training_deviance
1           NA                NA
2      0.79478           1.01893
3      0.79059           1.01402
4      0.78826           1.00769
5      0.77935           0.98566

---
             timestamp          duration number_of_trees training_rmse
46 2019-04-14 10:52:05 29 min 19.560 sec              45       0.95540
47 2019-04-14 10:52:10 29 min 24.845 sec              46       0.95528
48 2019-04-14 10:52:16 29 min 30.937 sec              47       0.95495
49 2019-04-14 10:52:22 29 min 36.816 sec              48       0.95493
50 2019-04-14 10:52:28 29 min 42.346 sec              49       0.95492
51 2019-04-14 10:52:33 29 min 47.573 sec              50       0.95502
   training_mae training_deviance
46      0.75219           0.91279
47      0.75206           0.91256
48      0.75179           0.91194
49      0.75176           0.91188
50      0.75178           0.91186
51      0.75188           0.91206

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
  variable relative_importance scaled_importance percentage
1  movieId     38668052.000000          1.000000   0.716631
2   userId     15290050.000000          0.395418   0.283369
# Since the model rf_3 has the lowest RMSE on the training set,
# evaluate its performance on the test set
h2o.performance(rf_3, test)
H2ORegressionMetrics: drf

MSE:  0.9087191
RMSE:  0.9532676
MAE:  0.7507651
RMSLE:  0.2534143
Mean Residual Deviance :  0.9087191
# Predict ratings on the validation set and evaluate the RMSE
pred.ratings.rf_3 <- h2o.predict(rf_3,as.h2o(valida.copy))

rmse_rf <- RMSE(pred.ratings.rf_3, as.h2o(valida.copy$rating))
rmse_rf
 0.9543947
# Stacked ensemble: combine the two best previous models (gbdt_3 and rf_3); both
# were trained with the same nfolds, seed and fold_assignment, and with
# keep_cross_validation_predictions = TRUE, as h2o.stackedEnsemble requires

ensemble <- h2o.stackedEnsemble(x = c("movieId" ,"userId"),
                                y = "rating",
                                training_frame = train,
                                model_id = "my_ensemble_auto",
                                base_models = list(gbdt_3@model_id, rf_3@model_id))

# Predict ratings on the validation set and evaluate the RMSE
pred.ratings.ensemble <- h2o.predict(ensemble,as.h2o(valida.copy))

rmse_ensemble <- RMSE(pred.ratings.ensemble, as.h2o(valida.copy$rating))
rmse_ensemble
 0.9493815
rmse_results <- data.frame(methods=c("gradient Boosting","random forest","stacked ensemble"),rmse = c(rmse_gbdt, rmse_rf, rmse_ensemble))

kable(rmse_results) %>%
  kable_styling(bootstrap_options = "striped" , full_width = F , position = "center") %>%
  kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
  column_spec(1,bold = T ) %>%
  column_spec(2,bold = T ,color = "white" , background ="#D7261E")
methods               rmse
gradient boosting     0.9839035
random forest         0.9543947
stacked ensemble      0.9493815
#remove objects
rm(edx.copy,valida.copy)
# Shut down the H2O cluster
h2o.shutdown()
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

We observed that the best GBM and RF models (gbdt_3 and rf_3) had the lowest root mean squared errors on the training set and produced RMSE values on the validation set of 0.9839035 and 0.9543947, respectively. For the GBM model, this was achieved with 50 trees of maximum depth 5, 3-fold cross-validation, and only the movieId and userId (factor) features. For the RF model, this was achieved with 3-fold cross-validation, training 50 trees of maximum depth 20 on the movieId and userId (factor) features. However, training is slower for RF than for GBM: RF trees can grow to depth 20, which can lead to up to 1+2+4+8+…+2^19 ~ 1M nodes to split, and for every one of them mtries = sqrt(4600) ~ 67 columns need to be considered for splitting; GBM trees only grow to depth 5, so only 1+2+4+8+16 = 31 nodes are split, although all 4600 columns are considered for each. The RMSE values on the validation set for the GBM, RF and stacked ensemble models are summarized in the table above.
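The arithmetic behind this depth comparison can be checked in a few lines (a sketch; the 4600-column figure is the expanded-feature count quoted above):

# Potential node splits for a full binary tree of depth d: sum(2^(0:(d-1)))
sum(2^(0:19))        # 1048575, about 1M potential splits at depth 20 (RF)
sum(2^(0:4))         # 31 potential splits at depth 5 (GBM)
floor(sqrt(4600))    # 67 columns sampled per split with mtries = sqrt(p)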


1.4 Increasing model performance

Based on the different RMSE values we found for the linear regression models, recommender algorithms and ensemble models, we chose the two candidates that best predict ratings on the validation set: the linear regression model with regularized movie + user effects (i) and Matrix factorization with stochastic gradient descent (ii).
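As a quick recap, the candidate RMSE values can be gathered into one table. A minimal sketch, using the values reported in the abstract and in the ensemble table above:

# Collect the candidate RMSE values and sort them
candidates <- data.frame(
  method = c("Regularized movie + user effects", "MF with SGD",
             "gradient boosting", "random forest", "stacked ensemble"),
  rmse   = c(0.8648170, 0.7826226, 0.9839035, 0.9543947, 0.9493815))
candidates[order(candidates$rmse), ]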

# Create copies of the training (edx) and validation sets, retaining only the
# userId, movieId and rating features, and rename the three columns

edx.copy <-  edx %>%
            select(-c("genres","title","timestamp"))

names(edx.copy) <- c("user", "item", "rating")


valid.copy <-  validation %>%
  select(-c("genres","title","timestamp"))

names(valid.copy) <- c("user", "item", "rating")


# Convert to matrices
edx.copy <- as.matrix(edx.copy)
valid.copy <- as.matrix(valid.copy)


# Write the edx.copy and valid.copy tables to disk
write.table(edx.copy , file = "trainset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
write.table(valid.copy, file = "validset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)


# data_file() specifies a data set from a file on the hard disk; the files
# were written to the working directory above, so we point to them there

set.seed(123) # This is a randomized algorithm
train_set <- data_file("trainset.txt")
valid_set <- data_file("validset.txt")


# Next, build the recommender object
r = Reco()


#Optimizing/tuning the recommender model
opts <- r$tune(train_set , opts = list(dim = c(1:20), lrate = c(0.05),
                                            nthread = 4 , costp_l1=0, 
                                            costq_l1 = 0,
                                            niter = 50, nfold = 10,
                                            verbose = FALSE))

# Train the recommender model with the tuned parameters
r$train(train_set, opts = c(opts$min , nthread = 4, niter = 100,verbose=FALSE))


# Make predictions on the validation set and calculate the RMSE:
pred_file = tempfile()

r$predict(valid_set, out_file(pred_file))  
prediction output generated at C:\Users\LD493E~1.MON\AppData\Local\Temp\Rtmp2nldyL\file1850377a7d83
#Matrix Factorization : show first 10 predicted values
print(scan(pred_file, n = 10))
 4.63566 4.97483 4.98619 4.08538 4.71547 2.88960 4.14103 4.39614
 4.65707 3.52472
# Read the true ratings from the validation file
scores_real <- read.table("validset.txt", header = FALSE, sep = " ")$V3
scores_pred <- scan(pred_file)

# remove edx.copy and valid.copy objects
rm(edx.copy, valid.copy)


rmse_mf_opt <- RMSE(scores_real,scores_pred)

rmse_mf_opt  
 0.7887485
load("EnvCapstone_MatrixFacto_trainRmse.RData")

iter.line <- 15
tr_rmse.line <- mat.facto_rmse$tr_rmse[which(mat.facto_rmse$iter==15)]


mat.facto_rmse %>% 
  ggplot(aes(x=iter, y = tr_rmse))+
  geom_point(size= 5 , shape = 19 ) + 
  geom_smooth(aes(x= iter, y = tr_rmse)) +
  geom_segment(x=0,xend=iter.line ,y=tr_rmse.line,yend=tr_rmse.line, color="red", lty=2)+
  geom_segment(x=iter.line, xend=iter.line, y=0, yend=tr_rmse.line, color="red", lty=2) +
  annotate(geom="label",x = iter.line,y = 0.8350,color=2,     
           label=paste("x=",round(iter.line,0),"\ny=",round(tr_rmse.line ,4)))+
  labs(title="RMSE for different number of latent factors" ,
       caption = "based on the output of r$train(train_set, opts = c(opts$min, nthread = 4, niter = 100), \n show just first 30 iterations)"
  ) +
  ylab("RMSE") +
  xlab("Latent factors")

We focused on optimizing the performance of (ii) using 10-fold cross-validation (nfold = 10). This consisted of studying the behavior of the RMSE for the Matrix Factorization recommender engine while varying the number of latent factors (dim), the gradient descent learning rate (lrate), the penalty parameters that prevent overfitting (cost) and the number of threads for parallel computing (nthread). The figure above shows the number of latent factors vs. the training RMSE. We noticed that the model needed just 15 latent factors to reach an RMSE below 0.8. To sum up, we observed that with these optimal criteria the RMSE value for the Matrix factorization method remains below 0.8 (see the value of rmse_mf_opt).
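For completeness, a wider grid could be explored along the same dimensions. A hedged sketch, not run for this report (the grid values are illustrative, while dim, lrate, costp_l2, costq_l2, nthread, niter and nfold are genuine recosystem tuning options):

# Wider (illustrative) grid over latent factors, learning rate and L2 penalties
opts_wide <- r$tune(train_set, opts = list(dim = c(10, 15, 20, 30),
                                           lrate = c(0.05, 0.1),
                                           costp_l2 = c(0.01, 0.1),
                                           costq_l2 = c(0.01, 0.1),
                                           nthread = 4, niter = 20,
                                           nfold = 10, verbose = FALSE))
opts_wide$min  # best parameter combination found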


VI. Conclusion and suggestions

This MovieLens project has examined candidate recommender system algorithms to predict movie ratings for the 10M version of the MovieLens data. Using the provided training set (edx) and validation set, we successively trained different linear regression models, recommender engines and ensemble methods. Model evaluation through the RMSE (root mean squared error) showed that the linear regression model with regularized effects on users and movies, and the Matrix Factorization with stochastic gradient descent, are the two most appropriate recommender systems to predict ratings on the validation set. Using 10-fold cross-validation to increase the performance of the Matrix Factorization method, in the sense of studying the behavior of the RMSE while varying the tuning criteria (dim, nthread, lrate, cost), we ended up with an RMSE value consistently below 0.8.
Future work includes further investigation of ensemble methods in order to obtain RMSE values lower than the 0.9 order we obtained in the Ensemble methods section. GBDTs combine several advantages, including the ability to find non-linear transformations, the ability to handle skewed variables without requiring transformations, computational robustness (e.g., highly collinear variables are not an issue) and high scalability. They also naturally lend themselves to parallelization. This has made them a good choice for several large-scale practical problems such as ranking results of a search engine (Koren, 2009). Following this BellKor winning solution, to obtain an RMSE below 0.88 with ensemble methods we would need to combine over 500 models, adding not only the clustering effects of the number of movies each user rated and the number of users that rated each movie, but also hidden units of a restricted Boltzmann machine.
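To make the blending idea concrete, here is a toy sketch of a linear blend of individual model predictions. Everything here is simulated for illustration (in practice, blend weights would be fitted on a held-out probe set, as in the BellKor solution, not on the validation set itself):

# Toy linear blend in the spirit of the BellKor solution (simulated data)
set.seed(1)
rating   <- sample(seq(0.5, 5, by = 0.5), 1000, replace = TRUE)
pred_gbm <- rating + rnorm(1000, sd = 0.95)  # simulated GBM predictions
pred_rf  <- rating + rnorm(1000, sd = 0.90)  # simulated RF predictions
pred_mf  <- rating + rnorm(1000, sd = 0.78)  # simulated MF predictions
blend_fit <- lm(rating ~ pred_gbm + pred_rf + pred_mf)
RMSE(predict(blend_fit), rating)  # the blend beats each individual model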


References

  • Aggarwal, C. C. (2016). Recommender systems (pp. 1-28). Cham: Springer International Publishing.

  • Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender systems: introduction and challenges. In Recommender systems handbook (pp. 1-34). Springer, Boston, MA.

  • Irizarry, R. (2018). Introduction to Data Science. GitHub page, https://rafalab.github.io/dsbook/

  • Shani, G., & Gunawardana, A. (2011). Evaluating recommendation systems. In Recommender systems handbook (pp. 257-297). Springer, Boston, MA.

  • Saranya, K. G., Sadasivam, G. S., & Chandralekha, M. (2016). Performance comparison of different similarity measures for collaborative filtering technique. Indian J. Sci. Technol, 9(29), 1-8.

  • Koren, Y. (2009). The bellkor solution to the netflix grand prize. Netflix prize documentation, 81, 1-10.

  • Chin, W. S., Zhuang, Y., Juan, Y. C., & Lin, C. J. (2015). A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Transactions on Intelligent Systems and Technology (TIST), 6(1), 2.

  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer series in statistics.


Appendix

Some notes: since the online book "Irizarry, R. (2018). Introduction to Data Science, https://rafalab.github.io/dsbook/" has undergone recent changes, please consider the following when reading this report: the Dimension reduction paragraph (3.3), https://rafalab.github.io/dsbook/matrix-factorization.html, has moved to https://rafalab.github.io/dsbook/large-datasets.html#matrix-factorization, and the Evaluated Algorithms paragraph (4.1), https://rafalab.github.io/dsbook/recommendation-systems.html, has moved to https://rafalab.github.io/dsbook/large-datasets.html#recommendation-systems.

Any remarks or corrections, in both substance and form, are welcome.

Louis Mono