Recommender systems are among the most successful and widespread applications of machine learning in business. They are a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. One of the most famous success stories of recommender systems is the Netflix competition, launched in October 2006. In 2009, at the end of the challenge, Netflix awarded a one-million-dollar prize to a developer team for an algorithm that increased the accuracy of the company’s recommendation engine by 10%. Many well-known recommendation algorithms, such as latent factor models, were popularized by the Netflix contest, which has become notable for its numerous contributions to the data science community. Using the 10M version of the MovieLens data, and following some of the ML techniques that went into the winning solution for the Netflix competition, our findings suggest that a linear regression model with regularized movie and user effects, and a recommender engine using matrix factorization with stochastic gradient descent, lead to the smallest Root Mean Squared Errors (RMSE), 0.8648170 and 0.7826226, respectively.
According to Aggarwal (2016), the recommendation problem may be formulated in various ways, among which the two main ones are the following. The first approach, the “prediction version of the problem”, aims to predict the rating value for a user-item combination; it is also referred to as the “matrix completion problem”. The second approach, the “ranking version of the problem”, seeks to recommend the top-k items for a particular user, or to determine the top-k users to target for a particular item.
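As a minimal illustration of the prediction version (the values below are hypothetical), the task amounts to filling in the NA cells of a user-item matrix:
# toy user-item rating matrix: predicting ratings means completing the NA cells
R <- matrix(c(5, NA, 3,
              NA, 4, NA,
              2, 1, NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))
R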
For the MovieLens project, we use the 10M version of the MovieLens dataset generated by the GroupLens research lab. We aim to create our own recommendation system using the “prediction version of the problem”: we train our algorithms on the inputs in the edx set to predict movie ratings in the validation set, and the RMSE is used to evaluate how close our predictions are to the true values in the validation set. The features are processed and analyzed with different machine learning techniques, focusing on regression models, ensemble methods (gradient boosting, random forest) and recommender engines (SlopeOne, matrix factorization with gradient descent), following the main steps of a data mining problem as described in Ricci et al. (2015).
Loading libraries and data
library(tidyverse)
library(caret)
library(data.table)
library(kableExtra)
library(lubridate)
library(Matrix.utils)
library(DT)
library(wordcloud)
library(RColorBrewer)
library(ggthemes)
library(irlba)
library(recommenderlab)
library(SlopeOne)
library(recosystem)
library(h2o)
load("EnvCapstone1.RData") #load output of Capstone_Movielens_data.R
class(edx)
"data.frame"
glimpse(edx)
Observations: 9,000,055
Variables: 6
$ userId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ movieId <dbl> 122, 185, 292, 316, 329, 355, 356, 362, 364, 370, 37...
$ rating <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ timestamp <int> 838985046, 838983525, 838983421, 838983392, 83898339...
$ title <chr> "Boomerang (1992)", "Net, The (1995)", "Outbreak (19...
$ genres <chr> "Comedy|Romance", "Action|Crime|Thriller", "Action|D...
class(validation)
"data.frame"
glimpse(validation)
Observations: 999,999
Variables: 6
$ userId <int> 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5...
$ movieId <dbl> 231, 480, 586, 151, 858, 1544, 590, 4995, 34, 432, 4...
$ rating <dbl> 5.0, 5.0, 5.0, 3.0, 2.0, 3.0, 3.5, 4.5, 5.0, 3.0, 3....
$ timestamp <int> 838983392, 838983653, 838984068, 868246450, 86824564...
$ title <chr> "Dumb & Dumber (1994)", "Jurassic Park (1993)", "Hom...
$ genres <chr> "Comedy", "Action|Adventure|Sci-Fi|Thriller", "Child...
After running the code provided in the project overview, we can see that the edx dataset is made of 6 features for a total of 9,000,055 observations. The validation set, which represents 10% of the 10M MovieLens dataset, contains the same features but a total of 999,999 observations. We made sure that the userId and movieId values in the edx set are also present in the validation set.
Each row represents a rating given by one user to one movie. The column “rating” is the outcome we want to predict, y. Taking into account both datasets, here are the features and their characteristics:
quantitative features
- userId: discrete, unique ID for the user.
- movieId: discrete, unique ID for the movie.
- timestamp: discrete, date and time the rating was given.
qualitative features
- title: nominal, movie title (not unique).
- genres: nominal, genres associated with the movie.
outcome, y
- rating: continuous, a rating between 0.5 and 5 given to the movie.
summary(edx)
userId movieId rating timestamp
Min. : 1 Min. : 1 Min. :0.500 Min. :7.897e+08
1st Qu.:18124 1st Qu.: 648 1st Qu.:3.000 1st Qu.:9.468e+08
Median :35738 Median : 1834 Median :4.000 Median :1.035e+09
Mean :35870 Mean : 4122 Mean :3.512 Mean :1.033e+09
3rd Qu.:53607 3rd Qu.: 3626 3rd Qu.:4.000 3rd Qu.:1.127e+09
Max. :71567 Max. :65133 Max. :5.000 Max. :1.231e+09
title genres
Length:9000055 Length:9000055
Class :character Class :character
Mode :character Mode :character
outcome, y : rating
# create a data frame "explore_ratings" which separates half-star and whole-star ratings in the edx set:
group <- ifelse(edx$rating %in% c(1, 2, 3, 4, 5), "whole_star", "half_star")
explore_ratings <- data.frame(edx$rating, group)
# histogram of ratings
ggplot(explore_ratings, aes(x= edx.rating, fill = group)) +
geom_histogram( binwidth = 0.2) +
scale_x_continuous(breaks=seq(0, 5, by= 0.5)) +
scale_fill_manual(values = c("half_star"="purple", "whole_star"="brown")) +
labs(x="rating", y="number of ratings", caption = "source data: edx set") +
ggtitle("histogram : number of ratings for each rating")
Exploring the ratings of the edx set, we notice the following facts:
+ no user gives 0 as a rating
+ the top 5 ratings, from most to least common, are 4, 3, 5, 3.5 and 2
+ the histogram shows that half-star ratings are less common than whole-star ratings
qualitative features: genres, title
Now, we are going to explore the features “genres” and “title” of our edx set.
# create a data frame top_genr which contains each single genre without combinations (e.g. Action rather than "Action|Crime|Thriller", Comedy rather than "Comedy|Romance"). This way we can visualize which single genre has the most ratings. We then plot a word cloud.
top_genr <- edx %>% separate_rows(genres, sep = "\\|") %>%
group_by(genres) %>%
summarize(count = n()) %>%
arrange(desc(count))
datatable(top_genr, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) ) %>%
formatRound('count',digits=0, interval = 3, mark = ",")
layout(matrix(c(1,2), nrow =2) , heights = c(1,4))
par(mar=rep(0,4))
plot.new()
text(x=0.5,y=0.5, "top Genres by number of ratings")
wordcloud(words=top_genr$genres,freq=top_genr$count,min.freq=50,
max.words = 20,random.order=FALSE,random.color=FALSE,
rot.per=0.35,colors = brewer.pal(8,"Dark2"),scale=c(5,.2),
family="plain",font=2,
main = "Top genres by number of ratings")
We notice that the “Drama” genre has the largest number of movie ratings, followed by the “Comedy” and “Action” genres. There is a last category (where the genre is not listed) which contains only 7 movie ratings.
# the data frame top_title contains the 20 movies with the highest number of ratings
top_title <- edx %>%
group_by(title) %>%
summarize(count=n()) %>%
top_n(20,count) %>%
arrange(desc(count))
# output the top 5 with the head function
kable(head(edx %>%
group_by(title,genres) %>%
summarize(count=n()) %>%
top_n(20,count) %>%
arrange(desc(count)) ,
5)) %>%
kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
column_spec(1,bold = T ) %>%
column_spec(2,bold =T) %>%
column_spec(3,bold=T)
title | genres | count |
---|---|---|
Pulp Fiction (1994) | Comedy|Crime|Drama | 31362 |
Forrest Gump (1994) | Comedy|Drama|Romance|War | 31079 |
Silence of the Lambs, The (1991) | Crime|Horror|Thriller | 30382 |
Jurassic Park (1993) | Action|Adventure|Sci-Fi|Thriller | 29360 |
Shawshank Redemption, The (1994) | Drama | 28015 |
#bar chart of top_title
top_title %>%
ggplot(aes(x=reorder(title, count), y=count)) +
geom_bar(stat='identity', fill="blue") + coord_flip(y=c(0, 40000)) +
labs(x="", y="Number of ratings") +
geom_text(aes(label= count), hjust=-0.1, size=3) +
labs(title="Top 20 movies title based \n on number of ratings" , caption = "source data: edx set")
The inspection of the “title” attribute is consistent with the previous analysis. The movies with the highest number of ratings belong to the top genre categories: movies like Pulp Fiction (1994), Forrest Gump (1994) or Jurassic Park (1993), which are in the top 5 by number of ratings, are part of the Drama, Comedy or Action genres.
# take the original "genres" column from the edx set, whatever combination appears in it.
# compute the average and standard error for each genre, and plot these as error bar plots for genres with more than 100,000 ratings.
edx %>% group_by(genres) %>%
summarize(n = n(), avg = mean(rating), se = sd(rating)/sqrt(n())) %>%
filter(n >= 100000) %>%
mutate(genres = reorder(genres, avg)) %>%
ggplot(aes(x = genres, y = avg, ymin = avg - 2*se, ymax = avg + 2*se)) +
geom_point() +
geom_errorbar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "error bar plots by genres" , caption = "source data : edx set") +
theme(
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white")
)
We observe that the generated plot shows strong evidence of a genre effect.
quantitative features: userId, movieId, timestamp
edx %>%
summarize(n_users = n_distinct(userId),
n_movies = n_distinct(movieId))
n_users n_movies
1 69878 10677
Even if each row represents a rating given by one user to one movie, the number of unique values is 69,878 for userId and 10,677 for movieId: both userId and movieId, which are stored as integers, should presumably be treated as factors for some analysis purposes. This also means that there are fewer movies offered for rating than users rating them. If we think in terms of a large matrix, with users on the rows and movies on the columns, a challenge we face is the sparsity of that matrix: it will contain many empty cells. Moreover, we face a curse-of-dimensionality problem. These issues must be addressed in our further analysis.
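As a quick sketch of that sparsity, we can compute the fill rate of the implied user-movie matrix directly from the edx set:
# fraction of user-movie cells that actually contain a rating
n_ratings <- nrow(edx)                                       # 9,000,055
n_cells <- n_distinct(edx$userId) * n_distinct(edx$movieId)  # 69878 * 10677
n_ratings / n_cells                                          # ~0.012 : about 1.2% of cells are filled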
# histogram of number of ratings by movieId
edx %>%
count(movieId) %>%
ggplot(aes(n)) +
geom_histogram( bins=30, color = "red") +
scale_x_log10() +
ggtitle("Movies") +
labs(subtitle ="number of ratings by movieId",
x="movieId" ,
y="number of ratings",
caption ="source data : edx set") +
theme(panel.border = element_rect(colour="black", fill=NA))
# histogram of number of ratings by userId
edx %>%
count(userId) %>%
ggplot(aes(n)) +
geom_histogram( bins=30, color = "gold") +
scale_x_log10() +
ggtitle("Users") +
labs(subtitle ="number of ratings by UserId",
x="userId" ,
y="number of ratings") +
theme(panel.border = element_rect(colour="black", fill=NA))
Visual exploration of the number of ratings by movieId on one hand, and of the number of ratings by userId on the other, shows the following relationships: some movies get rated more than others, and some users are more active than others at rating movies. This presumably explains the presence of movie effects and user effects.
# the edx set contains the timestamp variable, which represents the date and time at which the rating was given, in seconds since January 1, 1970. With the as_datetime function of the lubridate package we convert each timestamp to the right format. We then use the point geom to create a scatterplot of average rating versus date, and the smooth geom to help see patterns in the presence of overplotting.
edx %>%
mutate(date = round_date(as_datetime(timestamp), unit = "week")) %>%
group_by(date) %>%
summarize(rating = mean(rating)) %>%
ggplot(aes(date, rating)) +
geom_point() +
geom_smooth() +
ggtitle("Timestamp, time unit : week")+
labs(subtitle = "average ratings",
caption = "source data : edx set")
Analyzing the trend of the average rating versus the date, we notice some evidence of a time effect in the plot, but it is not a strong one.
Real-life data typically needs to be preprocessed (e.g. cleansed, filtered, transformed) in order to be used by the machine learning techniques in the analysis step.
In this section, we focus mainly on data preprocessing techniques that are of particular importance when designing a recommender system. These techniques include similarity measures (such as Euclidean distance, cosine distance, etc.), sampling methods, and dimensionality reduction (such as PCA or SVD). We will use them when necessary.
First, we are going to create the rating matrix. In the data exploration step, we already highlighted the sparsity problem arising when considering a large matrix with users on the rows and movies on the columns. Let’s effectively build that matrix.
While trying to build our matrix, we found that with the huge amount of data we have, the dcast and acast functions of the reshape2 and data.table packages are very time-consuming and cannot allocate vectors of size greater than 2.8 GB. We therefore went further with the Matrix and Matrix.utils packages, which provide the sparseMatrix function. The latter is less time-consuming and deals more efficiently with the memory problem.
# As said in the data exploration step, userId and movieId should be treated as factors for some analysis purposes. To perform this transformation we make a copy of the edx set, since we want to keep our original training set (edx) unchanged.
edx.copy <- edx
edx.copy$userId <- as.factor(edx.copy$userId)
edx.copy$movieId <- as.factor(edx.copy$movieId)
# use the sparseMatrix function; the output is a sparse matrix of class dgCMatrix.
# to use this function, userId and movieId are converted back to numeric vectors.
edx.copy$userId <- as.numeric(edx.copy$userId)
edx.copy$movieId <- as.numeric(edx.copy$movieId)
sparse_ratings <- sparseMatrix(i = edx.copy$userId,
j = edx.copy$movieId ,
x = edx.copy$rating,
dims = c(length(unique(edx.copy$userId)),
length(unique(edx.copy$movieId))),
dimnames = list(paste("u", 1:length(unique(edx.copy$userId)), sep = ""),
paste("m", 1:length(unique(edx.copy$movieId)), sep = "")))
# remove the copy we created
rm(edx.copy)
#give a look on the first 10 users
sparse_ratings[1:10,1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
u1 . . . . . . . . . .
u2 . . . . . . . . . .
u3 . . . . . . . . . .
u4 . . . . . . . . . .
u5 1 . . . . . 3 . . .
u6 . . . . . . . . . .
u7 . . . . . . . . . .
u8 . 2.5 . . 3 4 . . . .
u9 . . . . . . . . . .
u10 . . . . . . 3 . . .
#Convert rating matrix into a recommenderlab sparse matrix
ratingMat <- new("realRatingMatrix", data = sparse_ratings)
ratingMat
69878 x 10677 rating matrix of class 'realRatingMatrix' with 9000055 ratings.
Most data mining techniques used to model the different recommender system approaches (collaborative filtering, content-based, hybrid methods) depend heavily on defining an appropriate similarity or distance measure.
To measure the similarity between users or between items, it is possible to use the following:
+ Minkowski Distance
+ Mahalanobis distance
+ Pearson correlation
+ Cosine similarity
We will use the cosine similarity, a measure of similarity between two non-zero vectors of an inner product space given by the cosine of the angle between them. According to Ricci et al. (2015), it is defined as follows:
\[\cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}\]
where \(\cdot\) indicates vector dot product and ||x|| is the norm of vector x .
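For intuition, here is a minimal sketch of this formula in R (the helper cosine_sim is ours; below we use recommenderlab's similarity function instead):
# cosine similarity between two rating vectors, restricted to co-rated items
cosine_sim <- function(x, y) {
  common <- !is.na(x) & !is.na(y)
  sum(x[common] * y[common]) /
    (sqrt(sum(x[common]^2)) * sqrt(sum(y[common]^2)))
}
cosine_sim(c(5, 4, NA, 1), c(4, 5, 3, NA))  # computed on the two co-rated items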
The main advantages of this distance measure are developed in Saranya et al. (2016).
Because of our huge amount of data, we calculate the similarity on the first 50 users only, for visualization purposes.
# calculate the user similarity using the cosine similarity
similarity_users <- similarity(ratingMat[1:50,],
method = "cosine",
which = "users")
image(as.matrix(similarity_users), main = "User similarity")
#Using the same approach, I compute similarity between movies.
similarity_movies <- similarity(ratingMat[,1:50],
method = "cosine",
which = "items")
image(as.matrix(similarity_movies), main = "Movies similarity")
In each similarity matrix, rows and columns correspond to users (or to movies, for the second matrix), and each cell to the similarity between the corresponding pair. The redder the cell, the more similar the pair; note that the diagonal is red, since it compares each user with itself.
Observing the two similarity matrices suggests that there exist users with similar rating patterns and movies with similar rating patterns. However, sparsity and the curse of dimensionality remain recurrent problems, and we also have to deal with many NAs. Dimensionality reduction techniques such as PCA and SVD can help overcome these problems by transforming the original high-dimensional space into one of lower dimensionality.
To cope with the RAM problem, we use the irlba package, a fast and memory-efficient way to compute a partial SVD. The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest (or, optionally, smallest) singular values and corresponding singular vectors of a sparse or dense matrix, using a method of Baglama and Reichel.
According to Irizarry, R. (2018), Matrix factorization, GitHub page, accessed 15 January 2019, https://rafalab.github.io/dsbook/matrix-factorization.html, the SVD tells us that we can decompose an \(N \times P\) matrix \(Y\) with \(P < N\) as \[Y = UDV^T\] where
+ U : orthogonal matrix of dimensions N x m
+ D : diagonal matrix containing the singular values of the original matrix, m x m
+ V : orthogonal matrix of dimensions P x m
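A tiny sanity check of this decomposition on a toy matrix (not the MovieLens data), using base R's svd():
# svd() returns u, d, v with Y = U diag(d) V^T, up to rounding error
set.seed(2019)
Y_toy <- matrix(rnorm(12), nrow = 4)            # a 4 x 3 matrix, N = 4 > P = 3
s <- svd(Y_toy)
max(abs(Y_toy - s$u %*% diag(s$d) %*% t(s$v)))  # ~ 1e-15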
set.seed(1)
Y <- irlba(sparse_ratings,tol=1e-4,verbose=TRUE,nv = 100, maxit = 1000)
# plot singular values
plot(Y$d, pch=20, col = "blue", cex = 1.5, xlab='Singular Value', ylab='Magnitude',
main = "Singular Values for User-Movie Matrix")
# calculate sum of squares of all singular values
all_sing_sq <- sum(Y$d^2)
# variability described by first 6, 12, and 20 singular values
first_six <- sum(Y$d[1:6]^2)
print(first_six/all_sing_sq)
0.6187623
first_12 <- sum(Y$d[1:12]^2)
print(first_12/all_sing_sq)
0.7052297
first_20 <- sum(Y$d[1:20]^2)
print(first_20/all_sing_sq)
0.7646435
perc_vec <- NULL
for (i in 1:length(Y$d)) {
perc_vec[i] <- sum(Y$d[1:i]^2) / all_sing_sq
}
plot(perc_vec, pch=20, col = "blue", cex = 1.5, xlab='Singular Value', ylab='% of Sum of Squares of Singular Values', main = "Choosing k for Dimensionality Reduction")
lines(x = c(0,100), y = c(.90, .90))
The first six singular values explain more than half of the variability of the ratings matrix, with the first dozen explaining just over 70% and the first twenty more than 75%. However, the goal is to identify the first k singular values whose squares sum to at least 90% of the total sum of squares of all the singular values. A plot of the running sum of squares of the singular values shows that the 90% hurdle is reached somewhere between 50 and 60 singular values.
#To find the exact value of k, compute the length of the vector that remains from the running sum of squares after excluding any values that exceed 0.90.
k = length(perc_vec[perc_vec <= .90])
k
55
#I get the decomposition of Y: matrices U, D, and V accordingly
U_k <- Y$u[, 1:k]
dim(U_k)
69878 55
D_k <- Diagonal(x = Y$d[1:k])
dim(D_k)
55 55
V_k <- t(Y$v)[1:k, ]
dim(V_k)
55 10677
We notice that k=55 will retain 90% of variability . As we can see above, we now have a matrix \(D_k\) of size 55 x 55, a matrix \(U_k\) of size 69878 x 55, and a matrix \(V_k\) of size 55 x 10677. Therefore, the total number of numeric values required to house these component matrices is \((69878 * 55) + (55 * 55) + (55 * 10677) = 4,433,550\) . This represents an approximately 50.7% decrease in required storage relative to the original 9,000,055 entries.
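As an illustration (we never materialize the full dense matrix), the rank-k approximation of any small block of the ratings matrix can be reconstructed from these components:
# approximate ratings of the first 5 users for the first 5 movies,
# from the truncated decomposition U_k D_k V_k
approx_block <- U_k[1:5, ] %*% D_k %*% V_k[, 1:5]
round(approx_block, 2)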
Even after reducing the dimensionality, the RAM problem persists, which is why we needed a further reduction technique: selecting the relevant data from the whole rating matrix.
We know that some users saw more movies than others. So, instead of displaying some random users and movies, we select the most relevant users and movies: the users who have seen many movies and the movies that have been seen by many users. To identify and select them, we follow these steps:
#a. minimum number of movies per user: 90th percentile of the users' rating counts
min_n_movies <- quantile(rowCounts(ratingMat), 0.9)
print(min_n_movies)
90%
301
#b. minimum number of users per movie: 90th percentile of the movies' rating counts
min_n_users <- quantile(colCounts(ratingMat), 0.9)
print(min_n_users)
90%
2150.2
#c. keep the users and movies above both thresholds
ratings_movies <- ratingMat[rowCounts(ratingMat) > min_n_movies,
colCounts(ratingMat) > min_n_users]
ratings_movies
6978 x 1068 rating matrix of class 'realRatingMatrix' with 2313148 ratings.
We now have a rating matrix of 6,978 distinct users (rows) x 1,068 distinct movies (columns), with 2,313,148 ratings.
The data preprocessing phase is usually not definitive: it requires a lot of attention and repeated exploration of the variables. It must be aimed at obtaining better predictive results, and in this sense the later model evaluation phases can help us understand which preprocessing approaches are actually indispensable or useful for a specific model.
In this section, we explain the methodology behind the different machine learning algorithms we used, and present the metric for evaluating model performance.
1.1
Regression Models
As in Irizarry, R. (2018), Recommender systems, GitHub page, accessed 5 January 2019, https://rafalab.github.io/dsbook/recommendation-systems.html, we followed the same approach to build our linear regression models as the simplest possible recommendation systems. We started by assuming the same rating for all movies and users, with all the differences explained by random variation, \(Y_{u,i} = \mu + \varepsilon_{u,i}\), and then modelled the different effects successively.
movie effects: since we know that some movies are generally rated higher than others, we can augment our previous model by adding the term \(b_i\) to represent the average ranking for movie \(i\): \[Y_{u,i} = \mu + b_i + \varepsilon_{u,i} \quad (1)\] where:
° \(\mu\) is the “true” rating for all movies
° \(b_i\) is the movie effect, or bias: a movie-specific effect
° \(\varepsilon_{u,i}\) are independent errors sampled from the same distribution centered at 0
movie + user effects: we also know that some users are more active than others at rating movies. This implies that an additional improvement to our model may be: \[Y_{u,i} = \mu + b_i + b_u + \varepsilon_{u,i} \quad (2)\] where:
° \(\mu\), \(b_i\), \(\varepsilon_{u,i}\) are defined as in \((1)\)
° \(b_u\) is a user-specific effect
movie + user + time effects: as the data exploration showed some evidence of a time effect, if we define \(d_{u,i}\) as the day of user \(u\)'s rating of movie \(i\), the new model is: \[Y_{u,i} = \mu + b_i + b_u + f(d_{u,i}) + \varepsilon_{u,i} \quad (3)\] with \(f\) a smooth function of \(d_{u,i}\). Further, detailed explanations of time effects are developed in Koren (2009). For convenience, in our study we simply converted the timestamp to a datetime object and rounded it to the week.
movie + user + genre effects: the same approach with the genre effect leads to the following model: \[Y_{u,i} = \mu + b_i + b_u + \sum_{k=1}^K x_{u,i}^k \beta_k + \varepsilon_{u,i} \quad (4)\] where \(g_{u,i}\) is the genre of user \(u\)'s rating of movie \(i\), and \(x^k_{u,i} = 1\) if \(g_{u,i}\) is genre \(k\). We just present this model here; we did not include it in our analysis.
To fit these models, obtaining least squares estimates of \(\hat{b}_i\), \(\hat{b}_u\), \(\hat{b}_t\) with the lm() function of the stats package would be very slow, since there are more than 10,000 \(b_i\) effects and more than 60,000 \(b_u\) effects. For models (1), (2), (3) we instead compute the least squares estimates as follows:
° \(\hat{b}_i\) as the average of \(y_{u,i} - \hat{\mu}\) for each movie i and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i\) \((1')\)
° \(\hat{b}_u\) as the average of \(y_{u,i} - \hat{\mu} - \hat{b}_i\) and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i + \hat{b}_u\) \((2')\)
° \(\hat{b}_t\) as the average of \(y_{u,i} - \hat{\mu} - \hat{b}_i -\hat{b}_u\) and the predicted ratings are : \(\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i + \hat{b}_u + \hat{b}_t\) \((3')\)
Regularization allows us to penalize large estimates that come from small sample sizes. It has commonalities with the Bayesian approach of shrinking predictions. The general idea is to add a penalty for large values of \(b_i\) and \(b_u\) to the sum of squares equation that we minimize, so that having many large \(b_i\) or \(b_u\) makes the criterion harder to minimize.
A more accurate estimation of \(b_u\) and \(b_i\) treats them symmetrically, by solving the least squares problem \[\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2 + \lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right) \quad (5)\] where the first term, \(\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2\), strives to find \(b_u\)'s and \(b_i\)'s that fit the given ratings, while the regularizing term, \(\lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right)\), avoids overfitting by penalizing the magnitudes of the parameters. This least squares problem can be solved fairly efficiently by stochastic gradient descent, which is the subject of the matrix factorization recommender engine.
At this step, we used cross validation to pick the best \(\lambda\) (which corresponds to the minimum RMSE) and using calculus we can show that the values of \(b_i\) and \(b_u\) that minimize this equation are :
\[\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i} \sum_{u=1}^{n_i} \left(Y_{u,i} - \hat{\mu}\right)\]
\[\hat{b}_u(\lambda) = \frac{1}{\lambda + n_u} \sum_{i=1}^{n_u} \left(Y_{u,i} - \hat{\mu} - \hat{b}_i \right)\]
where \(n_i\) is the number of ratings made for movie \(i\) and \(n_u\) the number of ratings made by user \(u\).
When \(n_i\) is very large, which gives us a stable estimate, \(\lambda\) is effectively ignored since \(n_i + \lambda \approx n_i\). However, when \(n_i\) is small, the estimate \(\hat{b}_i(\lambda)\) is shrunken towards 0. The larger \(\lambda\), the more we shrink.
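A small numeric sketch of this shrinkage, with hypothetical values \(\hat{\mu} = 3.5\) and \(\lambda = 5.25\):
# a movie rated 5 by one user vs. a movie rated 5 by 1000 users
mu_hat <- 3.5
lambda <- 5.25
b_one  <- (5 - mu_hat) / (lambda + 1)            # ~0.24 : shrunk hard towards 0
b_many <- 1000 * (5 - mu_hat) / (lambda + 1000)  # ~1.49 : barely shrunk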
1.2
Recommender engines
Slope One was introduced in a 2005 paper by Daniel Lemire and Anna Maclachlan, ‘Slope One Predictors for Online Rating-Based Collaborative Filtering’. This algorithm is one of the simplest ways to perform collaborative filtering based on item similarity, which makes it very easy to implement and use, while its accuracy matches that of more complicated and resource-intensive algorithms. Slope One operates on the average differences of ratings between each pair of items and makes predictions based on their weighted values. More details are given in the paper.
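A toy sketch of the idea (ours, not the SlopeOne package implementation): the average difference between two items over the users who rated both is added to a new user's rating of the first item to predict the second.
# users 1 and 2 rated items A and B; user 3 rated only A
r_A <- c(1, 2, 4)
r_B <- c(1.5, 3, NA)
avg_diff  <- mean((r_B - r_A)[!is.na(r_B)])  # (0.5 + 1)/2 = 0.75
pred_B_u3 <- r_A[3] + avg_diff               # 4 + 0.75 = 4.75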
To perform the SlopeOne method, a specific preprocessing approach is needed; as the code in the Results section shows, we keep only the userId, movieId and rating columns, rename and sort them as data.tables, sample a smaller training set to cope with the RAM limits, and normalize the ratings.
Matrix factorization is a popular technique for solving the recommender system problem. The main idea is to approximate the matrix \(R_{m\times n}\) by the product of two matrices of lower dimension, \(P_{k\times m}\) and \(Q_{k\times n}\), such that \[R\approx P'Q\] Let \(p_u\) be the u-th column of \(P\) and \(q_v\) the v-th column of \(Q\); then the rating given by user \(u\) on item \(v\) is predicted as \(p'_u q_v\). A typical solution for \(P\) and \(Q\) is given by the following regularization problem, as defined in Chin et al. (2015): \[\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]\] where \((u,v)\) are the locations of observed entries in \(R\), \(r_{u,v}\) is the observed rating, \(f\) is the loss function, and \(\mu_P\), \(\mu_Q\), \(\lambda_P\), \(\lambda_Q\) are penalty parameters to avoid overfitting.
Matrix \(P\) represents the latent factors of users: each k-element column of \(P\) represents one user, and each k-element column of \(Q\) represents one item. So, to find the rating for item \(i\) by user \(u\), we simply compute the dot product P[,u]' x Q[,i]. Further descriptions of this technique are available in the recosystem package documentation.
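A minimal sketch of this prediction with hypothetical dimensions (k = 3 latent factors, 4 users, 5 items):
# latent factor matrices: one column per user in P, one per item in Q
k <- 3
P <- matrix(rnorm(k * 4), nrow = k)    # factors of 4 users
Q <- matrix(rnorm(k * 5), nrow = k)    # factors of 5 items
r_hat_23 <- crossprod(P[, 2], Q[, 3])  # predicted rating of user 2 for item 3
R_hat <- t(P) %*% Q                    # full 4 x 5 matrix of predicted ratings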
To perform our recommender system using parallel matrix factorization with stochastic gradient descent, we followed these steps:
As in the SlopeOne method, we created identical copies of the edx and validation sets (edx.copy and valid.copy), selecting only the userId, movieId and rating columns. With the recosystem package, the training data file needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers: “user_index”, “item_index”, “rating”.
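For instance, a sketch of that format using the first two edx rows (triplet_demo.txt is a hypothetical file name):
# each line of the file holds: user_index item_index rating
demo <- edx[1:2, c("userId", "movieId", "rating")]
write.table(demo, file = "triplet_demo.txt", sep = " ",
            row.names = FALSE, col.names = FALSE)
readLines("triplet_demo.txt")  # "1 122 5" "1 185 5"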
No RAM problem: unlike most other R packages for statistical modeling, which store the whole dataset and model object in memory, recosystem can significantly reduce memory use; for instance, the constructed model containing the information needed for prediction can be stored on the hard disk, and the output can be written directly into a file rather than kept in memory. That’s why we simply use the whole edx.copy as training set (9,000,055 observations) and valid.copy as validation set (999,999 observations).
Finally, we create a model object by calling Reco(), call the tune() method to select the best tuning parameters from a set of candidate values, train the model by calling the train() method, and use the predict() method to compute the predicted values.
In our study, before performing SlopeOne and matrix factorization, we also presented recommender algorithms of the recommenderlab package (UBCF, IBCF, Popular). However, we will not develop them here since they were not relevant for the evaluation through the root mean squared error (RMSE). Short descriptions of these methods are available in the recommenderlab package documentation.
1.3
Ensemble Methods
Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
H2O’s Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):
Initialize \(f_{k0} = 0, k=1,2,…,K\)
For \(m=1\) to \(M\):
1.set \(p_{k}(x)=\frac{e^{f_{k}(x)}}{\sum_{l=1}^{K}e^{f_{l}(x)}},k=1,2,…,K\)
2.For \(k=1\) to \(K\):
a. compute \(r_{ikm}=y_{ik}-p_{k}(x_{i}),i=1,2,…,N\)
b. Fit a regression tree to the targets \(r_{ikm},i=1,2,…,N\), giving terminal regions \(R_{jkm},j=1,2,…,J_{m}\)
c. compute \(\gamma_{jkm}=\frac{K-1}{K} \frac{\sum_{x_{i} \in R_{jkm}}r_{ikm}}{\sum_{x_{i} \in R_{jkm}}|r_{ikm}|(1-|r_{ikm}|)},j=1,2,…,J_m\)
d. Update \(f_{km}(x)=f_{k,m-1}(x)+\sum_{j=1}^{J_m}\gamma_{jkm} I(x\in R_{jkm})\)
Output \(\hat{f_{k}}(x)=f_{kM}(x),k=1,2,…,K\)
The h2o.gbm function has many parameters. For example, nfolds, ntrees, max_depth and learn_rate, which control cross-validation, the maximum number of trees, the maximum tree depth and the learning rate, can help to overcome overfitting. GBM’s parallel performance, instead, is strongly determined by the max_depth, nbins and nbins_cats parameters. In general, for datasets over 10 GB it makes sense to use 2 to 4 nodes; for datasets over 100 GB, over 10 nodes, and so on. Since GBDTs have a built-in ability to apply different methods to different slices of the data, we added some predictors that help the trees form useful clusterings: not only factor vectors of users and movies, but also the number of movies each user rated (n.movies_byUser) and the number of users that rated each movie (n.users_bymovie).
After creating a copy of the edx set with all features (and the same for the validation set), we added the n.movies_byUser and n.users_bymovie attributes with the mutate function, converted userId and movieId into factors, split the data into train and test sets, performed the GBM algorithm with some tuning of the parameters, and finally made predictions on the validation set.
The next steps consisted of building the gradient boosting and random forest models presented in the Results section.
For more details on Machine learning h2o algorithms (Gradient boosting, random forest,etc) and the h2o package, follow h2oalgorithms and h2o package, respectively.
To assess model performance, we evaluate how close our predictions are to the true rating values in the validation set. For this, we use the Root Mean Square Error (RMSE).
To construct the RMSE, we first need the residuals, the differences between the actual and the predicted values, denoted \(\hat{y}_{u,i} - y_{u,i}\), where \(y_{u,i}\) is the observed rating for user \(u\) and movie \(i\) and \(\hat{y}_{u,i}\) is the predicted one.
Residuals can be positive or negative, as the predicted value under- or over-estimates the actual value. Squaring the residuals, averaging the squares, and taking the square root gives us the RMSE, a measure of the spread of the observed ratings about the predicted ones.
\[\mbox{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i}^{} \left( \hat{y}_{u,i} - y_{u,i} \right)^2 }\]
# define the RMSE function as:
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
1.1
Regression Models
#a.movie effect
# calculate the average of all ratings in the edx set
mu <- mean(edx$rating)
# calculate b_i on the training set
movie_avgs <- edx %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
# predicted ratings
predicted_ratings_bi <- mu + validation %>%
left_join(movie_avgs, by='movieId') %>%
.$b_i
#b.movie + user effect
# calculate b_u on the training set
user_avgs <- edx %>%
left_join(movie_avgs, by='movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))
#predicted ratings
predicted_ratings_bu <- validation %>%
left_join(movie_avgs, by='movieId') %>%
left_join(user_avgs, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
#c.movie + user + time effect
# create a copy of the validation set, valid, with a date feature: the timestamp converted to a datetime object and rounded to the week
valid <- validation
valid <- valid %>%
mutate(date = round_date(as_datetime(timestamp), unit = "week"))
# calculate the time effect (b_t) on the training set
temp_avgs <- edx %>%
left_join(movie_avgs, by='movieId') %>%
left_join(user_avgs, by='userId') %>%
mutate(date = round_date(as_datetime(timestamp), unit = "week")) %>%
group_by(date) %>%
summarize(b_t = mean(rating - mu - b_i - b_u))
# predicted ratings
predicted_ratings_bt <- valid %>%
left_join(movie_avgs, by='movieId') %>%
left_join(user_avgs, by='userId') %>%
left_join(temp_avgs, by='date') %>%
mutate(pred = mu + b_i + b_u + b_t) %>%
.$pred
#d. calculate the RMSE for the movie, user and time effects
rmse_model1 <- RMSE(validation$rating,predicted_ratings_bi)
rmse_model1
0.9439087
rmse_model2 <- RMSE(validation$rating,predicted_ratings_bu)
rmse_model2
0.8653488
rmse_model3 <- RMSE(valid$rating,predicted_ratings_bt)
rmse_model3
0.8652511
We notice that with the movie and user effects combined, our RMSE decreases by about 8% with respect to the movie effect alone. The further improvement from adding the time effect is not significant (a decrease of about 0.01%). We now perform regularization, as explained in the Data analysis and methods section, using only the movie and user effects.
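These relative improvements can be checked directly:
# relative decrease of the RMSE between successive models
(rmse_model1 - rmse_model2) / rmse_model1  # ~0.083  : movie -> movie + user
(rmse_model2 - rmse_model3) / rmse_model2  # ~0.0001 : adding the time effect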
# before proceeding with regularization, remove the copy of the validation set, "valid"
rm(valid)
#e. regularization
# remembering (5), $\lambda$ is a tuning parameter. We can use cross-validation to choose it
lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l){
mu_reg <- mean(edx$rating)
b_i_reg <- edx %>%
group_by(movieId) %>%
summarize(b_i_reg = sum(rating - mu_reg)/(n()+l))
b_u_reg <- edx %>%
left_join(b_i_reg, by="movieId") %>%
group_by(userId) %>%
summarize(b_u_reg = sum(rating - b_i_reg - mu_reg)/(n()+l))
predicted_ratings_b_i_u <-
validation %>%
left_join(b_i_reg, by = "movieId") %>%
left_join(b_u_reg, by = "userId") %>%
mutate(pred = mu_reg + b_i_reg + b_u_reg) %>%
.$pred
return(RMSE(validation$rating,predicted_ratings_b_i_u))
})
qplot(lambdas, rmses)
#For the full model, the optimal λ is:
lambda <- lambdas[which.min(rmses)]
lambda
5.25
rmse_model4 <- min(rmses)
rmse_model4
0.864817
#summarize all the rmse on validation set for Linear regression models
rmse_results <- data.frame(methods=c("movie effect","movie + user effects","movie + user + time effects", "Regularized Movie + User Effect Model"),rmse = c(rmse_model1, rmse_model2,rmse_model3, rmse_model4))
kable(rmse_results) %>%
kable_styling(bootstrap_options = "striped" , full_width = F , position = "center") %>%
kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
column_spec(1,bold = T ) %>%
column_spec(2,bold =T ,color = "white" , background ="#D7261E")
methods | rmse |
---|---|
movie effect | 0.9439087 |
movie + user effects | 0.8653488 |
movie + user + time effects | 0.8652511 |
Regularized Movie + User Effect Model | 0.8648170 |
We observe that regularization brings the RMSE down to 0.8648170.
1.2
Recommender engines
# a. POPULAR, UBCF and IBCF algorithms of the recommenderlab package
model_pop <- Recommender(ratings_movies, method = "POPULAR",
param=list(normalize = "center"))
#prediction example on the first 10 users
pred_pop <- predict(model_pop, ratings_movies[1:10], type="ratings")
as(pred_pop, "matrix")[,1:10]
m1 m2 m3 m5 m6 m7 m9
u8 3.855564 NA 2.923282 NA NA 3.092309 2.574400
u17 NA NA 2.969625 NA 3.802000 NA 2.620742
u28 3.271469 NA NA NA 3.171562 2.508213 NA
u30 NA NA NA 2.792031 NA 3.109045 2.591136
u43 4.664153 3.756804 3.731871 3.583884 4.564246 NA 3.382989
u48 NA NA NA 3.448791 4.429153 3.765804 3.247895
u57 NA NA 2.652900 2.504913 3.485275 2.821926 2.304018
u70 4.421321 3.513973 3.489039 3.341052 4.321415 3.658066 3.140157
u88 NA 3.038295 3.013362 2.865375 3.845737 3.182388 2.664480
u103 NA 2.777942 2.753008 2.605021 NA 2.922034 2.404126
m10 m11 m14
u8 3.314650 3.459774 3.416521
u17 NA 3.506116 3.462863
u28 2.730555 2.875678 2.832426
u30 NA NA 3.433257
u43 4.123238 NA 4.225109
u48 NA 4.133269 4.090016
u57 3.044268 NA 3.146139
u70 3.880407 4.025531 3.982278
u88 3.404729 3.549853 3.506600
u103 3.144375 3.289499 3.246247
#Calculation of rmse for popular method
set.seed(1)
e <- evaluationScheme(ratings_movies, method="split", train=0.7, given=-5)
#5 ratings of 30% of users are excluded for testing
model_pop <- Recommender(getData(e, "train"), "POPULAR")
prediction_pop <- predict(model_pop, getData(e, "known"), type="ratings")
rmse_popular <- calcPredictionAccuracy(prediction_pop, getData(e, "unknown"))[1]
rmse_popular
RMSE
0.8482917
#Estimating rmse for UBCF using cosine similarity, with nn = 50 selected via cross-validation
set.seed(1)
model <- Recommender(getData(e, "train"), method = "UBCF",
param=list(normalize = "center", method="Cosine", nn=50))
prediction <- predict(model, getData(e, "known"), type="ratings")
rmse_ubcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ubcf
RMSE
0.8589153
#Estimating rmse for IBCF using cosine similarity, with k = 350 selected via cross-validation
set.seed(1)
model <- Recommender(getData(e, "train"), method = "IBCF",
param=list(normalize = "center", method="Cosine", k=350))
prediction <- predict(model, getData(e, "known"), type="ratings")
rmse_ibcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ibcf
RMSE
0.963769
The choice of nn in the UBCF method and of k in the IBCF method was based on cross-validation. We observe that user-based CF achieves a lower RMSE on the test set than item-based CF and the Popular method. However, these methods do not fit the scope of our study, since their RMSE evaluation is made on a test set obtained by partitioning: we want to predict ratings for the 999,999 rows of the validation set and then evaluate with the RMSE how close these predictions are to the true rating values. Moreover, the new data passed to the predict function must be of class “ratingMatrix”, and the validation set should not be modified.
We then moved to the SlopeOne and Matrix factorization methods.
#b. SlopeOne
# create copies of the training (edx) and validation sets, retaining only userId, movieId and rating
edx.copy <- edx %>%
select(-c("genres","title","timestamp"))
valid.copy <- validation %>%
select(-c("genres","title","timestamp"))
# rename the columns and convert them to characters in edx.copy and valid.copy; item_id corresponds to movieId
names(edx.copy) <- c("user_id", "item_id", "rating")
edx.copy <- data.table(edx.copy)
edx.copy[, user_id := as.character(user_id)]
edx.copy[, item_id := as.character(item_id)]
names(valid.copy) <- c("user_id", "item_id", "rating")
valid.copy <- data.table(valid.copy)
valid.copy[, user_id := as.character(user_id)]
valid.copy[, item_id := as.character(item_id)]
#setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference and is therefore very memory efficient.
setkey(edx.copy, user_id, item_id)
setkey(valid.copy, user_id, item_id)
# split the data to create a small training sample (to cope with the RAM memory issue)
set.seed(1)
idx <- createDataPartition(y = edx.copy$rating, times = 1, p = 0.1, list = FALSE)
edx.copy_train <- edx.copy[idx,]
#normalization
ratings_train_norm <- normalize_ratings(edx.copy_train)
#clear unused memory and increase the memory limit
invisible(gc())
invisible(memory.limit(size = 56000))
#Building Slope One model:
#knitr::knit_meta(class=NULL,clean=TRUE)
model <- build_slopeone(ratings_train_norm$ratings)
invisible(gc())
#Making predictions for validation set:
predictions <- predict_slopeone(model,
valid.copy[ , c(1, 2), with = FALSE],
ratings_train_norm$ratings)
unnormalized_predictions <- unnormalize_ratings(normalized = ratings_train_norm,
ratings = predictions)
rmse_slopeone <- RMSE( valid.copy$rating,unnormalized_predictions$predicted_rating)
# remove the created copies of the sets
rm(edx.copy, valid.copy, edx.copy_train)
rmse_slopeone
1.029637
#Before performing the MF method, clear unused memory
invisible(gc())
#c. Matrix Factorization with parallel stochastic gradient descent
# create copies of the training (edx) and validation sets, retaining only the userId, movieId and rating features, and rename the three columns
edx.copy <- edx %>%
select(-c("genres","title","timestamp"))
names(edx.copy) <- c("user", "item", "rating")
valid.copy <- validation %>%
select(-c("genres","title","timestamp"))
names(valid.copy) <- c("user", "item", "rating")
#as matrix
edx.copy <- as.matrix(edx.copy)
valid.copy <- as.matrix(valid.copy)
#write edx.copy and valid.copy tables on disk
write.table(edx.copy , file = "trainset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
write.table(valid.copy, file = "validset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
# data_file(): specifies a data set from a file on the hard disk.
set.seed(123) # This is a randomized algorithm
# point data_file() at the trainset.txt / validset.txt files written above
train_set <- data_file("trainset.txt")
valid_set <- data_file("validset.txt")
#Next step is to build Recommender object
r = Reco()
# Matrix Factorization : tuning training set
opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
costp_l1 = 0, costq_l1 = 0,
nthread = 1, niter = 10))
opts
## $min
## $min$dim
## [1] 30
##
## $min$costp_l1
## [1] 0
##
## $min$costp_l2
## [1] 0.01
##
## $min$costq_l1
## [1] 0
##
## $min$costq_l2
## [1] 0.1
##
## $min$lrate
## [1] 0.1
##
## $min$loss_fun
## [1] 0.797987
##
##
## $res
## dim costp_l1 costp_l2 costq_l1 costq_l2 lrate loss_fun
## 1 10 0 0.01 0 0.01 0.1 0.8255899
## 2 20 0 0.01 0 0.01 0.1 0.8070605
## 3 30 0 0.01 0 0.01 0.1 0.8157530
## 4 10 0 0.10 0 0.01 0.1 0.8279932
## 5 20 0 0.10 0 0.01 0.1 0.8019461
## 6 30 0 0.10 0 0.01 0.1 0.8016862
## 7 10 0 0.01 0 0.10 0.1 0.8284874
## 8 20 0 0.01 0 0.10 0.1 0.8018076
## 9 30 0 0.01 0 0.10 0.1 0.7979870
## 10 10 0 0.10 0 0.10 0.1 0.8372085
## 11 20 0 0.10 0 0.10 0.1 0.8315828
## 12 30 0 0.10 0 0.10 0.1 0.8274714
## 13 10 0 0.01 0 0.01 0.2 0.8162612
## 14 20 0 0.01 0 0.01 0.2 0.9162043
## 15 30 0 0.01 0 0.01 0.2 0.8809001
## 16 10 0 0.10 0 0.01 0.2 0.8173864
## 17 20 0 0.10 0 0.01 0.2 0.8599332
## 18 30 0 0.10 0 0.01 0.2 0.9107871
## 19 10 0 0.01 0 0.10 0.2 0.8222024
## 20 20 0 0.01 0 0.10 0.2 0.8004464
## 21 30 0 0.01 0 0.10 0.2 0.7995085
## 22 10 0 0.10 0 0.10 0.2 0.8410999
## 23 20 0 0.10 0 0.10 0.2 0.8264263
## 24 30 0 0.10 0 0.10 0.2 0.8259178
# Matrix Factorization : trains the recommender model
r$train(train_set, opts = c(opts$min, nthread = 1, niter = 20))
## iter tr_rmse obj
## 0 0.9737 1.1979e+007
## 1 0.8708 9.7956e+006
## 2 0.8382 9.0926e+006
## 3 0.8175 8.6875e+006
## 4 0.8021 8.4172e+006
## 5 0.7900 8.2171e+006
## 6 0.7801 8.0667e+006
## 7 0.7719 7.9424e+006
## 8 0.7650 7.8477e+006
## 9 0.7590 7.7642e+006
## 10 0.7539 7.7000e+006
## 11 0.7493 7.6431e+006
## 12 0.7451 7.5922e+006
## 13 0.7413 7.5483e+006
## 14 0.7379 7.5110e+006
## 15 0.7347 7.4773e+006
## 16 0.7318 7.4464e+006
## 17 0.7290 7.4151e+006
## 18 0.7266 7.3907e+006
## 19 0.7243 7.3692e+006
#Making prediction on validation set and calculating RMSE:
pred_file = tempfile()
r$predict(valid_set, out_file(pred_file))
## prediction output generated at C:\Users\LD493E~1.MON\AppData\Local\Temp\Rtmp2nldyL\file18505d985386
#Matrix Factorization : show first 10 predicted values
print(scan(pred_file, n = 10))
## [1] 4.26273 5.17404 4.73851 3.44047 4.45806 2.81567 4.10844 4.34959
## [9] 4.25262 3.45201
# read the true ratings from the validation file
scores_real <- read.table("validset.txt", header = FALSE, sep = " ")$V3
scores_pred <- scan(pred_file)
#remove copies of training and validation sets
rm(edx.copy, valid.copy)
rmse_mf <- RMSE(scores_real,scores_pred)
rmse_mf
## [1] 0.7826226
#summarize all the rmse for recommender algorithms
rmse_results <- data.frame(methods=c("SlopeOne","Matrix factorization with GD"),rmse = c(rmse_slopeone, rmse_mf))
kable(rmse_results) %>%
kable_styling(bootstrap_options = "striped" , full_width = F , position = "center") %>%
kable_styling(bootstrap_options = "bordered", full_width = F , position ="center") %>%
column_spec(1,bold = T ) %>%
column_spec(2,bold = T ,color = "white" , background ="#D7261E")
methods | rmse |
---|---|
SlopeOne | 1.0296368 |
Matrix factorization with GD | 0.7826226 |
Performing the matrix factorization with GD recommender method, we achieve an RMSE well below 0.8. Comparing the two methods, MF shows a decrease in RMSE of more than 23% with respect to SlopeOne. Note that the high RMSE of the SlopeOne method is also due to the small training set we used to cope with the RAM problem (see the Methods and analysis section, SlopeOne method). With a larger training set, we would give the algorithm a better chance of learning the patterns in the data, rather than just memorizing specific examples from the training set; this could also save time at validation, since we would not be dealing with quite so much overfitting.
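The decrease quoted above can be verified directly:
# relative decrease of the RMSE from SlopeOne to matrix factorization
(rmse_slopeone - rmse_mf) / rmse_slopeone  # ~0.24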
1.3
Ensemble Methods
#knitr::knit_meta(class=NULL,clean=TRUE)
#clear unused memory
invisible(gc())
# create a copy of the edx set retaining all the features
edx.copy <- edx
# create the new columns n.movies_byUser (number of movies each user rated) and n.users_bymovie (number of users that rated each movie), as defined in the Data analysis and methods section
edx.copy <- edx.copy %>%
group_by(userId) %>%
mutate(n.movies_byUser = n())
edx.copy <- edx.copy %>%
group_by(movieId) %>%
mutate(n.users_bymovie = n())
# factor vectors of users and movies
edx.copy$userId <- as.factor(edx.copy$userId)
edx.copy$movieId <- as.factor(edx.copy$movieId)
# do the same for the validation set
valida.copy <- validation
valida.copy <- valida.copy %>%
group_by(userId) %>%
mutate(n.movies_byUser = n())
valida.copy <- valida.copy %>%
group_by(movieId) %>%
mutate(n.users_bymovie = n())
valida.copy$userId <- as.factor(valida.copy$userId)
valida.copy$movieId <- as.factor(valida.copy$movieId)
#Attempt to start and/or connect to an H2O instance
h2o.init(
nthreads=-1, ## -1: use all available threads
max_mem_size = "10G") ## specify the memory size for the H2O cloud
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 days 19 hours
## H2O cluster timezone: Europe/Berlin
## H2O data parsing timezone: UTC
## H2O cluster version: 3.22.1.1
## H2O cluster version age: 3 months and 16 days !!!
## H2O cluster name: H2O_started_from_R_LD.MonoTsango_huk890
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.60 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.2 (2017-09-28)
#Remove all prior Clusters - remove in-memory objects to free up space
h2o.removeAll()
## [1] 0
#partitioning
splits <- h2o.splitFrame(as.h2o(edx.copy),
ratios = 0.7,
seed = 1)
train <- splits[[1]]
test <- splits[[2]]
#clear unused memory
invisible(gc())
#remove progress bar
h2o.no_progress()
#first gbm model : ntrees = 50, max depth = 5, learn rate = 0.1 , nfolds = 3
gbdt_1 <- h2o.gbm( x = c("movieId","userId","n.movies_byUser","n.users_bymovie") ,
y = "rating" ,
training_frame = train ,
nfolds = 3)
summary(gbdt_1)
Model Details:
==============
H2ORegressionModel: gbm
Model Key: GBM_model_R_1554985446584_5
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 50 50 40499 5
max_depth mean_depth min_leaves max_leaves mean_leaves
1 5 5.00000 31 32 31.96000
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.993987
RMSE: 0.9969889
MAE: 0.7947012
RMSLE: 0.2649048
Mean Residual Deviance : 0.993987
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.9881614
RMSE: 0.9940631
MAE: 0.792184
RMSLE: 0.2641577
Mean Residual Deviance : 0.9881614
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid
mae 0.79218405 2.8220305E-4 0.79227304 0.79262227
mean_residual_deviance 0.9881618 8.733483E-4 0.9878371 0.98981047
mse 0.9881618 8.733483E-4 0.9878371 0.98981047
r2 0.12153444 4.4647796E-4 0.12158577 0.120736726
residual_deviance 0.9881618 8.733483E-4 0.9878371 0.98981047
rmse 0.9940631 4.3923056E-4 0.99389994 0.9948922
rmsle 0.2641576 1.9758147E-4 0.26409227 0.26452777
cv_3_valid
mae 0.7916569
mean_residual_deviance 0.9868378
mse 0.9868378
r2 0.122280814
residual_deviance 0.9868378
rmse 0.9933971
rmsle 0.26385275
Scoring History:
timestamp duration number_of_trees training_rmse
1 2019-04-14 09:54:37 6 min 34.245 sec 0 1.06060
2 2019-04-14 09:54:39 6 min 36.048 sec 1 1.05347
3 2019-04-14 09:54:40 6 min 37.800 sec 2 1.04763
4 2019-04-14 09:54:45 6 min 41.968 sec 5 1.03557
5 2019-04-14 09:54:49 6 min 45.983 sec 8 1.02759
6 2019-04-14 09:54:53 6 min 50.112 sec 11 1.02167
7 2019-04-14 09:54:57 6 min 54.125 sec 14 1.01835
8 2019-04-14 09:55:02 6 min 59.305 sec 18 1.01422
9 2019-04-14 09:55:06 7 min 3.396 sec 21 1.01137
10 2019-04-14 09:55:10 7 min 7.479 sec 24 1.00976
11 2019-04-14 09:55:14 7 min 11.499 sec 27 1.00751
12 2019-04-14 09:55:18 7 min 15.503 sec 30 1.00611
13 2019-04-14 09:55:22 7 min 19.577 sec 33 1.00428
14 2019-04-14 09:55:27 7 min 24.751 sec 37 1.00208
15 2019-04-14 09:55:33 7 min 29.953 sec 41 1.00027
16 2019-04-14 09:55:38 7 min 35.106 sec 45 0.99838
17 2019-04-14 09:55:43 7 min 40.216 sec 49 0.99729
18 2019-04-14 09:55:44 7 min 41.798 sec 50 0.99699
training_mae training_deviance
1 0.85569 1.12487
2 0.85010 1.10979
3 0.84584 1.09752
4 0.83570 1.07241
5 0.82751 1.05595
6 0.82074 1.04380
7 0.81653 1.03703
8 0.81179 1.02865
9 0.80872 1.02287
10 0.80711 1.01962
11 0.80496 1.01509
12 0.80364 1.01225
13 0.80189 1.00858
14 0.79983 1.00416
15 0.79803 1.00054
16 0.79612 0.99675
17 0.79498 0.99458
18 0.79470 0.99399
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 movieId 2432895.250000 1.000000 0.560556
2 n.users_bymovie 1524796.250000 0.626741 0.351323
3 n.movies_byUser 342112.750000 0.140620 0.078825
4 userId 40345.531250 0.016583 0.009296
#second gbm model : ntrees = 100, max depth = 5, learn rate = 0.1 , nfolds = 5
# gbdt_2 <- h2o.gbm( x = c("movieId","userId") ,
# y = "rating" ,
# training_frame = train ,
# ntrees = 100,
# nfolds = 5)
#
# summary(gbdt_2)
#third gbm model : default parameters (ntrees = 50, max depth = 5, learn rate = 0.1), nfolds = 3, only movieId and userId as predictors
gbdt_3 <- h2o.gbm( x = c("movieId","userId") ,
y = "rating" ,
training_frame = train ,
nfolds = 3,
seed=1,
keep_cross_validation_predictions = TRUE,
fold_assignment = "Random")
summary(gbdt_3)
Model Details:
==============
H2ORegressionModel: gbm
Model Key: GBM_model_R_1554985446584_6
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 50 50 75964 5
max_depth mean_depth min_leaves max_leaves mean_leaves
1 5 5.00000 32 32 32.00000
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.9663329
RMSE: 0.9830223
MAE: 0.7816442
RMSLE: 0.2616349
Mean Residual Deviance : 0.9663329
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.9742736
RMSE: 0.987053
MAE: 0.7853317
RMSLE: 0.2625142
Mean Residual Deviance : 0.9742736
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid
mae 0.7853309 9.0927485E-4 0.78488976 0.78707933
mean_residual_deviance 0.9742714 0.0023254724 0.9733581 0.9786775
mse 0.9742714 0.0023254724 0.9733581 0.9786775
r2 0.13388452 0.001791772 0.13440077 0.13055533
residual_deviance 0.9742714 0.0023254724 0.9733581 0.9786775
rmse 0.9870505 0.0011775987 0.98658913 0.9892813
rmsle 0.26251355 3.044451E-4 0.2623934 0.26309055
cv_3_valid
mae 0.7840236
mean_residual_deviance 0.9707787
mse 0.9707787
r2 0.13669747
residual_deviance 0.9707787
rmse 0.985281
rmsle 0.26205668
Scoring History:
timestamp duration number_of_trees training_rmse
1 2019-04-14 10:01:40 4 min 50.030 sec 0 1.06060
2 2019-04-14 10:01:42 4 min 51.451 sec 1 1.05580
3 2019-04-14 10:01:43 4 min 52.718 sec 2 1.05189
4 2019-04-14 10:01:49 4 min 58.693 sec 3 1.04845
5 2019-04-14 10:01:54 5 min 3.532 sec 8 1.03584
6 2019-04-14 10:01:59 5 min 8.351 sec 13 1.02770
7 2019-04-14 10:02:03 5 min 12.484 sec 16 1.02267
8 2019-04-14 10:02:08 5 min 17.282 sec 21 1.01622
9 2019-04-14 10:03:58 7 min 7.633 sec 24 1.01206
10 2019-04-14 10:04:14 7 min 24.026 sec 36 0.99583
11 2019-04-14 10:04:19 7 min 29.111 sec 41 0.99139
12 2019-04-14 10:04:24 7 min 34.045 sec 45 0.98676
13 2019-04-14 10:04:29 7 min 38.826 sec 49 0.98375
14 2019-04-14 10:04:30 7 min 40.141 sec 50 0.98302
training_mae training_deviance
1 0.85569 1.12487
2 0.85150 1.11471
3 0.84810 1.10647
4 0.84494 1.09925
5 0.83230 1.07297
6 0.82359 1.05616
7 0.81876 1.04585
8 0.81289 1.03270
9 0.80898 1.02426
10 0.79409 0.99168
11 0.78988 0.98286
12 0.78531 0.97370
13 0.78228 0.96777
14 0.78164 0.96633
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 movieId 5096206.500000 1.000000 0.969387
2 userId 160936.781250 0.031580 0.030613
# Since the model gbdt_3 has the lowest RMSE on the training set,
# I evaluate its performance on the test set
h2o.performance(gbdt_3, test)
H2ORegressionMetrics: gbm
MSE: 0.9659998
RMSE: 0.9828529
MAE: 0.7816576
RMSLE: 0.261309
Mean Residual Deviance : 0.9659998
# I predict ratings on the validation set and evaluate the RMSE
pred.ratings.gbdt_3 <- h2o.predict(gbdt_3,as.h2o(valida.copy))
rmse_gbdt <- RMSE(pred.ratings.gbdt_3, as.h2o(valida.copy$rating))
rmse_gbdt
0.9839035
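As a sanity check, the same RMSE can be computed locally by pulling the H2O predictions back into R. This is a minimal sketch, assuming the predictions frame exposes its usual predict column and that valida.copy is an ordinary R data frame:
# pull the predictions into a plain R vector, then apply the RMSE formula directly
pred_vec <- as.vector(pred.ratings.gbdt_3$predict)
sqrt(mean((pred_vec - valida.copy$rating)^2))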
# clear unused memory
invisible(gc())
# disable the progress bar
h2o.no_progress()
# first rf model :
rf1 <- h2o.randomForest(
training_frame = train,
x= c("movieId" ,"userId" ,"timestamp", "n.movies_byUser","n.users_bymovie"),
y= "rating",
ntrees = 50,
max_depth = 20
)
summary(rf1)
Model Details:
==============
H2ORegressionModel: drf
Model Key: DRF_model_R_1554985446584_7
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 50 50 106340480 20
max_depth mean_depth min_leaves max_leaves mean_leaves
1 20 20.00000 72728 140247 105495.26000
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.9361753
RMSE: 0.9675615
MAE: 0.7676068
RMSLE: 0.2578431
Mean Residual Deviance : 0.9361753
Scoring History:
timestamp duration number_of_trees training_rmse
1 2019-04-14 10:13:07 0.009 sec 0 NA
2 2019-04-14 10:13:32 24.497 sec 1 1.02082
3 2019-04-14 10:13:43 35.832 sec 2 1.01305
4 2019-04-14 10:14:06 58.883 sec 3 1.00826
5 2019-04-14 10:14:15 1 min 7.689 sec 4 1.00373
training_mae training_deviance
1 NA NA
2 0.80340 1.04206
3 0.79847 1.02626
4 0.79546 1.01659
5 0.79269 1.00748
---
timestamp duration number_of_trees training_rmse
46 2019-04-14 10:20:55 7 min 48.304 sec 45 0.96815
47 2019-04-14 10:21:03 7 min 56.113 sec 46 0.96768
48 2019-04-14 10:21:11 8 min 3.654 sec 47 0.96775
49 2019-04-14 10:21:19 8 min 11.956 sec 48 0.96770
50 2019-04-14 10:21:27 8 min 20.243 sec 49 0.96763
51 2019-04-14 10:21:36 8 min 28.563 sec 50 0.96756
training_mae training_deviance
46 0.76811 0.93731
47 0.76770 0.93640
48 0.76777 0.93654
49 0.76773 0.93645
50 0.76767 0.93630
51 0.76761 0.93618
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 movieId 16745293.000000 1.000000 0.381465
2 n.users_bymovie 11316529.000000 0.675804 0.257795
3 n.movies_byUser 6366762.500000 0.380212 0.145038
4 userId 4925214.000000 0.294125 0.112199
5 timestamp 4543522.500000 0.271331 0.103503
# second rf model: ntrees = 50, max_depth = 20, nfolds = 5
# rf2 <- h2o.randomForest(
# training_frame = train,
# x= c("movieId" ,"userId", "n.movies_byUser","n.users_bymovie"),
# y= "rating",
# ntrees = 50,
# max_depth = 20,
# nfolds = 5
# )
#
# summary(rf2)
# third rf model: ntrees = 50 (default), max_depth = 20 (default), nfolds = 3
rf_3 <- h2o.randomForest(
training_frame = train,
x= c("movieId" ,"userId"),
y= "rating",
nfolds=3,
seed=1,
keep_cross_validation_predictions = TRUE,
fold_assignment = "Random"
)
summary(rf_3)
Model Details:
==============
H2ORegressionModel: drf
Model Key: DRF_model_R_1554985446584_8
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 50 50 182067312 20
max_depth mean_depth min_leaves max_leaves mean_leaves
1 20 20.00000 40110 75349 58403.06000
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.9120648
RMSE: 0.9550208
MAE: 0.7518812
RMSLE: 0.2540035
Mean Residual Deviance : 0.9120648
H2ORegressionMetrics: drf
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.9133628
RMSE: 0.9557002
MAE: 0.7524639
RMSLE: 0.2542855
Mean Residual Deviance : 0.9133628
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid
mae 0.752463 5.8236887E-4 0.7531027 0.7529861
mean_residual_deviance 0.91336036 0.0016835167 0.9150332 0.91505456
mse 0.91336036 0.0016835167 0.9150332 0.91505456
r2 0.18803298 0.0013799736 0.18626863 0.18707715
residual_deviance 0.91336036 0.0016835167 0.9150332 0.91505456
rmse 0.9556981 8.8118477E-4 0.95657367 0.9565849
rmsle 0.25428495 2.328422E-4 0.25444627 0.2545826
cv_3_valid
mae 0.7513002
mean_residual_deviance 0.90999335
mse 0.90999335
r2 0.19075316
residual_deviance 0.90999335
rmse 0.95393574
rmsle 0.25382596
Scoring History:
timestamp duration number_of_trees training_rmse
1 2019-04-14 10:47:28 24 min 43.072 sec 0 NA
2 2019-04-14 10:47:42 24 min 57.091 sec 1 1.00942
3 2019-04-14 10:47:49 25 min 3.508 sec 2 1.00699
4 2019-04-14 10:47:55 25 min 10.047 sec 3 1.00384
5 2019-04-14 10:48:02 25 min 16.348 sec 4 0.99281
training_mae training_deviance
1 NA NA
2 0.79478 1.01893
3 0.79059 1.01402
4 0.78826 1.00769
5 0.77935 0.98566
---
timestamp duration number_of_trees training_rmse
46 2019-04-14 10:52:05 29 min 19.560 sec 45 0.95540
47 2019-04-14 10:52:10 29 min 24.845 sec 46 0.95528
48 2019-04-14 10:52:16 29 min 30.937 sec 47 0.95495
49 2019-04-14 10:52:22 29 min 36.816 sec 48 0.95493
50 2019-04-14 10:52:28 29 min 42.346 sec 49 0.95492
51 2019-04-14 10:52:33 29 min 47.573 sec 50 0.95502
training_mae training_deviance
46 0.75219 0.91279
47 0.75206 0.91256
48 0.75179 0.91194
49 0.75176 0.91188
50 0.75178 0.91186
51 0.75188 0.91206
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 movieId 38668052.000000 1.000000 0.716631
2 userId 15290050.000000 0.395418 0.283369
# Since the model rf_3 has the lowest RMSE on the training set,
# I evaluate its performance on the test set
h2o.performance(rf_3, test)
H2ORegressionMetrics: drf
MSE: 0.9087191
RMSE: 0.9532676
MAE: 0.7507651
RMSLE: 0.2534143
Mean Residual Deviance : 0.9087191
# I predict ratings on the validation set and evaluate the RMSE
pred.ratings.rf_3 <- h2o.predict(rf_3,as.h2o(valida.copy))
rmse_rf <- RMSE(pred.ratings.rf_3, as.h2o(valida.copy$rating))
rmse_rf
0.9543947
# Stacked ensemble: I combine the two best previous models (gbdt_3 and rf_3).
# Both base models were trained with the same nfolds, seed and fold_assignment,
# and with keep_cross_validation_predictions = TRUE, as h2o.stackedEnsemble requires.
ensemble <- h2o.stackedEnsemble(x = c("movieId" ,"userId"),
y = "rating",
training_frame = train,
model_id = "my_ensemble_auto",
base_models = list(gbdt_3@model_id, rf_3@model_id))
# I predict ratings on the validation set and evaluate the RMSE
pred.ratings.ensemble <- h2o.predict(ensemble,as.h2o(valida.copy))
rmse_ensemble <- RMSE(pred.ratings.ensemble, as.h2o(valida.copy$rating))
rmse_ensemble
0.9493815
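To see how the ensemble weights its two base learners, we can inspect the metalearner. A minimal sketch, assuming the default GLM metalearner and the slot names used in the H2O R examples:
# retrieve the metalearner fitted on the base models' cross-validation predictions
metalearner <- h2o.getModel(ensemble@model$metalearner$name)
h2o.coef(metalearner) # GLM coefficients: one weight per base model, plus an intercept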
rmse_results <- data.frame(methods = c("gradient boosting", "random forest", "stacked ensemble"), rmse = c(rmse_gbdt, rmse_rf, rmse_ensemble))
kable(rmse_results) %>%
  kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, position = "center") %>%
  column_spec(1, bold = T) %>%
  column_spec(2, bold = T, color = "white", background = "#D7261E")
methods           |      rmse
------------------|----------
gradient boosting | 0.9839035
random forest     | 0.9543947
stacked ensemble  | 0.9493815
#remove objects
rm(edx.copy,valida.copy)
# shut down the H2O cluster
h2o.shutdown()
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?
We observed that the best GBM and RF models (gbdt_3 and rf_3) have the lowest root mean squared errors on the training set and produced RMSE values on the validation set of 0.9839035 and 0.9543947, respectively. For the GBM model this was achieved with 50 trees of maximum depth 5, using only the movieId and userId features and 3-fold cross-validation. For the RF model it was achieved with 3-fold cross-validation on 50 trees of maximum depth 20, again using only the movieId and userId features. Training is, however, noticeably slower for RF than for GBM: an RF tree can grow to depth 20, which allows up to 1 + 2 + 4 + ... + 2^19 = 2^20 - 1 ≈ 1M internal nodes per tree, each of which must be evaluated for splitting, whereas a GBM tree of depth 5 contains at most 1 + 2 + 4 + 8 + 16 = 31 internal nodes. The RMSE values on the validation set for the GBM, RF and stacked ensemble models are summarized in the table above.
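A quick check of the node counts used in this argument (plain arithmetic, nothing model-specific):
# maximum number of internal (split) nodes in a full binary tree of a given depth
max_internal_nodes <- function(depth) 2^depth - 1
max_internal_nodes(5)  # 31, the GBM case
max_internal_nodes(20) # 1048575, roughly 1M, the RF case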
Based on the different RMSE values we found for the linear regression models, recommender algorithms and ensemble models, we choose the two candidates that best predict ratings on the validation set: the linear regression model with regularized movie + user effects (i) and matrix factorization with stochastic gradient descent (ii).
# I create copies of the training (edx) and validation sets in which I retain only the userId, movieId and rating features, and rename the three columns.
edx.copy <- edx %>%
select(-c("genres","title","timestamp"))
names(edx.copy) <- c("user", "item", "rating")
valid.copy <- validation %>%
select(-c("genres","title","timestamp"))
names(valid.copy) <- c("user", "item", "rating")
#as matrix
edx.copy <- as.matrix(edx.copy)
valid.copy <- as.matrix(valid.copy)
#write edx.copy and valid.copy tables on disk
write.table(edx.copy , file = "trainset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
write.table(valid.copy, file = "validset.txt" , sep = " " , row.names = FALSE, col.names = FALSE)
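The files written above are plain whitespace-separated (user, item, rating) triplets, the format recosystem's data_file() expects. A quick peek at the first lines (for example, "1 122 5" for the first edx record):
readLines("trainset.txt", n = 3)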
# data_file() specifies a data set from a file on the hard disk.
# Note: system.file() only locates files shipped inside an installed package,
# so we point data_file() directly at the files written above.
set.seed(123) # This is a randomized algorithm
train_set <- data_file("trainset.txt")
valid_set <- data_file("validset.txt")
#Next step is to build Recommender object
r = Reco()
#Optimizing/tuning the recommender model
opts <- r$tune(train_set , opts = list(dim = c(1:20), lrate = c(0.05),
nthread = 4 , costp_l1=0,
costq_l1 = 0,
niter = 50, nfold = 10,
verbose = FALSE))
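Before training, it is worth examining what the tuning step selected. A minimal sketch of the object r$tune() returns (the exact values depend on the run):
opts$min       # list of the best parameters found (dim, lrate, costp_l1, costq_l1, ...)
head(opts$res) # grid of all tried combinations with their cross-validated loss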
#trains the recommender model
r$train(train_set, opts = c(opts$min , nthread = 4, niter = 100,verbose=FALSE))
#Making prediction on validation set and calculating RMSE:
pred_file = tempfile()
r$predict(valid_set, out_file(pred_file))
prediction output generated at C:\Users\LD493E~1.MON\AppData\Local\Temp\Rtmp2nldyL\file1850377a7d83
#Matrix Factorization : show first 10 predicted values
print(scan(pred_file, n = 10))
4.63566 4.97483 4.98619 4.08538 4.71547 2.88960 4.14103 4.39614
4.65707 3.52472
#valid_set
scores_real <- read.table("validset.txt", header = FALSE, sep = " ")$V3
scores_pred <- scan(pred_file)
# remove edx.copy and valid.copy objects
rm(edx.copy, valid.copy)
rmse_mf_opt <- RMSE(scores_real,scores_pred)
rmse_mf_opt
0.7887485
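An equivalent route keeps everything in memory instead of round-tripping through text files. A minimal sketch using recosystem's data_memory() and out_memory(), assuming the edx.copy and valid.copy matrices built above (before they were removed):
# in-memory training and validation data; indices start at 1, hence index1 = TRUE
train_mem <- data_memory(user_index = edx.copy[, "user"],
                         item_index = edx.copy[, "item"],
                         rating = edx.copy[, "rating"],
                         index1 = TRUE)
valid_mem <- data_memory(user_index = valid.copy[, "user"],
                         item_index = valid.copy[, "item"],
                         index1 = TRUE)
r$train(train_mem, opts = c(opts$min, nthread = 4, niter = 100, verbose = FALSE))
pred_mem <- r$predict(valid_mem, out_memory()) # returns a plain numeric vector
RMSE(valid.copy[, "rating"], pred_mem)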
load("EnvCapstone_MatrixFacto_trainRmse.RData")
iter.line <- 15
tr_rmse.line <- mat.facto_rmse$tr_rmse[which(mat.facto_rmse$iter == iter.line)]
mat.facto_rmse %>%
  ggplot(aes(x = iter, y = tr_rmse)) +
  geom_point(size = 5, shape = 19) +
  geom_smooth(aes(x = iter, y = tr_rmse)) +
  geom_segment(x = 0, xend = iter.line, y = tr_rmse.line, yend = tr_rmse.line, color = "red", lty = 2) +
  geom_segment(x = iter.line, xend = iter.line, y = 0, yend = tr_rmse.line, color = "red", lty = 2) +
  annotate(geom = "label", x = iter.line, y = 0.8350, color = 2,
           label = paste("x=", round(iter.line, 0), "\ny=", round(tr_rmse.line, 4))) +
  labs(title = "Training RMSE across iterations",
       caption = "based on the output of r$train(train_set, opts = c(opts$min, nthread = 4, niter = 100)), \n showing just the first 30 iterations") +
  ylab("RMSE") +
  xlab("Iteration")
We focused on optimizing the performance of (ii) using 10-fold cross-validation (nfold = 10). This consisted of studying how the RMSE of the matrix factorization recommender behaves when varying the number of latent factors (dim), the gradient descent learning rate (lrate), the penalty parameters that guard against overfitting (costp_l1, costq_l1) and the number of threads for parallel computing (nthread). The figure above shows the training RMSE over the first 30 iterations of the tuned model: after about 15 iterations the training RMSE already drops below 0.8. To sum up, with the tuned parameters the RMSE of the matrix factorization method on the validation set also stays below 0.8 (see the value of rmse_mf_opt, 0.7887485).
This MovieLens project has examined candidate recommender system algorithms for predicting movie ratings on the 10M version of the MovieLens data. Using the provided training set (edx) and validation set, we successively trained different linear regression models, recommender engines and ensemble methods. Model evaluation through the RMSE (root mean squared error) showed that the linear regression model with regularized user and movie effects, and matrix factorization with stochastic gradient descent, are the two most appropriate recommender systems for predicting ratings on the validation set. Using 10-fold cross-validation to tune the matrix factorization method, in the sense of understanding how the RMSE responds to the tuning parameters (dim, lrate, cost, nthread), we ended up with an RMSE value consistently below 0.8.
Future work includes further investigation of ensemble methods in order to obtain RMSE values lower than the 0.9 range we reached in the ensemble methods section. GBDTs combine several advantages: the ability to capture non-linear transformations, the ability to handle skewed variables without requiring transformations, computational robustness (e.g., highly collinear variables are not an issue) and high scalability. They also lend themselves naturally to parallelization, which has made them a good choice for several large-scale practical problems such as ranking search engine results (Koren, 2009). Following the BellKor winning solution, obtaining an RMSE below 0.88 with ensemble methods would require combining over 500 models, adding not only the clustering effects of the number of movies each user rated and the number of users that rated each movie, but also the hidden units of a restricted Boltzmann machine.
Aggarwal, C. C. (2016). Recommender systems (pp. 1-28). Cham: Springer International Publishing.
Ricci, F., Rokach, L., & Shapira, B. (2015). Recommender systems: introduction and challenges. In Recommender systems handbook (pp. 1-34). Springer, Boston, MA.
Irizarry, R. (2018). Introduction to Data Science. GitHub page, https://rafalab.github.io/dsbook/
Shani, G., & Gunawardana, A. (2011). Evaluating recommendation systems. In Recommender systems handbook (pp. 257-297). Springer, Boston, MA.
Saranya, K. G., Sadasivam, G. S., & Chandralekha, M. (2016). Performance comparison of different similarity measures for collaborative filtering technique. Indian J. Sci. Technol, 9(29), 1-8.
Koren, Y. (2009). The BellKor solution to the Netflix grand prize. Netflix Prize documentation, 81, 1-10.
Chin, W. S., Zhuang, Y., Juan, Y. C., & Lin, C. J. (2015). A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Transactions on Intelligent Systems and Technology (TIST), 6(1), 2.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189-1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer series in statistics.
Some notes: since the online book (Irizarry, R., 2018, Introduction to Data Science, https://rafalab.github.io/dsbook/) has undergone recent changes, please note the following for this report: the Dimension reduction paragraph (3.3), https://rafalab.github.io/dsbook/matrix-factorization.html, has moved to https://rafalab.github.io/dsbook/large-datasets.html#matrix-factorization, and the Evaluated algorithms paragraph (4.1), https://rafalab.github.io/dsbook/recommendation-systems.html, has moved to https://rafalab.github.io/dsbook/large-datasets.html#recommendation-systems.
Thank you in advance for pointing out anything that could be corrected, in both substance and form.
Louis Mono