I am a data scientist with experience in various fields of engineering, project management, and data analytics. On this website, you can find some of my personal and professional projects.
Personal Email: kwonkh0424@gmail.com
View My LinkedIn Profile
Project Description: The purpose of this experiment was to develop a distributed recommendation system using MongoDB and SparkML with the MovieLens dataset.
The MovieLens data consists of 28 million movie rating records (on a 1 to 5 scale) and metadata for all movies in the IMDb database.
The original dataset can be found here: https://grouplens.org/datasets/movielens/
The preprocessed rating data and movie descriptions were stored in a MongoDB database. The data was then accessed from a Databricks notebook and converted into a Spark DataFrame to take advantage of distributed computing, which significantly reduces computation time.
Using the SparkML package, several ML algorithms were evaluated to find the one that predicts ratings most accurately.
ALS (Alternating Least Squares) is a type of model-based collaborative filtering. It uses matrix factorization to predict the missing values in the user-item rating matrix. Users and items are each represented by latent factors, and combining a user's factors with an item's factors predicts the rating for a user-item pair that has not been observed. Matrix factorization of this kind was popularized by the Netflix Prize competition and is described by Yehuda Koren, Robert Bell, and Chris Volinsky. Link: https://dl.acm.org/doi/10.1109/MC.2009.263
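To make the alternating idea concrete, here is a minimal plain-Python sketch of ALS on a toy rating list. It is not the SparkML implementation used in this project. The function names (`solve`, `update_factors`, `als`, `predict`), the toy ratings, and the parameter values are all assumptions chosen for illustration. Each pass fixes the item factors and solves a small regularized least-squares problem per user, then does the same per item with the user factors fixed.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def update_factors(fixed, observed_by_row, k, reg):
    """Recompute one side's factors while the other side (`fixed`) is held constant."""
    updated = []
    for observed in observed_by_row:  # list of (index_on_fixed_side, rating)
        # Normal equations: (sum f f^T + reg * I) x = sum rating * f
        A = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for other, rating in observed:
            f = fixed[other]
            for i in range(k):
                b[i] += rating * f[i]
                for j in range(k):
                    A[i][j] += f[i] * f[j]
        updated.append(solve(A, b))
    return updated

def als(ratings, n_users, n_items, k=2, reg=0.1, iterations=20, seed=0):
    """ratings: iterable of (user_index, item_index, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(iterations):
        users = update_factors(items, by_user, k, reg)  # items fixed
        items = update_factors(users, by_item, k, reg)  # users fixed
    return users, items

def predict(users, items, u, i):
    """Predicted rating = dot product of the user's and the item's latent factors."""
    return sum(a * b for a, b in zip(users[u], items[i]))

# Toy data (made up): users 0-1 like items 0-1, users 2-3 like items 2-3.
# The pair (user 0, item 1) is left out and then predicted.
toy = [(u, i, 5.0 if (u < 2) == (i < 2) else 1.0)
       for u in range(4) for i in range(4) if (u, i) != (0, 1)]
user_factors, item_factors = als(toy, n_users=4, n_items=4)
print(predict(user_factors, item_factors, 0, 1))
```

The per-user and per-item solves are independent of each other, which is why this algorithm parallelizes well across a cluster.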
The model was first hyperparameter-tuned using 4-fold cross-validation on 20 percent of the dataset. With the best-performing parameters, the final predictions were made on the testing dataset.
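The tuning procedure can be sketched without any library as follows. In SparkML this is typically handled by `CrossValidator` together with `ParamGridBuilder`; the plain-Python version below only illustrates the logic and is not the code used in this project. The helper names (`k_fold_indices`, `grid_search`), the `fit_and_score` callback, and the toy model are assumptions made for the example.

```python
import random
from itertools import product

def k_fold_indices(n_rows, k, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        training = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield training, folds[held_out]

def grid_search(rows, param_grid, fit_and_score, k=4, seed=0):
    """Return the parameter combination with the lowest mean validation error.

    fit_and_score(train_rows, validation_rows, **params) must train a model on
    train_rows and return its error (lower is better) on validation_rows.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, val_idx in k_fold_indices(len(rows), k, seed):
            train = [rows[i] for i in train_idx]
            validation = [rows[i] for i in val_idx]
            scores.append(fit_and_score(train, validation, **params))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy stand-in for a real model: predict a scaled training mean, score by MSE.
def toy_fit_and_score(train, validation, scale):
    guess = scale * sum(train) / len(train)
    return sum((v - guess) ** 2 for v in validation) / len(validation)

tuning_sample = [4.0, 3.5, 5.0, 4.5, 3.0, 4.0, 2.5, 5.0]  # made-up ratings
print(grid_search(tuning_sample, {"scale": [0.8, 1.0, 1.2]}, toy_fit_and_score, k=4))
```

Tuning on a sample (20 percent here) keeps the search affordable, since every parameter combination is trained four times.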
ML models were evaluated using RMSE, MSE, and R^2. Here are the results:
Due to the complexity of the given dataset (3000+ categorical classes), the linear regression model didn’t perform well. Random Forest (RF) and Gradient Boosting performed fairly well. The ALS algorithm performed the best, with an MSE of 0.388. The model trained in under 2 hours, thanks to the distributed computing capability of SparkML.
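For reference, the three metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the SparkML evaluator used in the project, and the sample ratings are made up. Since RMSE is the square root of MSE, an MSE of 0.388 corresponds to an RMSE of about 0.62 rating points.

```python
import math

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the ratings."""
    return math.sqrt(mse(actual, predicted))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up ratings, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.5, 2.5]
print(mse(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```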
In order to verify the recommendation system, a new user was manually inserted into the dataset. This user gave high ratings (4-5) to animated Disney and Pixar movies and low ratings (1-2) to horror and thriller movies. The question was which movies the recommendation system would suggest to this fictional user.
The image below shows the top 5 movies that the ALS model recommended to the user. As shown, the recommendations were fairly reasonable, except for one of the movies.
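Producing a top-N list from a trained model comes down to ranking the predicted ratings of movies the user has not rated yet. The sketch below assumes the predictions are already available as a dictionary; the function name `top_n_recommendations` and the placeholder movie names and scores are hypothetical and are not results from this project.

```python
import heapq

def top_n_recommendations(predicted_ratings, already_rated, n=5):
    """Return the n unrated movies with the highest predicted rating."""
    candidates = (
        (score, movie)
        for movie, score in predicted_ratings.items()
        if movie not in already_rated
    )
    return [movie for score, movie in heapq.nlargest(n, candidates)]

# Placeholder predictions for the fictional user (made-up values).
predicted = {"movie_a": 4.8, "movie_b": 2.1, "movie_c": 4.5, "movie_d": 3.9}
print(top_n_recommendations(predicted, already_rated={"movie_a"}, n=2))
```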