Alternating Least Squares (ALS) Spark ML

Alternating Least Squares (ALS) for the Santander Kaggle competition

The Kaggle Santander competition has just concluded. For this competition I decided to continue learning the Spark environment and to invest time in understanding how to do recommendation with Apache Spark. Judging by the competition leaderboard this was not a successful choice :), but it gave me the opportunity to learn a new strategy for working with data. I'm fairly sure I could do better with this kind of approach by learning more about how collaborative filtering works.

The recommendation strategy

Citing Wikipedia:

“Recommender systems or recommendation systems (sometimes replacing “system” with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the “rating” or “preference” that a user would give to an item.”

The main idea is to build a users × items matrix of rating values and to factorize it, so that we can recommend to each user the products that similar users have rated highly.

ALS

Apache Spark ML implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations.

The ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item rating matrix A into a user-to-feature matrix U and an item-to-feature matrix M, and it runs the alternating optimization in a parallel fashion. The algorithm uncovers the latent factors that explain the observed user-to-item ratings and finds the factor weights that minimize the least-squares error between predicted and actual ratings.
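In formulas, ALS-WR minimizes roughly the following objective, where r_ui is the observed rating of item i by user u, u_u and m_i are the user and item factor vectors, and n_u, n_i count the ratings of user u and item i (the "weighted-lambda" part):

```latex
\min_{U,M} \;\sum_{(u,i)\ \text{observed}} \bigl(r_{ui} - \mathbf{u}_u^{\top}\mathbf{m}_i\bigr)^2
\;+\; \lambda \Bigl(\sum_{u} n_u \,\lVert \mathbf{u}_u \rVert^2 \;+\; \sum_{i} n_i \,\lVert \mathbf{m}_i \rVert^2 \Bigr)
```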

The data at our disposal from the Kaggle site

In this competition, you are provided with 1.5 years of customers behavior data from Santander bank to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as “credit card”, “savings account”, etc. You will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind_(xyz)_ult1, which are the columns #25 – #48 in the training data. You will predict what a customer will buy in addition to what they already had at 2016-05-28

So we have to select which of the 24 products a user will buy on 2016-06-28, given what he already has in the previous month.

ALS with explicit preferences

Following the discussion on the Kaggle forum, I tried to use the products a user holds in one given month as explicit product ratings.

The code

ALS using Spark 2.0+
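The snippets below are a rough sketch of the pipeline rather than the exact code used for the submission: the Kaggle column names (ncodpers, fecha_dato, ind_*_ult1) are real, while file paths, dates and hyperparameters are assumptions. Everything assumes a plain PySpark 2.0+ session:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS

# Local Spark session for the whole pipeline.
spark = (SparkSession.builder
         .appName("santander-als")
         .getOrCreate())
```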

Reading data from disk
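A sketch of loading the training file (the file name is the one distributed by Kaggle, the path is an assumption):

```python
# Load the monthly snapshots; inferSchema keeps the 24 product flags numeric.
df = spark.read.csv("data/train_ver2.csv", header=True, inferSchema=True)
```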

Selecting the training and comparison data set
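For the explicit variant, one monthly snapshot acts as the "rating" matrix and the previous one is kept for comparison (which products each customer already owns). The exact dates below are an assumption:

```python
# fecha_dato is the snapshot date column of the Kaggle file.
train_month = df.filter(F.col("fecha_dato") == "2016-05-28")
prev_month  = df.filter(F.col("fecha_dato") == "2016-04-28")
```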

We need to change the format of the input data into ratings of the form (userId, productId, rating).

Each of the 24 target products is mapped to an integer id, and for every (user, product) pair we associate a rating, as sketched below.
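A sketch of the reshaping step: the 24 wide product columns become (userId, itemId, rating) rows, with the position of the column used as the integer item id.

```python
# The 24 target columns are the ones named ind_..._ult1.
product_cols = [c for c in train_month.columns
                if c.startswith("ind_") and c.endswith("_ult1")]

ratings = None
for item_id, col_name in enumerate(product_cols):
    part = train_month.select(
        F.col("ncodpers").alias("userId"),
        F.lit(item_id).alias("itemId"),
        F.col(col_name).cast("float").alias("rating"))
    ratings = part if ratings is None else ratings.union(part)

# Missing product flags are treated as "not owned".
ratings = ratings.na.fill({"rating": 0.0})
```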

The test set contains, for each userId, the products not yet ‘rated’, i.e. the ones we may want to ‘recommend’.
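In the sketch above this is simply the (user, item) pairs that are still at zero:

```python
# Products not owned in the reference month: candidates for recommendation.
test = (ratings
        .filter(F.col("rating") == 0.0)
        .select("userId", "itemId"))
```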

Saving the model to disk
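The fit and the save step might look like this; rank, iterations and regularization are placeholders, and note that, as a reader points out in the comments, the original run apparently had implicitPrefs=True even in this "explicit" variant:

```python
# Fit ALS on the (userId, itemId, rating) pairs of the reference month.
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1,
          implicitPrefs=False)  # the original run apparently used True here
model = als.fit(ratings)

# Persist the fitted model so it can be reloaded later without re-training.
model.write().overwrite().save("als_explicit_model")
```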

Determining which products user userId has not already bought or rated, so that we can make new product recommendations.
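Scoring the not-yet-owned pairs and collecting the (product id, predicted score) pairs per customer gives a table like the one below (ncodpers is the customer id); the aggregation here is a sketch:

```python
# Score the candidate pairs and collect (itemId, prediction) per customer.
scored = model.transform(test)

per_user = (scored
            .withColumnRenamed("userId", "ncodpers")
            .select("ncodpers",
                    F.struct("itemId", "prediction").alias("item_score"))
            .groupBy("ncodpers")
            .agg(F.collect_list("item_score").alias("itemCol")))

per_user.show(10)
```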

   ncodpers  itemCol
0  15889     [(12, 0.0619866065681), (22, 0.0), (1, 0.00089…
1  15890     [(22, 0.0), (1, 0.000633358489722), (13, 0.025…
2  15892     [(22, 0.0), (1, 0.00121831032448), (13, 0.0379…
3  15893     [(12, 0.0262373052537), (22, 0.0), (1, 0.00067…
4  15894     [(22, 0.0), (1, 0.00124242738821), (13, 0.0272…
5  15895     [(22, 0.0), (1, 0.00119186483789), (13, 0.0441…
6  15896     [(22, 0.0), (1, 1.03697550458e-06), (13, 0.000…
7  15897     [(22, 0.0), (1, 0.00129512546118), (6, 0.00380…
8  15898     [(12, nan), (22, nan), (1, nan), (13, nan), (6…
9  15899     [(12, 0.0511704608798), (22, 0.0), (1, 0.00084…

Doing the prediction on the not-yet-rated products and preparing the submission file
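A sketch of building the submission from the per-user scores: for each customer, keep the top 7 products they do not own yet, map the item index back to the product column name, and write the ncodpers / added_products file expected by Kaggle. The helper below and the output path are illustrative, not the original code:

```python
from pyspark.sql.types import StringType

def top_products(item_scores, k=7):
    # item_scores is the collected list of (itemId, prediction) structs;
    # rank by predicted score and map indices back to product column names.
    ranked = sorted(item_scores, key=lambda x: x[1] or 0.0, reverse=True)
    return " ".join(product_cols[i] for i, _ in ranked[:k])

top_products_udf = F.udf(top_products, StringType())

submission = per_user.select(
    "ncodpers",
    top_products_udf("itemCol").alias("added_products"))

submission.coalesce(1).write.csv("submission", header=True, mode="overwrite")
```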

The public leaderboard score is 0.0162316 :(

Considering implicit preferences

I also tried a different setup for the problem, using the full data set over the entire period. The idea was to use the number (or the mean) of times a user held each product in the past as an implicit preference. This can be seen as an implicit rating.

To transform the 17 months of data we had into a recommendation-like problem, I built each user's implicit preference by summing (or averaging) every product they acquired during the entire period.

The number of times a user bought a product is an implicit positive rating of the product itself, and this can be used to make recommendations to other users.

Reading full data set

We need to change the format of the input data into ratings of the form (userId, productId, rating), as sketched below.
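For the implicit variant, the whole history is read and the product flags are aggregated per customer, so the "rating" becomes how often (here, on average) the customer held each product over the 17 months. This reuses the product_cols list and the column-name assumptions from the explicit sketch:

```python
# Read the full history and aggregate each product flag per customer.
full = spark.read.csv("data/train_ver2.csv", header=True, inferSchema=True)

implicit = None
for item_id, col_name in enumerate(product_cols):
    part = (full
            .groupBy("ncodpers")
            .agg(F.mean(F.col(col_name).cast("float")).alias("rating"))
            .withColumnRenamed("ncodpers", "userId")
            .withColumn("itemId", F.lit(item_id))
            .select("userId", "itemId", "rating"))
    implicit = part if implicit is None else implicit.union(part)

implicit = implicit.na.fill({"rating": 0.0})
```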

Rating

To treat this problem as implicit-feedback training, I used the mean of each product flag for the user over the 1.5 years as the rating for that product, as sketched below.
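The ALS call is the same as before, but with implicitPrefs=True so that the aggregated means are treated as confidence values rather than as explicit ratings to reproduce (alpha and the other hyperparameters are placeholders):

```python
# ALS in implicit-feedback mode: ratings act as confidence, not target values.
als_implicit = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
                   rank=10, maxIter=10, regParam=0.1,
                   implicitPrefs=True, alpha=10.0)
implicit_model = als_implicit.fit(implicit)
```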

The leaderboard score is very, very disappointing: only 0.003522! 🙁

Useful links

Here is a list of blog posts I found very useful for understanding the world of recommendation and collaborative filtering.

Comments

Jenny:

Elena:

In your code under “ALS with explicit preferences”, I see that implicitPrefs=True (in the first ALS fit). Is that a copy-and-paste error, or was it really run like that? If so, then that ALS was implicit, not explicit.