
I am using the ALS model from spark.ml to create a recommender system using implicit feedback for a certain collection of items. I have noticed that the output predictions of the model are much lower than 1 and they usually fall in the interval [0, 0.1]. Thus, using MAE or MSE does not make any sense in this case.

Therefore I use the areaUnderROC metric (AUC) to measure performance. I do that with Spark's BinaryClassificationEvaluator and I get something close to 0.8. But I cannot clearly understand how that is possible, since most of the predictions lie in [0, 0.1].

To my understanding, past a certain threshold the evaluator would consider all of the predictions as belonging to class 0, which would essentially mean that the AUC equals the percentage of negative samples?

In general, how would you treat such low prediction values if you need to compare your model's performance against, say, Logistic Regression?

I train the model as follows:

from pyspark.ml.recommendation import ALS

rank = 25
alpha = 1.0
numIterations = 10
# Implicit-feedback ALS on the (id, itemid, response) interactions
als = ALS(rank=rank, maxIter=numIterations, alpha=alpha,
          userCol="id", itemCol="itemid", ratingCol="response",
          implicitPrefs=True, nonnegative=True)
als.setRegParam(0.01)
model = als.fit(train)
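For reference, the AUC above is computed roughly like this (a sketch rather than my exact code; it assumes a held-out test DataFrame and that response already holds 0/1 implicit-feedback labels):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Score the held-out interactions; drop NaN predictions for unseen users/items
predictions = model.transform(test).na.drop()

evaluator = BinaryClassificationEvaluator(labelCol="response",
                                          rawPredictionCol="prediction",
                                          metricName="areaUnderROC")
print(evaluator.evaluate(predictions))  # ~0.8 in my case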
ml_0x
2 Answers


What @shuaiyuancn explained about BinaryClassificationEvaluator isn't completely correct. Obviously, using that kind of evaluator is wrong if you don't have binary ratings and a proper threshold.

So you can indeed consider a recommender system as a binary classifier when it deals with binary ratings (click-or-not, like-or-not).

In this case, the recommender defines a logistic model, where we assume that the rating (-1, 1) that user u gives to item v is generated from a logistic response model:

y_{uv} \sim Bernoulli((1 + exp[-score_{uv}])^{-1})

where score_{uv} is the score given by user u to item v.
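To make the link concrete, here is a tiny illustration (my own sketch, not any Spark API) of how a raw ALS score turns into a Bernoulli probability:

import math

def logistic_probability(score_uv):
    # P(y_uv = 1) = 1 / (1 + exp(-score_uv)), the Bernoulli parameter above
    return 1.0 / (1.0 + math.exp(-score_uv))

print(logistic_probability(0.05))  # a raw score in [0, 0.1] gives a probability just above 0.5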

For more information about logistic models, you can refer to Hastie et al. (2009), Section 4.4.

That said, a recommender system can also be treated as a multi-class classification problem; depending on your data and the problem at hand, it can even follow some kind of regression model.

Sometimes I choose to evaluate my recommender system using RegressionMetrics, even though textbooks recommend RankingMetrics-style evaluations that compute metrics such as precision at k or mean average precision (MAP). It always depends on the task and data at hand; there is no general recipe for that.
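For instance, a ranking-style evaluation with RankingMetrics could look roughly like this (a sketch only: recommendForAllUsers needs Spark 2.2+, and k, test and the column names from the question are assumptions):

from pyspark.sql import functions as F
from pyspark.mllib.evaluation import RankingMetrics

k = 10

# Top-k recommended items per user
recs = (model.recommendForAllUsers(k)
             .select("id", F.col("recommendations.itemid").alias("pred_items")))

# Ground truth: items each user actually interacted with in the test set
truth = (test.where(F.col("response") > 0)
             .groupBy("id")
             .agg(F.collect_list("itemid").alias("true_items")))

pred_and_labels = (recs.join(truth, "id")
                       .rdd
                       .map(lambda row: (row.pred_items, row.true_items)))

metrics = RankingMetrics(pred_and_labels)
print(metrics.precisionAt(k), metrics.meanAveragePrecision)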

Nevertheless, I strongly advise you to read the official Evaluation Metrics documentation. It will help you better understand what you are trying to measure with respect to what you are trying to achieve.

References

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed., Section 4.4.

EDIT: I ran into this answer today. It's an example implementation of a binary ALS in Python. I strongly advise you to take a look at it.

eliasah

Using BinaryClassificationEvaluator on a recommender is wrong. Usually a recommender selects one or a few items from a collection as its prediction, but BinaryClassificationEvaluator only deals with two labels, hence "Binary" in the name.

The reason you still get a result from BinaryClassificationEvaluator is that there is a prediction column in your result DataFrame, which is then used to compute the ROC. The number doesn't mean anything in your case; don't take it as a measurement of your model's performance.

I have noticed that the output predictions of the model are much lower than 1 and they usually fall in the interval [0, 0.1]. Thus, using MAE or MSE does not make any sense in this case.

Why doesn't MSE make sense? You are evaluating your model by looking at the difference (error) between the predicted rating and the true rating; [0, 0.1] simply means your model predicts the rating to be in that range.
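For example, MSE on the ALS output can be computed directly (a sketch, assuming a held-out test DataFrame and the column names from the question):

from pyspark.ml.evaluation import RegressionEvaluator

# Drop NaN predictions produced for users/items unseen during training
predictions = model.transform(test).na.drop()

evaluator = RegressionEvaluator(metricName="mse",
                                labelCol="response",
                                predictionCol="prediction")
print(evaluator.evaluate(predictions))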

shuaiyuancn
  • In that case, it essentially means that ALS was not able to capture any patterns in the data. I would expect results in the range [0, 1], but I only get very low values (< 0.1). Thus, the error is going to be very high for positive samples. – ml_0x Jun 25 '16 at 11:19
  • It is the trained model that doesn't make sense, not the metrics :) – shuaiyuancn Jun 27 '16 at 08:56
  • Yes, you are right. I don't want to cause any misinterpretations. I chose to use a different metric because of the model's results, which, yes, don't seem to make much sense. – ml_0x Jun 27 '16 at 13:36