
I have a Spark DataFrame of UserID, ItemID, and Rating. I am building a recommender system.

The data looks like this:

originalDF.show(5)
+----+----+------+
|user|item|rating|
+----+----+------+
| 353|   0|     1|
| 353|   1|     1|
| 353|   2|     1|
| 354|   3|     1|
| 354|   4|     1|
+----+----+------+

It has 56K unique users and 8.5K unique items.

As you can see, each UserID has a record (row) for each Item with the corresponding rating, so there are multiple records per user ID.

Now I split this into train, validation and test with a random 0.6/0.2/0.2 split, so 60% of the records (chosen at random) go to training, 20% to validation and the remaining 20% to test, as below:

def train_test_split(df, split_perc):
    # randomSplit assigns each row to one of the splits at random
    random_split = df.randomSplit(split_perc, seed=20)
    return random_split[0], random_split[1], random_split[2]

This leaves me with the following dataset counts:

train, validation, test = train_test_split(split_sdf, [0.6, 0.2, 0.2])

print "Training size is {}".format(train.count())
print "Validation size is {}".format(validation.count())
print "Test size is {}".format(test.count())
print "Original Dataset Size is {}".format(split_sdf.count())

Training size is 179950
Validation size is 59828
Test size is 60223
Original Dataset Size is 300001

Now I train Spark's pyspark.ml ALS algorithm on the training data:

from pyspark.ml.recommendation import ALS

als = ALS(rank=120, maxIter=15, regParam=0.01, implicitPrefs=True)
model = als.fit(train)

When I check the userFactors and itemFactors from the model object I get this:

itemF = model.itemFactors
itemF.toPandas().shape
Out[111]: (7686, 2)

userF = model.userFactors
userF.toPandas().shape
Out[113]: (47176, 2)

This means the model only gives me factor matrices covering the unique users and items present in the training data.

Now how do I get predictions for all the items for each user?

If I do

prediction=model.transform(originalDF)

where originalDF is the whole dataset that was split into train, validation and test, would that give predictions for all items for each user?
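
For reference, this is roughly how I check what transform gives back (a sketch; the isnan filter and the assumption that pairs with unseen users/items come back as NaN predictions are mine, based on how my Spark version behaves):

from pyspark.sql.functions import isnan, col

# transform only scores the (user, item) pairs that actually appear in originalDF
prediction = model.transform(originalDF)

# pairs whose user or item never appeared in `train` get a NaN prediction,
# so this count shows how much of the original data the model cannot score
prediction.filter(isnan(col("prediction"))).count()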

My question is: my dataset has 56K users x 8.5K items, so I want a prediction matrix for the full 56K x 8.5K, not just for the 47K x 7.6K covered by the training data.

What am I doing wrong here? I understand that the model is fit only on the 47K x 7.6K training data instead of the original 56K x 8.5K ratings data. So am I splitting the data into train/validation wrong?

I know that for a recommender system one should randomly mask some ratings for some items for each user, use the remaining ratings for training, and test on the masked values. I did the same here, since each record for a user is a rating for a different item: when we split randomly, we are essentially masking some of a user's ratings and not using them for training.

Please advise.

Edit:

In a typical recommender system with a user x item matrix (56K users x 8.5K items):

We basically mask (set to 0) some random item ratings for each user. Then this whole matrix is passed to the recommender algorithm, which factorizes it into a product of two factor matrices.

However, in Spark we don't use a user x item matrix. Instead of having 8.5K item columns, each item rating is an individual row for each user.

So masking (setting some item ratings to 0) in the original user-item matrix is the same as not using some random rows for each user in the Spark DataFrame. Right?

Here is one way I found (which is what I used too) to split the data into train and validation:

training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0L)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))

I used a similar randomSplit here too, so I am not sure what is wrong.

I can understand that since the training data does not have all users and items, the factor matrices will only cover the users and items that appear in it. So how do I overcome that? In the end I need a matrix of predictions for all users and items.

Baktaawar
  • Possible duplicate of [Spark ALS predictAll returns empty](http://stackoverflow.com/questions/37379751/spark-als-predictall-returns-empty) –  Nov 08 '16 at 20:04
  • What is the issue here? Yes, you can use your model trained on `train` to predict `test`, or the `originalDF`, but the latter won't give a good estimate of model performance. – mtoto Nov 08 '16 at 20:06

1 Answer


All ids of:

  • users
  • products

for which you want predictions have to be present in the training set. A random split is not a method that can ensure that (it is not equivalent to data masking).
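
For example, a minimal sketch of stratified sampling by user with DataFrame.sampleBy (the 80/20 fractions, the column names, and reusing originalDF are assumptions carried over from the question; note that sampleBy samples each row independently, so a user with very few ratings can still lose all of them):

# one sampling fraction per user, so (almost) every user keeps ~80% of their ratings for training
fractions = {row["user"]: 0.8
             for row in originalDF.select("user").distinct().collect()}

train = originalDF.sampleBy("user", fractions, seed=20)   # stratified by user
holdout = originalDF.subtract(train)                      # remaining ~20% for validation/test (assumes unique rows)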

  • Then what is the method? In a recommender system, one has to mask some ratings for each user, which means we are basically not using some of those ratings in the training data. This is what I did using randomSplit. What do you think is a better alternative? – Baktaawar Nov 08 '16 at 20:33
  • You may have better luck using stratified sampling by user (if predictions for all users are more important) or by product (if predictions for all products are more important). You can also add missing values after the first sampling. –  Nov 08 '16 at 20:40
  • I'm afraid the technique you describe in your first comment doesn't apply well to the implementation provided by Spark. – eliasah Nov 08 '16 at 21:17
  • Which technique? I am using ALS from Spark, so that implementation is already there. – Baktaawar Nov 08 '16 at 21:40
  • The technique about masking user ratings. The issue that comes up when you split is that all the users and all the items that need to be learned have to be in the training set; thus stratified sampling is the way to go. If a certain user X or item Y isn't learned, ALS considers them a new user or item, and this falls under the cold-start problem. – eliasah Nov 09 '16 at 07:55
  • Yeah, I understand your point about having all users and items in the training set; that is what is happening here and why I am seeing fewer user/item factors. So you suggest using stratified sampling by users? – Baktaawar Nov 10 '16 at 04:50
  • @eliasah I am not sure how stratified sampling will help. Instead of just taking a random sample of records for train, test etc., if I do stratified sampling (let's say by user), then for each user stratum we would get a sample of item ids and ratings based on their proportion in the whole dataset. But even that doesn't keep all ratings for each user, so some ratings would go to validation and test and the same problem would remain: those ratings were not present in the training data. Right? Or am I missing something? – Baktaawar Nov 10 '16 at 21:23
  • You don't need __all ratings for each user__. You need __some ratings__ for __each user__ and __each item__. If an item or user hasn't been seen in the training set, ALS cannot tell you anything about it. –  Nov 11 '16 at 00:39
  • OK, I got that. But one thing I am not able to grasp completely is how stratified sampling helps. If we do stratified sampling by user, I am not sure how it ensures the point above? – Baktaawar Nov 11 '16 at 21:11
  • Can someone suggest how to do the stratified sampling here? I am not sure I understand the usage here. – Baktaawar Nov 14 '16 at 18:43
  • @Baktaawar Did you ever find a way to mask data to properly split for training and validation? I am struggling with the same issue. I've attempted to use the function [sampleBy](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html), but since it does not truly mask data, I am still getting AUC = 1.0. – Archimeow Dec 09 '16 at 00:04
  • @JMeo In my opinion AUC would be 1.0 if your data does not have any zero ratings. If all it has is 1 and above, then I don't think you would have any false positives. I would recommend getting all combinations of user, item. If you check your predictions matrix, there would be a few null values too. That's because the data doesn't have all combinations of user, item. For items a user has not seen/rated, put those in as zeros. But if your data has 10 unique users and 5 unique items, then you should have a total of 10*5 = 50 rows in the RDD, 5 for each user. – Baktaawar Dec 09 '16 at 18:27
  • @Baktaawar Thank you for your response. Do you know if adding in zeros for unviewed/unclicked items is something pyspark.mllib.recommendation.train() does, or is this something I must do before I train the model? I've looked through the Scala source code, but it is unclear to me if this is happening. – Archimeow Dec 14 '16 at 17:52
  • @Jmeo. No they don't. That is why you get AUC 1.0. Also if your matrix does'nt have all user-item combination your prediction matrix will have some null values corresponding to item/user combination it has not seen. So that is why you need to make sure you have all user and item combination as Row in your data frame. This is unique to Spark since in original algo this problem is not there as you have a wide matrix (user-item). In spark you need a tuple of (userid,itemid,rating). So if you data frame has 100 unique users and 50 unique items, you RDD/DF would be 100*50=50000 rows. – Baktaawar Dec 14 '16 at 18:42