I have a Spark DataFrame of UserID, ItemID, Rating, and I am building a recommender system.
The data looks like this:
originalDF.show(5)
+----+----+------+
|user|item|rating|
+----+----+------+
| 353| 0| 1|
| 353| 1| 1|
| 353| 2| 1|
| 354| 3| 1|
| 354| 4| 1|
+----+----+------+
It has 56K unique users and 8.5K unique items.
So each UserID has a record (row) per item with the corresponding rating, i.e. multiple records per user id.
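As a quick sanity check on those counts (just a sketch using the same DataFrame; I am assuming the column names shown above):

n_users = originalDF.select("user").distinct().count()   # ~56K unique users
n_items = originalDF.select("item").distinct().count()   # ~8.5K unique items
print("users: {}, items: {}".format(n_users, n_items))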
Now I split this into train, val and test with a random 0.6/0.2/0.2 split, so 60% of the records (chosen at random) go to training, 20% to validation and the remaining 20% to test, as below:
def train_test_split(df, split_perc):
    random_split = df.randomSplit(split_perc, seed=20)
    return random_split[0], random_split[1], random_split[2]
This leaves me with following dataset counts
train, validation, test = train_test_split(split_sdf, [0.6, 0.2, 0.2])
print("Training size is {}".format(train.count()))
print("Validation size is {}".format(validation.count()))
print("Test size is {}".format(test.count()))
print("Original Dataset Size is {}".format(split_sdf.count()))
Training size is 179950
Validation size is 59828
Test size is 60223
Original Dataset Size is 300001
Now I train Spark's pyspark.ml ALS algorithm on the training data.
from pyspark.ml.recommendation import ALS

als = ALS(rank=120, maxIter=15, regParam=0.01, implicitPrefs=True)
model = als.fit(train)
When I check the userFactors and itemFactors from the model object I get this:
itemF = model.itemFactors
itemF.toPandas().shape
# (7686, 2)

userF = model.userFactors
userF.toPandas().shape
# (47176, 2)
This means the model only gives me factor matrices for the users and items that actually appear in the training data (about 47K users and 7.6K items), not for all 56K users and 8.5K items.
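A small check that seems to confirm this (a sketch; I am just comparing distinct ids in the training split against the factor shapes):

print(train.select("user").distinct().count())   # expect ~47176, same as userFactors
print(train.select("item").distinct().count())   # expect ~7686, same as itemFactors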
Now how do I get predictions for all items for each user?
If I do
prediction = model.transform(originalDF)
where originalDF is the whole dataset that was split into train, val and test, would that give a prediction for every item for every user?
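For reference, this is what I would try; the NaN part is my assumption about how Spark handles ids that were not in the training split, not something I have verified:

from pyspark.sql import functions as F

prediction = model.transform(originalDF)
# Assumption: transform only scores the (user, item) pairs already present in
# originalDF, and pairs whose user or item never appeared in train come back as NaN,
# so this alone would not be a full 56K x 8.5K matrix.
prediction.filter(F.isnan("prediction")).count()   # rows the model could not score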
My question is: my dataset has 56K users x 8.5K items, and I want a prediction matrix for the full 56K x 8.5K, not just the 47K x 7.6K covered by the training data.
What am I doing wrong here? I understand the model only works with the 47K x 7.6K training data instead of the original 56K x 8.5K ratings data. So am I splitting the data into train/val wrong?
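To make it concrete, this is roughly the full prediction matrix I am after (a sketch only; I have not run this end to end, and it assumes a factor vector exists for every user and item):

all_users = originalDF.select("user").distinct()
all_items = originalDF.select("item").distinct()
all_pairs = all_users.crossJoin(all_items)        # 56K x 8.5K candidate (user, item) pairs
full_predictions = model.transform(all_pairs)     # only meaningful if factors exist for every id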
I know that for a recommender system one should randomly mask some item ratings for each user, train on the remaining ratings and test on the masked ones. I did the same here, since each record for a user is a rating for a different item: when we split randomly we are essentially masking some of the ratings for a user and not using them for training.
Please advise.
Edit:
In a typical recommender system with a user x item matrix (56K users x 8.5K items),
we basically mask (set to 0) some random item ratings for each user. Then this whole matrix is passed to the recommender algorithm, which breaks it into a product of two factor matrices.
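Just to spell out what I mean by "a product of two factor matrices" (a stand-in sketch with random arrays, not the actual ALS output):

import numpy as np

# With rank=120, ALS learns user factors U (n_users x 120) and item factors
# V (n_items x 120); every predicted rating is the dot product of U[u] and V[i].
U = np.random.rand(56000, 120)   # stand-in for model.userFactors
V = np.random.rand(8500, 120)    # stand-in for model.itemFactors
R_hat = U.dot(V.T)               # full 56K x 8.5K prediction matrix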
However, in Spark we don't use a user x item matrix. Instead of having 8.5K item columns, each item rating is an individual row for each user.
So masking (setting some item ratings to 0) in the original user-item matrix is the same as not using some random rows per user in the Spark DataFrame. Right?
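Here is the equivalence I am describing as a sketch (the rand-based filter is hypothetical and only for illustration; it is not the split I actually used):

from pyspark.sql import functions as F

# Dropping a random ~20% of rows in the long format is the same as zeroing/masking
# a random ~20% of the entries in the 56K x 8.5K user-item matrix.
masked_train = (originalDF
                .withColumn("r", F.rand(seed=20))
                .filter(F.col("r") >= 0.2)
                .drop("r"))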
Here is one way I found (which is what I used too) to split the data into train and val:
training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))
I used a similar randomSplit approach here too, so I am not sure what is wrong.
I can understand that since the training data does not contain all users and items, the factor matrices will also only have factors for those users and items. So how do I overcome that? In the end I basically need a matrix of predictions for all users and all items.