I want to use ALS (Alternating Least Squares matrix factorization) to get predictions from training data. My understanding was that ALS in the `mllib` and `ml` packages does the same job, so that with the same training data and test data both would produce the same output.
However, maybe I am wrong. Consider the following code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sparkC = SparkContext()
sqlC = SQLContext(sparkC)
trainData = sparkC.textFile("Data/trainData.txt").map(lambda line: line.split("\t"))
testData = sparkC.textFile("Data/testData.txt").map(lambda line: line.split("\t"))
print(testData.count()) # output1
# --------- using the ml package ---------
from pyspark.ml.recommendation import ALS
als = ALS(rank=10, maxIter=20)
model = als.fit(trainDataFrame)
predTestData = model.transform(testDataFrame)
print(predTestData.count()) #### output2
#----------------------------------------------
# --------- using the mllib package ---------
from pyspark.mllib.recommendation import ALS
model = ALS.train(trainData, 10, seed=3, iterations=20)
predTestData = model.predictAll(testData).\
map(lambda r: (r.user, r.product, r.rating))
print(predTestData.count()) #### output3
In the code above, the training data and test data are the same for `ml` and `mllib`, yet the output is different. The number of predictions should equal the number of test rows. In my case output1 = output2, which is fine, but output3 < output1, which means some predictions disappear!
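One check that might narrow this down, sketched here under an assumption that is not confirmed anywhere above: the missing rows could be test pairs whose user or product never appears in the training data. The helper name `find_cold_start_pairs` and the toy rows are hypothetical; the rows only mimic the (user, product, rating) layout of the tab-separated files.

```python
def find_cold_start_pairs(train_rows, test_rows):
    """Return test (user, product) pairs whose user or product is absent from training."""
    train_users = {user for user, _product, _rating in train_rows}
    train_products = {product for _user, product, _rating in train_rows}
    return [
        (user, product)
        for user, product, _rating in test_rows
        if user not in train_users or product not in train_products
    ]


# Hypothetical toy data, kept as strings like the output of line.split("\t").
train_rows = [("1", "10", "5.0"), ("1", "11", "3.0"), ("2", "10", "4.0")]
test_rows = [
    ("1", "10", "5.0"),  # user and product both seen in training
    ("2", "11", "2.0"),  # both seen, although not together
    ("3", "10", "1.0"),  # user "3" never appears in training
    ("2", "12", "4.0"),  # product "12" never appears in training
]

missing = find_cold_start_pairs(train_rows, test_rows)
print(len(test_rows) - len(missing), "test rows have both IDs in training")
print(missing)
```

If the number of such pairs in the real data equals output1 - output3, that would suggest `predictAll` in `mllib` silently drops them, while `transform` in `ml` keeps one row per test pair (possibly with a NaN prediction). For data small enough to fit in memory, the same function could be applied to `trainData.collect()` and `testData.collect()`.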
What causes this? Or is ALS in `ml` different from ALS in `mllib`?