I have the following Python test code (the arguments to ALS.train
are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
Which works, because it has a count of 1 against the predictions variable and outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try and use an RDD
I created myself using the following code, it doesn't appear to work anymore:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
As you can see, predictAll
comes back empty when passed the mapped RDD
. The values going in are both of the same format. The only noticeable difference that I can see is that the first example uses parallelize and produces a ParallelCollectionRDD
whereas the second example just uses a map which produces a PythonRDD
. Does predictAll
only work if passed a certain type of RDD
? If so, is it possible to convert between RDD
types? I'm not sure how to get this working.