
I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel), which can then be used for model evaluation. What is the best way to achieve this? Thanks.

Assuming parsedTrainData is an RDD of LabeledPoint:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0,-12.0,23.0]), 
                                  LabeledPoint(3.0, [-1.0,12.0,-23.0])])

model = DecisionTree.trainClassifier(parsedTrainData, numClasses=7,
                                     categoricalFeaturesInfo={}, impurity='gini',
                                     maxDepth=8, maxBins=32)

model.predict(parsedTrainData.map(lambda x: x.features)).take(1)

This gives back the predictions, but I am not sure how to match each prediction back to the original label in the data.

I tried

parsedTrainData.map(lambda x: (x.label, model.predict(x.features))).take(1)

however, it seems that my way of sending the model to the workers is not valid here. Presumably this is because PySpark's DecisionTreeModel calls into the JVM through the SparkContext, which can only be used on the driver:

/spark140/python/pyspark/context.pyc in __getnewargs__(self)
    250         # This method is called when attempting to pickle SparkContext, which is always an error:
    251         raise Exception(
--> 252             "It appears that you are attempting to reference SparkContext from a broadcast "
    253             "variable, action, or transforamtion. SparkContext can only be used on the driver, "
    254             "not in code that it run on workers. For more information, see SPARK-5063."

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. 

1 Answer

Well, according to the official documentation, you can simply zip predictions and labels like this:

# predict on the whole feature RDD on the driver side, then pair each
# original label with its prediction element-wise
predictions = model.predict(parsedTrainData.map(lambda x: x.features))
labelsAndPredictions = parsedTrainData.map(lambda x: x.label).zip(predictions)
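
To close the loop on the evaluation goal from the question, here is a minimal sketch (reusing parsedTrainData and labelsAndPredictions from above) that turns the zipped tuples into a training error rate, mirroring the usual MLlib documentation pattern:

# fraction of examples whose original label and prediction disagree
trainErr = (labelsAndPredictions
            .filter(lambda lp: lp[0] != lp[1])
            .count() / float(parsedTrainData.count()))
print('Training error = ' + str(trainErr))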
  • Thanks a lot, zero! I missed it. It works great on my data. I was concerned that the distributed data would come back in random order and wouldn't match up, but from the great resource you pointed out, it seems that Spark handles it. Maybe because it sends all the transformations to the worker nodes at once when an action triggers computation, so the data zips up correctly on each worker node? – Bin Jul 28 '15 at 17:41
  • Well, since it is a map-only operation and the data doesn't grow, there is no need for shuffling and the order should be preserved (see the zip sketch after these comments). Still, I have to admit I am a little bit confused. The Scala version of `DecisionTreeModel` can handle the equivalent of `parsedTrainData.map(lambda x: (x.label, model.predict(x.features)))`, so it seems there is something different in the execution strategy, but I couldn't pinpoint the exact problem in the PySpark source yet. – zero323 Jul 28 '15 at 17:47
  • @Bin If you're interested, I posted some explanation here and a follow-up question here: http://stackoverflow.com/q/31684842/1560062 – zero323 Jul 28 '15 at 18:56
  • Cool! Thanks for continuing to research this. We will get a better understanding of this matter through your new question. I will keep tracking your new post. – Bin Jul 28 '15 at 19:14
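
A side note on the order-preservation question in the comments: RDD.zip pairs elements partition by partition, so it only lines up when both RDDs have the same number of partitions and the same number of elements per partition. That holds here because map is a narrow transformation that preserves both. A toy sketch with made-up data, not taken from the question:

a = sc.parallelize([1, 2, 3, 4], 2)
b = a.map(lambda x: x * 10)   # narrow transformation: partitioning is preserved
a.zip(b).collect()            # [(1, 10), (2, 20), (3, 30), (4, 40)]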