
I am trying to make predictions with the model that I got back from MLlib on Spark. The goal is to generate tuples of (originalLabelInData, predictedLabel), which can then be used for model evaluation. What is the best way to achieve this? Thanks.

Assuming parsedTrainData is an RDD of LabeledPoint:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

parsedTrainData = sc.parallelize([LabeledPoint(1.0, [11.0,-12.0,23.0]), 
                                  LabeledPoint(3.0, [-1.0,12.0,-23.0])])

model = DecisionTree.trainClassifier(parsedTrainData, numClasses=7,
                                     categoricalFeaturesInfo={}, impurity='gini',
                                     maxDepth=8, maxBins=32)

model.predict(parsedTrainData.map(lambda x: x.features)).take(1)

This gives back the predictions, but I am not sure how to match each prediction back to the original label in the data.

I tried

parsedTrainData.map(lambda x: (x.label, model.predict(x.features))).take(1)

however, it seems that my way of sending the model to the workers is not valid here. Presumably this is because PySpark's DecisionTreeModel calls into the JVM through the SparkContext, which can only be used on the driver:

/spark140/python/pyspark/context.pyc in __getnewargs__(self)
    250         # This method is called when attempting to pickle SparkContext, which is always an error:
    251         raise Exception(
--> 252             "It appears that you are attempting to reference SparkContext from a broadcast "
    253             "variable, action, or transforamtion. SparkContext can only be used on the driver, "
    254             "not in code that it run on workers. For more information, see SPARK-5063."

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. 

1 Answer

Well, according to the official documentation, you can simply zip predictions and labels like this:

# predict on the whole feature RDD on the driver side, then pair each
# original label with its prediction element-wise
predictions = model.predict(parsedTrainData.map(lambda x: x.features))
labelsAndPredictions = parsedTrainData.map(lambda x: x.label).zip(predictions)
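
To close the loop on the evaluation goal from the question, here is a minimal sketch (reusing parsedTrainData and labelsAndPredictions from above) that turns the zipped tuples into a training error rate, mirroring the usual MLlib documentation pattern:

# fraction of examples whose original label and prediction disagree
trainErr = (labelsAndPredictions
            .filter(lambda lp: lp[0] != lp[1])
            .count() / float(parsedTrainData.count()))
print('Training error = ' + str(trainErr))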
  • Thanks a lot, zero! I missed it. It works great on my data. I was concerned that the distributed data would come back in random order and wouldn't match up, but from the great resource you pointed out, it seems that Spark handles it. Maybe because it sends all the transformations to the worker nodes at once when an action triggers computation, so the data zips up correctly on each worker node? – Bin Jul 28 '15 at 17:41
  • Well, since it is a map-only operation and the data doesn't grow, there is no need for shuffling and the order should be preserved (see the zip sketch after these comments). Still, I have to admit I am a little bit confused. The Scala version of `DecisionTreeModel` can handle the equivalent of `parsedTrainData.map(lambda x: (x.label, model.predict(x.features)))`, so it seems there is something different in the execution strategy, but I couldn't pinpoint the exact problem in the PySpark source yet. – zero323 Jul 28 '15 at 17:47
  • @Bin If you're interested, I posted some explanation here and a follow-up question here: http://stackoverflow.com/q/31684842/1560062 – zero323 Jul 28 '15 at 18:56
  • Cool! Thanks for continuing to research this. We will get a better understanding of this matter through your new question. I will keep tracking your new post. – Bin Jul 28 '15 at 19:14
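
A side note on the order-preservation question in the comments: RDD.zip pairs elements partition by partition, so it only lines up when both RDDs have the same number of partitions and the same number of elements per partition. That holds here because map is a narrow transformation that preserves both. A toy sketch with made-up data, not taken from the question:

a = sc.parallelize([1, 2, 3, 4], 2)
b = a.map(lambda x: x * 10)   # narrow transformation: partitioning is preserved
a.zip(b).collect()            # [(1, 10), (2, 20), (3, 30), (4, 40)]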