I'm trying to understand how (and if) the piece of code below works. In particular, I don't understand why this code assumes, maybe correctly, that the order of elements in the RDD is preserved across the mappings. This is in essence an example of the same question asked here: "Mind blown: RDD.zip() method". I don't understand why or how the last line guarantees that zip pairs each prediction with the corresponding label from the testData RDD. One of the comments mentions that if the RDD, testData in this case, is ordered in some way, then map will preserve that order. However, predictions is an entirely different RDD. I can't see how or why this works!

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = labeledDataRDD.randomSplit([0.7, 0.3])
# Train a RandomForest model
model = RandomForest.trainClassifier(trainingData, numClasses=2510,
                     categoricalFeaturesInfo={}, numTrees=100,
                     featureSubsetStrategy="auto",
                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
  • It works because, ignoring building the classifier, every transformation applied here is expressed using `mapPartitionsWithIndex`. It is a local operation that requires no shuffling, so partition membership is preserved, and it applies the function over an Iterator, so the order of values within each partition is preserved as well. At a higher level, the only transformation applied is map, which preserves order by contract. – zero323 Oct 26 '15 at 20:12
  • @zero323, it does make sense that model.predict requires no shuffle, since to make a prediction you only need the feature vector, but **predictions** is a separate RDD altogether, even if it is created from testData. Being new to Spark, I think I need to understand the details of how this works. Can you point to some reading material that would help me understand the details you mention, for example that map is implemented using `mapPartitionsWithIndex`? – Kai Oct 26 '15 at 21:23
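
The argument in zero323's comment can be tried out without a cluster. The following is only a toy model in plain Python, not Spark's implementation: an "RDD" is assumed to be a list of ordered partitions, and `map_partitions_with_index`, `rdd_map`, `rdd_zip` and `toy_predict` are hypothetical stand-ins written for this illustration. It shows why two RDDs derived from the same parent through map-like steps still line up, partition by partition and element by element, when zipped.

```python
# Toy model only: an "RDD" here is a list of partitions, each an ordered list.
# None of these helpers are Spark APIs; they are hypothetical stand-ins.

def map_partitions_with_index(parts, f):
    """Apply f(index, iterator) to each partition on its own (no shuffle)."""
    return [list(f(i, iter(p))) for i, p in enumerate(parts)]

def rdd_map(parts, g):
    """Element-wise map, expressed through map_partitions_with_index."""
    return map_partitions_with_index(parts, lambda _i, it: (g(x) for x in it))

def rdd_zip(left, right):
    """Pair partitions by position, then elements by position inside them."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for lp, rp in zip(left, right):
        if len(lp) != len(rp):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(lp, rp)))
    return zipped

def toy_predict(features):
    """Hypothetical model: depends only on the feature vector."""
    return 1.0 if sum(features) > 6.0 else 0.0

# (label, features) pairs spread unevenly over three partitions.
test_data = [
    [(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])],
    [(1.0, [5.0, 6.0])],
    [(0.0, [7.0, 8.0]), (0.0, [9.0, 1.0]), (1.0, [2.0, 2.0])],
]

# Same shape of pipeline as in the question.
features = rdd_map(test_data, lambda lp: lp[1])
predictions = rdd_map(features, toy_predict)
labels = rdd_map(test_data, lambda lp: lp[0])
labels_and_predictions = rdd_zip(labels, predictions)

for index, partition in enumerate(labels_and_predictions):
    print(index, partition)
```

In this model `predictions` is a new object, yet it has the same number of partitions as `test_data` and the same number of elements in each one, in the same order, because every step processed each partition in place. That is the condition the real `RDD.zip` documents as its assumption. The argument would no longer hold if one side went through a step that drops elements or redistributes them across partitions.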
