I'm trying to understand how (and if) the piece of code below works. In particular, what I don't understand is why this code assumes, maybe correctly, that the order of elements in an RDD is preserved after a mapping. This is in essence an example of the same question asked here: Mind blown: RDD.zip() method. I don't understand why or how the last line guarantees that the zip actually pairs each prediction with the corresponding label from the testData RDD. One of the comments mentions that if the RDD, testData in this case, is ordered in some way, then map will preserve that order. However, predictions is an entirely different RDD. I can't see how or why this works!
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
## Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = labeledDataRDD.randomSplit([0.7, 0.3])
## Train a RandomForest model
model = RandomForest.trainClassifier(trainingData, numClasses=2510,
                                     categoricalFeaturesInfo={}, numTrees=100,
                                     featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
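To make the assumption concrete, here is a toy model in plain Python (no Spark involved), offered as a rough sketch rather than as how Spark is implemented. It treats an "RDD" as a list of partitions and relies on two things: that map is applied element by element within each partition, and that zip, as the PySpark documentation for RDD.zip states, assumes both RDDs have the same number of partitions and the same number of elements in each partition. The names toy_map, toy_zip and toy_test_data are made up for illustration, and the stand-in for model.predict is assumed to be element-wise, which is exactly the part in question.

```python
# Toy model (plain Python, no Spark): an "RDD" is a list of partitions,
# and each partition is a list of elements.

def toy_map(partitions, fn):
    """Apply fn to every element, keeping partition boundaries and order."""
    return [[fn(x) for x in part] for part in partitions]

def toy_zip(left, right):
    """Pair elements partition by partition, position by position."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    pairs = []
    for lpart, rpart in zip(left, right):
        if len(lpart) != len(rpart):
            raise ValueError("different number of elements in a partition")
        pairs.append(list(zip(lpart, rpart)))
    return pairs

# Hypothetical test data: (label, features) tuples spread over two partitions.
toy_test_data = [[(0.0, [1, 2]), (1.0, [3, 4])], [(1.0, [5, 6])]]

# Both derived "RDDs" come from the same parent through element-wise maps.
toy_labels = toy_map(toy_test_data, lambda lp: lp[0])
# Stand-in for model.predict: any element-wise function of the features.
toy_predictions = toy_map(toy_test_data, lambda lp: float(sum(lp[1]) > 5))

toy_labels_and_predictions = toy_zip(toy_labels, toy_predictions)
```

In this toy version the pairing lines up only because both inputs to toy_zip were derived from the same parent without changing partition boundaries or element order. Whether the real predictions RDD keeps that property is what the question is asking.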