inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()
warningCount = warningsRDD.count()
In the code above none of the transformations are evaluated until the second to last line of code is executed where you count the number of objects in the badLinesRDD. So when this badLinesRDD.count()
is run it will compute the first four RDD's up till the union and return you the result. But when warningsRDD.count()
is run it will only compute the transformation RDD's until the top 3 lines and return you a result correct?
Also when these RDD transformations are computed when an action is called on them where are the objects from the last RDD transformation, which is union, stored? Does it get stored in memory on the each of the DataNodes where the filter transformation was run in parallel?