I am using Apache Spark to process a huge amount of data, and I need to execute many Spark actions on the same RDD. My code looks like the following:
val rdd = /* Get the rdd using the SparkContext */
val map1 = rdd.map(/* Some transformation */)
val map2 = map1.map(/* Some other transformation */)
map2.count()
val map3 = map2.map(/* More transformation */)
map3.count()
The problem is that calling the second action, map3.count(), forces the re-execution of the transformations rdd.map and map1.map.
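To rule out a mistake on my side, here is a minimal, self-contained repro I put together (the RecomputeDemo object, the local[*] master, and the sample data are just for illustration, not my real job), where an accumulator counts how often the body of the first map actually runs:

import org.apache.spark.{SparkConf, SparkContext}

object RecomputeDemo {
  def main(args: Array[String]): Unit = {
    // Local context just for this repro; the real job runs on a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("recompute-demo").setMaster("local[*]"))
    // Counts how many times the first transformation's body executes.
    val runs = sc.longAccumulator("map1 executions")

    val rdd  = sc.parallelize(1 to 1000)
    val map1 = rdd.map { x => runs.add(1L); x * 2 }
    val map2 = map1.map(_ + 1)

    map2.count() // first action: the map1 body runs 1000 times
    map2.count() // second action: it runs another 1000 times
    println(s"map1 body executed ${runs.value} times") // prints 2000, not 1000

    sc.stop()
  }
}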
What is going on here? I think the DAG built by Spark is responsible for this behaviour.
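From what I can tell from the programming guide, persisting the shared RDD should avoid the recomputation. Is something like the following the intended fix? This is an untested sketch continuing the code above; the StorageLevel choice and the identity placeholder are just examples:

import org.apache.spark.storage.StorageLevel

map2.persist(StorageLevel.MEMORY_AND_DISK) // or map2.cache() for MEMORY_ONLY
map2.count()                  // first action materializes map2 and stores its partitions
val map3 = map2.map(identity) // placeholder for my real third transformation
map3.count()                  // should read map2 from the cache instead of re-running rdd.map and map1.map
map2.unpersist()              // release the cached blocks once all downstream actions are done

Is persist/cache the right tool here, or is there some other way to tell Spark to reuse the intermediate result across actions?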