
I am using Apache Spark to process a huge amount of data. I need to execute many Spark actions on the same RDD. My code looks like the following:

val rdd = /* Get the rdd using the SparkContext */
val map1 = rdd.map(/* Some transformation */)
val map2 = map1.map(/* Some other transformation */)
map2.count
val map3 = map2.map(/* More transformation */)
map3.count

The problem is that calling the second action map3.count forces the re-execution of the transformations rdd.map and map1.map.

What is going on here? I think the DAG built by Spark is responsible for this behaviour.

    Do you have a minimal working example to reproduce that behavior? I tried something obvious with `println` in `map` to show which operation is being performed, but when I call `collect` on the second one, I only get the second, not the first also. – Robert Dodier Jan 14 '16 at 18:11
  • I have corrected my question, as it was not accurate. Tomorrow I will be able to give a working example. Thanks for your help. – riccardo.cardin Jan 14 '16 at 21:55

1 Answer


This is expected behavior. Unless one of the ancestors can be fetched from the cache (typically meaning it has been persisted, either explicitly or implicitly during a shuffle), every action recomputes the whole lineage.

Recomputation can also be triggered if the RDD has been persisted but the data has been lost / evicted from the cache, or if the amount of available space is too low to store all records.
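To make this concrete, here is a minimal sketch that traces the recomputation (assuming a local[*] master; in spark-shell reuse the existing sc instead of creating a new context, and the println marker is only there to show each evaluation):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val rdd = sc.parallelize(1 to 5)
// the println traces every evaluation of this transformation
val map1 = rdd.map { x => println(s"evaluating map1 on $x"); x * 2 }
val map2 = map1.map(_ + 1)

map2.count // prints "evaluating map1 on ..." five times
val map3 = map2.map(_ * 10)
map3.count // prints the same five lines again: the whole lineage is recomputed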

In this particular case you should cache in the following order

...
val map2 = map1.map(/* Some other transformation */)
map2.cache
map2.count
val map3 = map2.map(/* More transformation */)
...

if you want to avoid repeated evaluation of rdd, map1 and map2.
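If cached blocks risk being evicted (the caveat above), persisting with a storage level that spills to disk should generally be the safer option. A sketch of the same pipeline using persist instead of cache (cache is just shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of dropping them, so later actions read them back rather than
// recomputing the lineage
map2.persist(StorageLevel.MEMORY_AND_DISK)
map2.count
val map3 = map2.map(/* More transformation */)
map3.count
map2.unpersist() // release the storage once map2 is no longer needed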

  • I'm facing the same problem with my Spark job, but in my case the data is huge, so caching is probably not possible. Is there any other option to avoid data reshuffling? Thanks in advance – Atul Singh Rajpoot Jun 01 '17 at 14:46
  • @Atul Shuffled data will reuse shuffle files. See [What does “Stage Skipped” mean in Apache Spark web UI?](https://stackoverflow.com/q/34580662/1560062) and other questions which are linked to it. Also caching with memory and disk should be possible in general. – zero323 Jun 01 '17 at 15:38
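To illustrate the shuffle-file reuse zero323 mentions, a minimal sketch (the keying function is illustrative): once a shuffle has run, later actions reuse its output and the pre-shuffle stage shows up as skipped in the web UI.

val pairs = rdd.map(x => (x % 10, x)) // illustrative keying
val reduced = pairs.reduceByKey(_ + _) // introduces a shuffle
reduced.count // runs both stages and writes shuffle files
reduced.collect // the map-side stage is skipped; the shuffle output is reused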