5

I was trying to understand a MapReduce program. While doing that, I noticed that the reduce tasks start executing almost immediately after all the map tasks are finished. Now, this is surprising, because the reduce tasks work with data that is grouped by key, meaning that a shuffle/sort step has to happen in between. The only way this could happen is if the shuffling was being done in parallel with the mapping.

Secondly, if shuffling is indeed done in parallel with mapping, what is the equivalent of that in Apache Spark? Can mapping and grouping by keys and/or sorting happen in parallel there too?

pythonic
  • 20,589
  • 43
  • 136
  • 219
  • Very short answer (too short for normal answer): you can see shuffles as new stages in Spark's DAG. New stage = new shuffle, probably with few exceptions – T. Gawęda Apr 04 '17 at 20:21
  • for the mapReduce part of the question, you may find this post helpful: http://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce/22169760#22169760 – vefthym Apr 07 '17 at 14:12

1 Answer

5

Hadoop's MapReduce is not just map and reduce stages; there are additional steps such as combiners (a map-side reduce) and merge, as illustrated in the diagram at http://www.bodhtree.com/blog/2012/10/18/ever-wondered-what-happens-between-map-and-reduce/. While maps are still running, the keys they emit can already be routed and merged, so by the time the maps finish, all of the information needed for some reduce buckets may already be processed and ready for reduce.
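As a rough illustration (a toy Python simulation, not real Hadoop code — the function names and the word-count job are invented for the example), here is how each finishing map task can combine its output and push the partial results into reducer buckets, so the reduce-side input is largely assembled before the last map ends:

```python
from collections import defaultdict

def run_map_tasks(splits, num_reducers):
    """Simulate map tasks that combine and push output as each task ends."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:           # each split corresponds to one map task
        # map phase: emit (word, 1) pairs
        emitted = [(word, 1) for word in split.split()]
        # combiner: map-side reduce of this task's output
        combined = {}
        for key, value in emitted:
            combined[key] = combined.get(key, 0) + value
        # "shuffle": push the partial sums to the reducer that owns each key
        # as soon as this map task finishes -- in parallel with later maps
        for key, value in combined.items():
            reducer_inputs[hash(key) % num_reducers][key].append(value)
    return reducer_inputs

def reduce_task(inputs):
    """Reduce phase: merge the partial sums pushed by the map tasks."""
    return {key: sum(values) for key, values in inputs.items()}
```

By the time `run_map_tasks` returns (i.e. all maps are done), each reducer's bucket already holds merged partial sums, so reduce can start almost immediately — which matches the behavior observed in the question.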

Spark builds a DAG (directed acyclic graph) of the phases needed to process the data and groups them into stages at the points where data needs to be shuffled between nodes. Unlike Hadoop, where data is pushed to reducers while the maps run, Spark's reducers pull their data and thus only fetch it when they begin to run. (On the other hand, Spark tries to do more of the work in memory rather than on disk and, by working with a DAG, handles iterative processing better.)
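The push-vs-pull contrast can be sketched with another toy simulation (plain Python, not actual Spark internals — the function and variable names here are invented): the map stage only writes its output into per-partition "shuffle files", and nothing reaches a reducer until the reduce stage starts and pulls its partition:

```python
def map_stage(records, num_partitions):
    """Map tasks write output partitioned by key hash; nothing is sent yet."""
    shuffle_files = [dict() for _ in range(num_partitions)]
    for key, value in records:
        bucket = shuffle_files[hash(key) % num_partitions]
        bucket.setdefault(key, []).append(value)
    return shuffle_files  # data sits here until a reducer asks for it

def reduce_stage(shuffle_files, partition, reduce_fn):
    """A reduce task starts only after the whole map stage finished,
    then *pulls* its partition's data and reduces it."""
    pulled = shuffle_files[partition]
    return {key: reduce_fn(values) for key, values in pulled.items()}
```

This is why, in Spark, the stage boundary (the shuffle) separates the work: the map-side stage must complete before the pull-based reduce-side stage begins.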

Alexey Grishchenko has a good explanation of Spark Shuffle here (note that as of Spark 2 only the sort-based shuffle exists)

Arnon Rotem-Gal-Oz
  • 25,469
  • 3
  • 45
  • 68