8

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I've learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to a reducer. Since there is only one driver program, there would be only one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? The driver could simply run out of memory.

HHH

1 Answer

15

The driver is more of a controller of the work, only pulling data back if the operation calls for it. If the operation you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type, then it will indeed pull all of the data back.
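As a rough sketch (path here is just a placeholder variable), the return type of each call tells you where the data ends up:

val lines   = sc.textFile(path)    // RDD[String] -> stays distributed on the workers
val lengths = lines.map(_.length)  // RDD[Int]    -> still distributed (just a transformation)
val total   = lengths.count()      // Long        -> a native type, so the result comes back to the driver
val all     = lengths.collect()    // Array[Int]  -> ALL of the data is pulled back to the driver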

Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits, either in the UI or via a toDebugString (where each indentation level is a shuffle).

All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.

Last, to equate to your word count example:

sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_+_)

In the above, this will be done in one stage as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it will need to combine all of the data to perform the operation...HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data sets afterward. This reduces both memory and network overhead.
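For instance, assigning the snippet above to a value and printing its lineage shows the single shuffle split (output abbreviated; the exact RDD names and partition counts will vary by version and input):

val counts = sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)
// (2) ShuffledRDD[4] at reduceByKey ...        <- second stage, after the shuffle
//  +-(2) MapPartitionsRDD[3] at map ...        <- first stage: load, split, map
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  MapPartitionsRDD[1] at textFile ...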

NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data is NOT pulled back to the driver; it merely moves to the nodes that hold the same keys so that each key can have its final value merged.
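To see what that local combining buys you, compare reduceByKey with groupByKey, which ships every single (word, 1) pair across the network before any summing happens (a sketch for contrast only; don't do this for word count):

val pairs = sc.textFile(path).flatMap(_.split(" ")).map((_, 1))

// reduceByKey: each partition first collapses its own (word, 1) pairs,
// so at most one record per word per partition is shuffled
pairs.reduceByKey(_ + _)

// groupByKey: every (word, 1) record is shuffled to the node owning that key,
// and only then summed -- far more memory and network overhead
pairs.groupByKey().mapValues(_.sum)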

Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it there.
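Using counts from the sketch above (the output path is just a placeholder), the distinction looks like this:

counts.collect()                        // Array[(String, Int)]: the whole result lands in driver memory
counts.take(10)                         // only 10 elements come back -- small and safe
counts.count()                          // a single Long comes back
counts.saveAsTextFile("out/wordcounts") // nothing large returns; each worker writes its own partition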

Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey

Justin Pihony
  • The associative property brings in functionality like the combiner in Hadoop. However, for the reduceByKey the data still needs to get shuffled and be sent to another worker node? Can we say that the worker node plays the reducer role? – HHH Jul 31 '15 at 18:58
  • So that means my first understanding was right, because in the case of word count, let's say we have 1M unique words. So all those unique words (along with their partial frequencies) get transferred to the driver in order to get reduced? In other words, can we conclude that there is one reducer in Spark and that is the driver? All those reductions are also done in Hadoop (using a combiner). – HHH Jul 31 '15 at 19:08
  • Blah, I somehow blanked that reduceByKey is a transformation....the driver will NOT get all the data in this case...this is why pair functions are pushed when you have a key/value representation...I updated my answer to specify that...but yes, the workers are the reducers in that role. – Justin Pihony Jul 31 '15 at 19:24