
I know how MapReduce works and what steps it has:

  • Mapping
  • Shuffle and sorting
  • Reducing

Of course there are also Partitioning and Combiners, but that's not important right now.
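
For reference, here is a minimal word-count sketch of these steps using the standard org.apache.hadoop.mapreduce API (the class names are just illustrative): the map phase emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reduce phase sums the values for each key.

```java
// Minimal word-count sketch: map emits (word, 1), the framework shuffles and
// sorts by key, and reduce sums the counts for each word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // map output: (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // all values for one key arrive together
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```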

What I find interesting is that when I run MapReduce jobs, the mappers and reducers appear to work in parallel.


So I don't understand how this is possible.

Question 1. If I have multiple nodes doing the mapping operation, how can a reducer start working? A reducer can't start without sorted input, right? (The input must be sorted for the reducer; if the mappers are still working, the input can't be sorted yet.)

Question 2. If I have multiple reducers, how will the final data be merged together? In other words, the final results should be sorted, right? Does that mean we spend an additional O(n log n) to merge the multiple reducer outputs?

grep
  • 5,465
  • 12
  • 60
  • 112
  • Regarding your second question, why do you expect the results to be sorted? – LoMaPh May 30 '19 at 00:07
  • If multiple reducers finish the task, they may have the same keys in their results, right? So the final result should be merged. For example, if I'm trying to find the "average salary" and I have 2 reducers, I must finally merge the results, otherwise I will have different results in different reducer outputs. In that case I would have to sort to find the matching keys (to group similar keys). – grep May 30 '19 at 16:44
  • 1
    The reducer that handles each key is unique: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same. If the key "cat" is generated in two separate (key, value) pairs, they must both be reduced together." [source](https://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning) – LoMaPh May 30 '19 at 20:03
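
To make the partitioning guarantee from the comments concrete: the default partitioner sends every occurrence of a key to the same reduce partition, so each key is reduced by exactly one reducer and there is nothing to merge across reducers afterwards. Below is a rough standalone sketch of that routing rule (essentially what Hadoop's default HashPartitioner does), written here as a plain method rather than the real Partitioner class:

```java
// Rough sketch of the default hash-partitioning rule: every (key, value) pair
// with the same key goes to the same reducer, no matter which mapper emitted it.
public final class PartitioningSketch {

    // numReduceTasks is the number of reducers configured for the job.
    static int partitionFor(String key, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The key "cat" always lands in the same partition, regardless of the mapper.
        System.out.println(partitionFor("cat", 2));
        System.out.println(partitionFor("cat", 2)); // same partition again
        System.out.println(partitionFor("dog", 2)); // possibly a different one
    }
}
```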

1 Answer


Reducers can start copying results from mappers as soon as they become available. This is called the copy phase of the reduce task (see Hadoop: The Definitive Guide, Chapter 7, "How MapReduce Works").
Also from there:

...When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds...
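
To illustrate what that merge does conceptually (this is just a sketch, not the actual Hadoop internals): each map output arrives at the reducer already sorted by key, so the reduce side only performs a k-way merge of sorted runs instead of a full re-sort. A minimal standalone example with a priority queue:

```java
// Conceptual sketch of the reduce-side merge: each map output is already sorted,
// so merging k sorted runs with a priority queue preserves the key ordering
// without re-sorting everything from scratch.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {

    // One cursor per sorted map output: the next key plus the rest of that run.
    private static final class Cursor {
        final String key;
        final Iterator<String> rest;
        Cursor(String key, Iterator<String> rest) { this.key = key; this.rest = rest; }
    }

    static List<String> mergeSortedMapOutputs(List<List<String>> sortedOutputs) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.key));
        for (List<String> output : sortedOutputs) {
            Iterator<String> it = output.iterator();
            if (it.hasNext()) heap.add(new Cursor(it.next(), it));
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.key);                                   // smallest remaining key
            if (c.rest.hasNext()) heap.add(new Cursor(c.rest.next(), c.rest));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = List.of(
                List.of("ant", "cat", "dog"),   // sorted output of mapper 1
                List.of("bee", "cat", "owl"));  // sorted output of mapper 2
        System.out.println(mergeSortedMapOutputs(mapOutputs)); // [ant, bee, cat, cat, dog, owl]
    }
}
```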

mazaneicha
  • 8,794
  • 4
  • 33
  • 52
  • And of course, remember that Hadoop is also an MPP, which means it can handle these tasks in a massively parallel process. No offense, but it's kind of a funny question, because if Hadoop didn't do this, I guess it wouldn't be as useful for companies as it is. – Kenry Sanchez May 31 '19 at 01:56
  • @KenrySanchez Not sure why this question would be funny; maybe you didn't follow it completely. On a separate note, I believe Hadoop is a classical example of a cluster architecture rather than MPP. If you find it confusing, please check https://stackoverflow.com/questions/5570936/what-is-the-difference-between-a-cluster-and-mpp-supercomputer-architecture for example. – mazaneicha Jun 02 '19 at 14:38
  • No worries, no offense meant ;). I understand the Hadoop architecture. But, as I said before, in a professional company environment you need to process a huge amount of data, and Hadoop must work in a cluster (even with 1 master and two slaves). The real power comes with MPP. The cluster is just a way to scale up the processing power, in this case in a horizontally scalable way. But in fact, you are right, it's a classical example of cluster architecture. – Kenry Sanchez Jun 02 '19 at 18:02