What is the necessity of shuffle and sort in Hadoop?

Question

In a normal wordcount program in MapReduce, do we need to set any method for shuffle and sort, or will the framework take care of this?

Suggest reading http://stackoverflow.com/a/23701182/1586965 – samthebest Aug 08 '14 at 12:27

vefthym · Accepted Answer · 2014-08-08T07:37:20.110

2

The framework will take care of this. Shuffling is the process of transferring data from mappers to reducers; each reducer receives its input sorted in ascending (lexicographical) order of the intermediate keys (the words).
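To make the idea concrete, here is a minimal plain-Java sketch of what happens between map and reduce in a word count. It is not Hadoop code: the class `ShuffleSortSketch` and its methods are hypothetical names for illustration, everything runs in one process, and a `TreeMap` stands in for the framework's sort (Hadoop compares serialized `Text` keys, which can order some non-ASCII strings differently).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; not part of Hadoop.
public class ShuffleSortSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group all values by key. The TreeMap keeps keys in
    // ascending order, mimicking the sorted input a reducer receives.
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run map, then shuffle/sort, then reduce over all input lines.
    static TreeMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {a=2, b=3, c=1}
        System.out.println(wordCount(Arrays.asList("b a b", "c a b")));
    }
}
```

In a real job you write only the `map` and `reduce` parts; the grouping and ordering done here by `shuffleAndSort` is what the framework provides for you.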

You can change the default settings, but there is no need to do it in a wordcount program. You just need to set a mapper and a reducer and optionally (but really helps in speed) a combiner.

Even implementing a mapper and a reducer of your own is not necessary, as Hadoop ships with a wordcount mapper (TokenCounterMapper) and reducer (IntSumReducer, which can also be used as a combiner).
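A driver along these lines should be enough, assuming the newer `org.apache.hadoop.mapreduce` API (Hadoop 2.x or later). The class name `WordCountDriver` and the use of `args[0]`/`args[1]` as input and output paths are my own choices for the sketch. Note that nothing in it configures shuffle or sort.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Hypothetical driver name; the mapper and reducer classes ship with Hadoop.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Built-in mapper: emits (token, 1) for each token of an input line.
        job.setMapperClass(TokenCounterMapper.class);
        // The combiner is optional; the same class serves as the reducer.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Assumed convention: args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The default partitioner and key comparator are used because none are set, which is all a word count needs.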

edited Aug 08 '14 at 07:37

answered Aug 08 '14 at 07:29

vefthym


I disagree with "but really helps in speed". You should say it might help to speed it up, but combiners may not run at all; the framework does not guarantee that they will run. – Balduz Aug 08 '14 at 07:45
2

@Balduz of course they will run as soon as something is spilled to disk (which is always the case in word count, unless a split is empty). – Thomas Jungblut Aug 08 '14 at 07:53
