26

I am just trying to confirm my understanding of the difference between using zero reducers and using an identity reducer.

  • Zero reducers means the reduce step is skipped and the mapper output is the final output.
  • An identity reducer means shuffling/sorting still takes place?
kee

4 Answers

39

Your understanding is correct. I would define it as follows: if you do not need the map results sorted, you set the number of reducers to 0, and the job is called map-only.
If you need the map results sorted but do not need any aggregation, you choose the identity reducer.
And to complete the picture, there is a third case: you do need aggregation, and in that case you need a real reducer.
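The three cases can be sketched with a toy model in plain Python. This is not Hadoop's API: `run_job`, `partition_for`, `identity_reducer`, `sum_reducer`, and `tokenize` are hypothetical names made up for illustration, and the CRC32-based partitioner is only an assumed stand-in for a hash partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Stable stand-in for a hash partitioner (an assumption of this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def identity_reducer(key, values):
    # Passes every (key, value) pair through unchanged: no aggregation.
    return [(key, value) for value in values]


def sum_reducer(key, values):
    # Aggregates all values of one key into a single pair.
    return [(key, sum(values))]


def run_job(splits, mapper, reducer=None, num_reducers=0):
    # Map phase: one output list per input split (one per "mapper").
    map_outputs = [[pair for record in split for pair in mapper(record)]
                   for split in splits]
    if num_reducers == 0:
        # Map-only job: no partitioning, shuffling, or sorting happens.
        return map_outputs
    # Shuffle: route each pair to a reducer and group values by key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    # Sort + reduce: each reducer sees its keys in sorted order.
    return [[pair for key in sorted(part) for pair in reducer(key, part[key])]
            for part in partitions]


def tokenize(line):
    return [(word, 1) for word in line.split()]


splits = [["b a"], ["a c"]]
map_only = run_job(splits, tokenize)
identity = run_job(splits, tokenize, identity_reducer, num_reducers=1)
aggregated = run_job(splits, tokenize, sum_reducer, num_reducers=1)

print(map_only)    # [[('b', 1), ('a', 1)], [('a', 1), ('c', 1)]]
print(identity)    # [[('a', 1), ('a', 1), ('b', 1), ('c', 1)]]
print(aggregated)  # [[('a', 2), ('b', 1), ('c', 1)]]
```

The map-only output keeps each mapper's pairs in emission order, the identity reducer yields the same pairs sorted by key, and only the summing reducer changes the data.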

David Gruzman
5

Another use case for the Identity Reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services to write to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record) and you have a lot of mappers (e.g. thousands).
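As a rough illustration of the file-count effect, here is a plain-Python simulation, not Hadoop code. The helper names, the record keys, and the CRC32 partitioner are all assumptions made for the sketch; each inner list stands in for one output file.

```python
import zlib


def partition_for(key, num_reducers):
    # Stable stand-in for a hash partitioner (an assumption of this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_only_files(mapper_outputs):
    # Map-only job: every mapper writes its own output file, even if empty.
    return [list(output) for output in mapper_outputs]


def identity_reduce_files(mapper_outputs, num_reducers):
    # Identity reducers: pairs are bucketed by key, sorted, and written
    # unchanged, giving exactly num_reducers output files.
    files = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            files[partition_for(key, num_reducers)].append((key, value))
    for pairs in files:
        pairs.sort(key=lambda pair: pair[0])
    return files


# 1000 hypothetical grep-style mappers; every 100th one finds a single match.
mapper_outputs = [[(f"record-{i:04d}", "match")] if i % 100 == 0 else []
                  for i in range(1000)]

small_files = map_only_files(mapper_outputs)
merged_files = identity_reduce_files(mapper_outputs, num_reducers=3)

print(len(small_files), len(merged_files))  # 1000 3
```

The same ten matching records end up in 3 files instead of 1000, at the cost of the extra partition and sort work.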

Dolan Antenucci
  • Hi Dolan, could you elaborate a bit about using Identity Reducer to combine results into fewer files? I was facing similar problems -- having lots of small files generated by map-only jobs. Would it be less efficient compared to map-only jobs? – Yitong Zhou Sep 19 '14 at 18:58
  • Yitong -- there is additional overhead when using Identity Reducers compared with none at all, because the Mapper outputs need to be hashed into X buckets (where X is your desired number of output files), sent to the X reducers, sorted, and then saved to the output directory on HDFS/S3/etc. If you have a ton of data, be careful with this additional overhead because it can be significant in some cases. Alternatively, if saving to HDFS, you can use `hdfs dfs -cat` to stream all the files' output into one location. I don't know if S3 has a similar stream-reading mechanism. – Dolan Antenucci Sep 20 '14 at 11:10
4

The main difference between "no reducer" (mapred.reduce.tasks=0) and the "standard reducer", which is IdentityReducer (mapred.reduce.tasks=1, etc.), is that with "no reducer" there is no partitioning and shuffling after the map stage. In that case you get the 'pure' output from your mappers without any further processing. This helps with development and debugging, though that is not its only use.

morsik
3

It depends on your business requirements. If you are doing a wordcount you should reduce your map output to get a total result. If you just want to change the words to upper case, you don't need a reduce.
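A small plain-Python sketch of why the two tasks differ (the function names and sample lines are made up for illustration; this is not Hadoop code). Upper-casing is finished once each mapper has processed its own split, while word counts from separate mappers are only partial until something merges them.

```python
from collections import Counter


def uppercase_mapper(line):
    # Each line is transformed independently, so no reduce step is needed.
    return line.upper()


def count_words(split):
    # Counts words within a single mapper's split only.
    return Counter(word for line in split for word in line.split())


splits = [["to be or"], ["not to be"]]

# Map-only: each mapper's output is already final.
upper_output = [[uppercase_mapper(line) for line in split] for split in splits]

# Word count: per-mapper counts are partial; the reduce step sums them.
partial_counts = [count_words(split) for split in splits]
total_counts = sum(partial_counts, Counter())

print(upper_output)          # [['TO BE OR'], ['NOT TO BE']]
print(partial_counts[0]["to"], total_counts["to"])  # 1 2
```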

Stephen Holiday
nice2mu