How to write 'map only' hadoop jobs?

Question

I'm new to Hadoop and getting familiar with the MapReduce programming style, but I've run into a problem: sometimes a job needs only the map phase, and I want the map result written directly as output, so the reduce phase is not needed. How can I achieve that?

Check this [Map-only Jobs](http://www.unmeshasreeveni.blogspot.in/2014/05/map-only-jobs-in-hadoop.html) — USB, May 05 '14 at 06:04

Thomas Jungblut · Accepted Answer · 2014-04-02T14:59:07.393

59

This turns off the reducer.

job.setNumReduceTasks(0);

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)
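
For context, a minimal sketch of a complete map-only driver built around that call might look like the following. The class names `MapOnlyDriver` and `UpperCaseMapper` and the upper-casing logic are hypothetical, chosen only for illustration, and the sketch assumes the `org.apache.hadoop.mapreduce` API with the default text input and output formats. With zero reduce tasks, each mapper's output should be written straight to the output directory as `part-m-*` files, and the job-level output key/value classes then describe what the mapper emits.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    // Hypothetical mapper used only for illustration: upper-cases each input line.
    public static class UpperCaseMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(offset, new Text(line.toString().toUpperCase()));
        }
    }

    // Builds the job without submitting it, so the configuration can be inspected.
    public static Job buildJob(Configuration conf, Path input, Path output)
            throws IOException {
        Job job = Job.getInstance(conf, "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(UpperCaseMapper.class);

        // No reducer class is set; zero reduce tasks turns the reduce phase off.
        job.setNumReduceTasks(0);

        // With no reducer, these describe the mapper's output types.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }

    public static void main(String[] args) throws Exception {
        // Assumes args[0] is the input path and args[1] is a not-yet-existing output path.
        Job job = buildJob(new Configuration(), new Path(args[0]), new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```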

edited Apr 02 '14 at 14:59

answered Feb 22 '12 at 12:48

Thomas Jungblut

Thank you Thomas, but one problem remains: after setting the number of reduce tasks to 0, how do I save the map result on HDFS? (I mean, how do I write map results to files like part-m-*****?) – Breakinen Feb 23 '12 at 15:26
Hadoop does this for you, you don't need to care about it. – Thomas Jungblut Feb 23 '12 at 15:31
Do we need to specify the reduce output key and value classes in this case? – Balaji Boggaram Ramanarayan Apr 07 '14 at 19:21

score 9 · Answer 2 · answered Feb 22 '12 at 14:28

You can also use the IdentityReducer:

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/IdentityReducer.html

Peter Wippermann

Thank you Peter, I read the source of IdentityReducer and it's really what I meant to do, but is there any way to write the map result directly to HDFS without a reduce phase? (You know the shuffle phase costs a lot of bandwidth and CPU/memory resources.) – Breakinen Feb 23 '12 at 15:31
IdentityMapper can be used with or without a follow-on reducer. If you use the identity mapper to jump straight thru to the reduce stage you still have the sort-and-shuffle and i/o overhead so using the method mentioned by Thomas is the right way to go if you don't need a reducer. – omnisis Feb 14 '13 at 07:45
I'm sorry omnisis, but that's not correct: Setting the number of reduce tasks to zero will omit any sorting. http://stackoverflow.com/questions/10630447/hadoop-difference-between-0-reducer-and-identity-reducer – Peter Wippermann Feb 15 '13 at 10:02
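
To make the difference discussed in these comments concrete, here is a toy simulation in plain Java. It does not use Hadoop at all; the class and method names are hypothetical, and a `TreeMap` merely stands in for the sort-and-shuffle step. It is only meant to sketch why a zero-reducer job keeps records in the order the mapper emitted them, while an identity reducer still receives them sorted and grouped by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ZeroVsIdentityReducer {

    // Toy "map-only" job: each record is transformed and written out in input order.
    public static List<String> mapOnly(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(record.toLowerCase());
        }
        return out;
    }

    // Toy "map + identity reduce" job: map output is sorted and grouped by key
    // (the TreeMap stands in for the shuffle), then passed through unchanged.
    public static List<String> mapThenIdentityReduce(List<String> records) {
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        for (String key : mapOnly(records)) {
            shuffled.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
        }
        List<String> out = new ArrayList<>();
        for (List<String> group : shuffled.values()) {
            out.addAll(group); // identity: emit every value as-is
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Pear", "apple", "Fig");
        System.out.println(mapOnly(input));               // [pear, apple, fig]
        System.out.println(mapThenIdentityReduce(input)); // [apple, fig, pear]
    }
}
```

In a real cluster the sorted variant also pays for moving map output across the network, which is the cost the zero-reducer setting avoids.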

score 5 · Answer 3 · answered Mar 08 '16 at 17:11

This can be quite helpful when you need to launch a mapper-only job from the terminal. You can turn off reducers by specifying 0 reducers in the hadoop jar command:

-D mapred.reduce.tasks=0

The resulting command looks like this:

hadoop jar myJob.jar -D mapred.reduce.tasks=0 -input myInputDirs -output myOutputDir

To be backward compatible, Hadoop also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
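
For a custom (non-streaming) jar, `-D` options only reach the job if the driver parses Hadoop's generic options, which is typically done through `ToolRunner`. The following is a minimal sketch under that assumption; the class name `MapOnlyTool` is hypothetical, the base `Mapper` class is used as an identity mapper, and the property name `mapreduce.job.reduces` is the newer replacement for `mapred.reduce.tasks`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapOnlyTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already holds any -D key=value pairs that ToolRunner parsed,
        // so the reducer count is taken from the command line, not hard-coded.
        Job job = Job.getInstance(getConf(), "map-only via -D");
        job.setJarByClass(MapOnlyTool.class);

        // The base Mapper passes every record through unchanged (identity mapper).
        job.setMapperClass(Mapper.class);

        // args now contains only the remaining, non-generic arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapOnlyTool(), args));
    }
}
```

Such a driver could then be launched with something like `hadoop jar myJob.jar MapOnlyTool -D mapreduce.job.reduces=0 myInputDirs myOutputDir`, with the generic `-D` option placed before the application arguments.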

Alex

Hadoop now gives a deprecation warning for -D mapred.reduce.tasks and recommends using -D mapreduce.job.reduces instead. – Adam Jan 27 '17 at 19:49

score 0 · Answer 4 · answered Jul 01 '18 at 13:12

If you are using Oozie as a scheduler to manage your Hadoop jobs, you can simply set the property mapred.reduce.tasks (the default number of reduce tasks per job) to 0. Specify your mapper in the property mapreduce.map.class; there is no need to add the property mapreduce.reduce.class, since reducers are not required.

<configuration>
   <property>
     <name>mapreduce.map.class</name>
     <value>my.com.package.AbcMapper</value>
   </property>
   <property>
     <name>mapred.reduce.tasks</name>
     <value>0</value>
   </property>
   .
   .
   .
</configuration>
