
My job is computationally intensive, so I am really only using Hadoop's distribution capabilities, and I want all my output to end up in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...

If I explicitly set the number of reducers to 0, how can I control the mappers so that all their output is written into the same single output file? Thanks.

Kevin

2 Answers


You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
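In code, the single-reducer setup is just one line on the job. A minimal sketch (the class name and job name here are illustrative; the Hadoop calls are standard):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SingleOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "single-output");
            // A single reduce task funnels all map output into one part-r-00000 file.
            job.setNumReduceTasks(1);
            // ... set mapper/reducer classes and input/output paths as usual ...
        }
    }

Alternatively, hadoop fs -getmerge <hdfs-output-dir> <local-file> concatenates all the part files onto your local disk after the job finishes.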

EDIT: From your comment, I see that what you want is to avoid the hassle of writing a reducer. This is definitely possible: you can use the IdentityReducer. Its API is documented here, and an explanation of 0 reducers vs. using the IdentityReducer is available here.
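For illustration, a minimal sketch of wiring in the IdentityReducer using the old mapred API (where that class lives); note that in the newer mapreduce API the base Reducer class already passes records through unchanged:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class IdentityReduceSetup {
        public static void configure(JobConf conf) {
            // Every (key, value) pair passes through unchanged; with one
            // reduce task this still gathers all map output into one file.
            conf.setReducerClass(IdentityReducer.class);
            conf.setNumReduceTasks(1);
        }
    }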

Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, such as having all mappers write to a single database. This works well if your mappers are not generating much output. Details on how this would work are available here.
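As a rough sketch of the database route using Hadoop's DBOutputFormat; the driver, connection URL, credentials, table, and column names below are all hypothetical, and your map output key class would need to implement DBWritable:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    public class DbSinkJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical driver/URL/credentials -- substitute your own.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "user", "password");
            Job job = Job.getInstance(conf, "db-sink");
            job.setOutputFormatClass(DBOutputFormat.class);
            // Table "results" and its columns are made up for illustration;
            // the map output key class must implement DBWritable.
            DBOutputFormat.setOutput(job, "results", "key_col", "value_col");
            job.setNumReduceTasks(0); // map-only: every mapper writes straight to the DB
        }
    }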

cabad
  • That's my understanding... just wondering if there is any hidden indicator/param to do that. So the best way is to keep the reducer there. :/ – Kevin Oct 31 '13 at 15:13
  • @kevin. No, and there can't be, since it would kill your performance. – cabad Oct 31 '13 at 15:17
  • By "hidden param" I mean something that would let me skip writing the reducer class, with Hadoop being smart enough to "reduce" all the output into one file. Obviously I was thinking too much. lol – Kevin Oct 31 '13 at 15:19
  • @kevin You could use one of the preexisting reducers. I'll update my answer with this and another suggestion – cabad Oct 31 '13 at 15:21
  • FYI, I added an answer that makes one additional suggestion. – John B Oct 31 '13 at 15:38

cabad is correct for the most part. However, if you want to process the file with a single mapper and produce a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the parallelism of multiple data nodes but skips the shuffle and sort phase entirely.
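For example, a minimal non-splittable input format, subclassing the standard TextInputFormat (the class name is my own):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Returning false hands each input file to exactly one mapper.
            return false;
        }
    }

Then configure the job with job.setInputFormatClass(NonSplittableTextInputFormat.class) and job.setNumReduceTasks(0); the single mapper's part-m-00000 becomes the job's only output file.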

John B
  • Yes, a single mapper works too. I didn't suggest it since he said his tasks are CPU intensive, so I am guessing a single mapper would kill his performance. I should, however, have included this alternative for the sake of completeness (future reference). Thanks for pointing it out. – cabad Oct 31 '13 at 15:53