I need to send only selected records from the mapper to the reducer, and write the remaining (filtered-out) records to HDFS from the mapper itself; the reducer will write the records sent to it. My job processes a huge amount of data, around 20 TB, and uses 30K mappers, so I believe I cannot simply write from the mapper's cleanup method either, because loading the output of those 30K mappers (30K files) will be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario, perhaps with a different approach?
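For reference, this is roughly the mapper-side split I have in mind, using `MultipleOutputs` for the side files (a simplified, untested sketch; the named output `"filtered"`, the class names and the key/value types are just placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (needsReduce(value)) {
            // Selected records go to the reducer as normal mapper output.
            context.write(makeKey(value), value);
        } else {
            // Filtered records are written straight to HDFS by this mapper,
            // under <output-dir>/filtered/part-m-NNNNN.
            mos.write("filtered", NullWritable.get(), value, "filtered/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush and close the side files
    }

    // Placeholders for the real selection condition and key extraction.
    private boolean needsReduce(Text value) { return true; }
    private Text makeKey(Text value) { return new Text(); }
}
```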
- A very interesting question! (+1). I once had this problem and didn't find anything better than sending those records from the mapper to the reducer, too, and writing everything from the reducer (after filtering which records needed further processing). Of course, that was very inefficient compared to writing things straight from the mapper. – vefthym Jul 31 '15 at 07:52
1 Answer
When you want to write the data to HDFS, is it through a Java client directly to HDFS? If yes, then you can put conditional logic in the mapper that either writes a record straight to HDFS or emits it to the job's output location, from where the reducer picks it up. Records not meeting the condition can thus be emitted by the mapper to the output location and later picked up by the reducer. By default the output location is also an HDFS location, but you have to decide which way you want the data laid out in HDFS for your case.
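A minimal sketch of how that conditional split could be wired up in the driver, assuming a `MultipleOutputs`-style mapper like the one sketched in the question (the named output `"filtered"`, `SplitMapper` and the paths are illustrative, not a definitive implementation):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mapper-side split");
        job.setJarByClass(SplitJobDriver.class);

        job.setMapperClass(SplitMapper.class);
        // job.setReducerClass(YourJoinReducer.class);  // plug in your reducer here
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Declare the side output the mapper writes to; its records land under
        // <args[1]>/filtered/ next to the reducer's part-r-* files.
        MultipleOutputs.addNamedOutput(job, "filtered",
                TextOutputFormat.class, NullWritable.class, Text.class);

        // Avoid empty part files for the default output when a task writes
        // only to the named output.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this setup the reducer's records land in the normal `part-r-*` files under the output directory, while the mapper-side records end up under the `filtered/` subdirectory, so a follow-up job can be pointed at just that directory.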

Ramzy
- Thanks for your response. I am implementing a map-side join with a lot of custom formulas, and the need is to go via an MR program. My problem is that when I write from the mapper, it creates many files (equal to the number of mappers in the job, each mapper creating its own file; in my case 40 mappers), which is a problem for the next job that has to read those files (40K files in my case). Is there a way I can combine these files, or control the output data from the mappers, so that I can control the number of files written from the mapper? – Akhtar Aug 03 '15 at 07:05
- Why not just give the directory name to the subsequent jobs? The number of output files should not be a problem, unless you have a specific requirement. – Ramzy Aug 03 '15 at 07:17
- If you have a large number of small files (40k+), Hadoop mostly struggles to keep the paths and other metadata in memory for all read operations. For me it hangs the CDH4 cluster when my next job triggers and tries to load these 40K files. – Akhtar Aug 03 '15 at 07:44
- You can look [here](http://stackoverflow.com/questions/3548259/merging-multiple-files-into-one-within-hadoop) for merging. Mostly **getmerge** should do the trick for you. Once you have a single file, it's up to you. Take care that you are merging it to the local machine. – Ramzy Aug 03 '15 at 08:02
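If pulling everything to the local machine with `getmerge` is not practical at this scale, one alternative is to do the merge inside HDFS with `FileUtil.copyMerge` (available in Hadoop 1.x/2.x, i.e. CDH4). A minimal sketch, with placeholder paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path(args[0]);   // directory holding the many small output files
        Path dstFile = new Path(args[1]);  // single merged file, still on HDFS

        // Concatenates every file under srcDir into dstFile without copying
        // anything to the local machine; pass true instead of false to delete
        // the source files afterwards.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, null);
    }
}
```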