I am working on a MapReduce project (like the word count example) with some changes. I have many files to process when I run the program, and I want each map to take one of the files and process it separately from the others, i.e. the output for each file should be independent of the other files' output.

I tried to use:

Path filesPath = new Path("file1.txt,file2.txt,file3.txt");
MultipleInputs.addInputPath(job, filesPath, TextInputFormat.class, Map.class);

but the output I get mixes the output of all the files together, and if a word appears in more than one file it is counted only once, which is not what I want. I want the word count for each file kept separate.

So how can I do this?

If I put the files in a directory, will they be processed independently?


2 Answers


This is the way Hadoop's MapReduce works. All input files are processed together: the map outputs are merged, sorted by key, and all records with the same key are fed to the same reducer, regardless of which file they came from.

If you want one mapper to see only one file, you have to run one job per file, and configure each job so that it uses only a single map task.
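A rough sketch of the "one job per file" approach, assuming the new org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer stand in for your existing word-count classes, and the file/output names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PerFileDriver {
    public static void main(String[] args) throws Exception {
        String[] inputFiles = {"file1.txt", "file2.txt", "file3.txt"};
        Configuration conf = new Configuration();

        for (String file : inputFiles) {
            Job job = Job.getInstance(conf, "wordcount-" + file);
            job.setJarByClass(PerFileDriver.class);
            job.setMapperClass(WordCountMapper.class);    // your existing word-count mapper
            job.setReducerClass(WordCountReducer.class);  // your existing word-count reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // One input file and one separate output directory per job
            // keeps each file's counts independent.
            FileInputFormat.addInputPath(job, new Path(file));
            FileOutputFormat.setOutputPath(job, new Path("output-" + file));

            job.waitForCompletion(true);                  // run the jobs one after another
        }
    }
}

Note that a small input file normally produces a single split (and therefore a single map task) anyway; each job writes its results under its own output directory.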


Within the map task you can get the name of the file that the current record comes from.

Get File Name in Mapper

Once you have the file name, you can prepend it to the map output key to form a composite key. If you also want all the words from one file handled by a single reducer, add a custom partitioner so that keys from the same file are routed to the same reducer (and a grouping comparator if you want them in the same reduce() call).
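A minimal sketch of the composite-key idea, assuming the new org.apache.hadoop.mapreduce API with TextInputFormat (so the input split can be cast to FileSplit); the class and field names are illustrative, not from the original post:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerFileWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Name of the file this split belongs to.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            // Composite key "fileName<TAB>word", so counts never mix across files.
            compositeKey.set(fileName + "\t" + tokens.nextToken());
            context.write(compositeKey, ONE);
        }
    }
}

With this composite key, the standard summing word-count reducer already produces per-file counts; the partitioner/grouping comparator is only needed if you additionally want all of a file's words processed by one reducer.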

  • Great answer. In my case I want to send a title for each file along with its content, so I can use the file name as the title, as you say. I will try the grouping comparator now and I hope it works. Thanks. – user5532529 Feb 07 '17 at 19:19