modifying files on hdfs using mapreduce

Question

Can I modify files that reside on HDFS? Is the only way to create a temporary file with the modified content and drop the original file?

Can I modify a file using MapReduce? Can different blocks of a file be modified in parallel and then combined into a single file?

score 1 · Answer 1 · edited May 23 '17 at 10:27


You cannot modify a file in place once it is in HDFS; the only change you can make is to append to it. See this answer, which confirms that append is possible:

Append data to existing file in HDFS Java

MapReduce lets you operate on a file in parallel: each mapper reads one input split (by default, roughly one block of the file), and many mappers run at once. This is how it is designed to work.

Any given mapper could filter rows and write out all, some or none of them to a new file pretty easily.
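A minimal sketch of such a filtering mapper, written in Python in the style of a Hadoop Streaming script (an assumption; the answer does not name a language or API). The predicate `keep_row` and its rule of dropping blank rows and rows starting with `#` are hypothetical placeholders for whatever filter you need.

```python
import sys


def keep_row(line):
    """Hypothetical filter rule: drop blank rows and rows starting with '#'."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("#")


def filter_rows(lines):
    """Yield only the rows that pass keep_row, in their original order."""
    for line in lines:
        if keep_row(line):
            yield line


def run_mapper(src=sys.stdin, dst=sys.stdout):
    """Streaming-style mapper: read rows from src, write surviving rows to dst."""
    for row in filter_rows(src):
        dst.write(row)
```

Used as a streaming mapper, the script would end with a call to `run_mapper()`, so each mapper filters only the rows of its own split.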

If you use MapReduce to write out the modified file, the output will by default be a directory of part files (one per mapper or reducer), which can be combined into a single file if you need one.
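One way to combine such an output directory is to concatenate the part files in name order. The sketch below does this against a local copy of the output directory (an assumption; on a cluster, `hadoop fs -getmerge` performs a similar concatenation into a local file). The function name `merge_parts` and the `part-*` pattern are illustrative.

```python
import glob
import os
import shutil


def merge_parts(output_dir, merged_path):
    """Concatenate part-* files from output_dir into merged_path, in name order."""
    # Sorting by name puts part-00000 before part-00001, and so on.
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    return part_files
```

Name order only matches the original row order if the part numbers follow the order of the input splits, which is the point raised in the comments below.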


answered Jan 21 '16 at 17:40

Stephen ODonnell


Hi, many thanks for your reply. How do I ensure that the output is serialized when combining the output files into a single file? What I mean is that the original file has block 1 data followed by block 2 data, and my output file should also have block 1 data followed by block 2 data, but with some rows in each block filtered out. Is this possible? – user2783058 Jan 21 '16 at 18:25
In a MapReduce job, you will start with 1 file of, say, 10 blocks. One process will read each block, and if you just want to filter some rows and write out the data, you will end up with 10 files in a directory, each corresponding to one of your original 10 blocks. The files will be numbered 00000 to 00009 and I think they will have the same order as your original blocks, but I am not certain of that. If each block can be processed independently, do you care which block is 1st or 2nd? – Stephen ODonnell Jan 22 '16 at 09:30
If ordering is important, run the map-reduce job with a single reducer and sort it - then you will have one file with N blocks and a guaranteed sort order. – Stephen ODonnell Jan 22 '16 at 09:31
Hi, thanks again. Is it guaranteed that part-00000 comes from the mapper that operated on the first block of the original file? It could be that the mapper operated on the 10th block of the original file and its output is part-00000, right? In that case, how can I combine the outputs from the mappers in sorted order? – user2783058 Jan 22 '16 at 10:42
I probably didn't understand your suggestion to use a reducer to sort. Can you please elaborate? – user2783058 Jan 22 '16 at 10:44
I think you will need to read up on how MapReduce works. I cannot do it justice in a comment here, but there is plenty out there on Google already. – Stephen ODonnell Jan 22 '16 at 15:45
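
The single-reducer suggestion in the comments can be illustrated with a small simulation in plain Python (not Hadoop code). It assumes each mapper passes through the row's byte offset in the original file as its output key, which is what Hadoop's `TextInputFormat` supplies as the map input key. With one reducer, the sort on that key restores the original row order regardless of which mapper finishes first. The helper names and sample rows are hypothetical.

```python
import random


def split_into_blocks(rows, rows_per_block):
    """Stand-in for input splits: pair each row with its byte offset, then chunk."""
    keyed, offset = [], 0
    for row in rows:
        keyed.append((offset, row))
        offset += len(row.encode("utf-8"))
    return [keyed[i:i + rows_per_block] for i in range(0, len(keyed), rows_per_block)]


def map_block(block, keep):
    """Mapper: emit (offset, row) for every row that survives the filter."""
    return [(offset, row) for offset, row in block if keep(row)]


def single_reducer(map_outputs):
    """Single reducer: sorting on the offset key yields rows in file order."""
    merged = [pair for output in map_outputs for pair in output]
    return [row for _, row in sorted(merged)]


rows = ["a,1\n", "b,2\n", "c,3\n", "d,4\n", "e,5\n", "f,6\n"]
blocks = split_into_blocks(rows, rows_per_block=2)
random.shuffle(blocks)  # mappers may finish in any order
outputs = [map_block(b, keep=lambda r: not r.startswith("c")) for b in blocks]
result = single_reducer(outputs)
```

Here `result` holds the surviving rows in their original order even though the blocks were processed in a shuffled order. The trade-off is that a single reducer handles all of the output, so the write step is no longer parallel.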
