My Hadoop job needs to be aware of the input path that each record is derived from.
For example, assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce over key/value pairs such as:
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat
that would accomplish this? Or is there a better way to go about this than using a custom input format?
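For concreteness, here is a rough sketch of the kind of InputFormat I have in mind (the names PathTextInputFormat and PathRecordReader are just placeholders I made up). It delegates line reading to the stock LineRecordReader and swaps the usual byte-offset key for the source path taken from the FileSplit:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical format: emits (source path, line) instead of (byte offset, line).
public class PathTextInputFormat extends FileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new PathRecordReader();
  }

  private static class PathRecordReader extends RecordReader<Text, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final Text key = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      delegate.initialize(split, context);
      // Every record in this split shares the same source path.
      key.set(((FileSplit) split).getPath().toString());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      return delegate.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() {
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}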
I know that inside a mapper this information is accessible from the MapContext
(How to get the input file name in the mapper in a Hadoop program?). However, I am using Apache Crunch, so I cannot control whether any given step will run as a map or a reduce. What I can reliably control is the InputFormat, so it seemed to me to be the place to do this.
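If such a format is the right approach, my plan was to wire it into the pipeline with something along these lines. I am assuming the PType-based overload of From.formattedFile and the Writables.writables helper behave the way I remember (I have not verified the exact overload), and PathTextInputFormat is the hypothetical format sketched above:

import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.io.Text;

public class PathAwareRead {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(PathAwareRead.class);

    // Read (source path, record) pairs using the custom format sketched above.
    // From.formattedFile with PTypes is my assumption about the Crunch API.
    PTable<Text, Text> records = pipeline.read(
        From.formattedFile("s3://bucket/", PathTextInputFormat.class,
            Writables.writables(Text.class), Writables.writables(Text.class)));

    // Downstream DoFns / groupByKey would then see which file each record came from.

    pipeline.done();
  }
}

Is a wrapper format like this the idiomatic way to do it, or does Hadoop/Crunch already ship something for this?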