My Hadoop job needs to be aware of the input path that each record is derived from.
For example, assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce over key/value pairs such as:
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat
that would accomplish this? Or is there a better way to go about this than using a custom input format?
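For concreteness, here is a rough sketch of the kind of InputFormat I have in mind (the names PathTextInputFormat and PathRecordReader are just placeholders I made up). It delegates line reading to the stock LineRecordReader and swaps the usual byte-offset key for the source path taken from the FileSplit:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical format: emits (source path, line) instead of (byte offset, line).
public class PathTextInputFormat extends FileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new PathRecordReader();
  }

  private static class PathRecordReader extends RecordReader<Text, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final Text key = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      delegate.initialize(split, context);
      // Every record in this split shares the same source path.
      key.set(((FileSplit) split).getPath().toString());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      return delegate.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() {
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}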
I know that inside a mapper this information is accessible from the MapContext
(How to get the input file name in the mapper in a Hadoop program?). However, I am using Apache Crunch, so I cannot control whether any given step will run as a map or a reduce. What I can reliably control is the InputFormat, so it seemed to me to be the place to do this.
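If such a format is the right approach, my plan was to wire it into the pipeline with something along these lines. I am assuming the PType-based overload of From.formattedFile and the Writables.writables helper behave the way I remember (I have not verified the exact overload), and PathTextInputFormat is the hypothetical format sketched above:

import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.io.Text;

public class PathAwareRead {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(PathAwareRead.class);

    // Read (source path, record) pairs using the custom format sketched above.
    // From.formattedFile with PTypes is my assumption about the Crunch API.
    PTable<Text, Text> records = pipeline.read(
        From.formattedFile("s3://bucket/", PathTextInputFormat.class,
            Writables.writables(Text.class), Writables.writables(Text.class)));

    // Downstream DoFns / groupByKey would then see which file each record came from.

    pipeline.done();
  }
}

Is a wrapper format like this the idiomatic way to do it, or does Hadoop/Crunch already ship something for this?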