
I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The names of the files contain a small but crucial piece of information that is not present in the files themselves. Is there a way to learn, inside a map function, the name of the input file from which a given key-value pair comes?

I'm looking for an equivalent of this Java code:

FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();

Thanks in advance!

Bolo

2 Answers


The `map.input.file` property will give you the input file name.

According to Hadoop: The Definitive Guide:

The properties can be accessed from the job’s configuration, obtained in the old MapReduce API by providing an implementation of the configure() method for Mapper or Reducer, where the configuration is passed in as an argument. In the new API, these properties can be accessed from the context object passed to all methods of the Mapper or Reducer.

Praveen Sripati
  • And more information can be found in Praveen's previous answer to a similar question: http://stackoverflow.com/questions/7449756/get-input-file-name-in-streaming-hadoop-program – Chris White Jul 11 '12 at 18:11
  • Thanks, @PraveenSripati and @ChrisWhite, this is exactly what I needed! To state it explicitly for future visitors: `fileName = os.environ['map_input_file']` does the trick. – Bolo Jul 11 '12 at 21:39
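Putting the comment's one-liner in context: a minimal sketch of how you might read that variable from inside a mapper. It assumes Hadoop Streaming (1.x) exports the `map.input.file` job property to the task environment with dots replaced by underscores; the helper name is mine.

```python
import os


def input_file_name():
    """Return the input file path for the current map task.

    Hadoop Streaming exports the map.input.file job property to each
    task's environment, with dots replaced by underscores. Raises
    KeyError if called outside a Streaming task.
    """
    return os.environ['map_input_file']


# Inside an mrjob mapper you would call it once per record, e.g.:
#
#     def mapper(self, _, line):
#         yield input_file_name(), line
```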

If you are using Hadoop 2.x with Python:

file_name = os.environ['mapreduce_map_input_file']
Boggio
  • Are these listed somewhere online or do I have to browse through the source code to find them?! – masu Sep 12 '14 at 01:18
  • @masu, I think these properties are set automatically by the Hadoop Streaming framework: https://stackoverflow.com/a/7452439/1389089 http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Configured+Parameters – shapiy May 08 '19 at 10:11
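Since the property was renamed between Hadoop versions (`map.input.file` became `mapreduce.map.input.file` in 2.x), a sketch of a version-agnostic lookup, assuming Hadoop Streaming's usual dots-to-underscores environment mapping (the helper name is mine):

```python
import os


def input_file_name():
    """Return the current map task's input file path on Hadoop 1.x or 2.x.

    Checks the 2.x variable name first, then falls back to the 1.x name.
    """
    for var in ('mapreduce_map_input_file', 'map_input_file'):
        if var in os.environ:
            return os.environ[var]
    raise KeyError('input file name not found in the task environment')
```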