
I am streaming an R mapreduce job and I need to get the filename. I know that Hadoop sets environment variables for the current job before it starts, and I can access env vars in R with Sys.getenv().

I found this question: Get input file name in streaming hadoop program

and Sys.getenv("mapred_job_id") works fine, but it is not what I need. I just need the filename, not the job id or name. I also found: How to get filename when running mapreduce job on EC2?

But this isn't helpful either. What is the easiest way to get the current filename while streaming from R? Thank you.

Jason

1 Answer

I have not tried this, but from the second link you provided, it seems that this is available in an environment variable called map.input.file. Then, this should work:

Sys.getenv("map.input.file")

EDIT: Upon further investigation, I learned that you need to replace the dots with underscores, so this is the way to do it:

Sys.getenv("map_input_file")

However, the map.input.file property has been deprecated in YARN (Hadoop 2.x), so the new name should be used instead:

Sys.getenv("mapreduce_map_input_file")
cabad
  • It doesn't work, and I think it is because that post is using Amazon Elastic MapReduce, which has additional support. I also tried "map_input_file" and it had the exact same result. – Jason Jan 04 '14 at 01:45
  • @jason Are you using YARN? If so, you should use the new property name. See my updated answer. – cabad Jan 04 '14 at 02:26
  • 2
    I figured it out. "map_input_file" was working correctly, but its returning the full path and that is causing things to break and that's why my job is crashing. Thanks for your help. – Jason Jan 04 '14 at 03:03
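As the last comment notes, the environment variable holds the full input path, not just the filename. If only the base name is needed, R's basename() strips the directory part (the example path below is illustrative):

```r
# The env var returns a full path such as "hdfs://nn/user/in/part-00000";
# basename() keeps only the final path component.
path <- Sys.getenv("mapreduce_map_input_file")
filename <- basename(path)
```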