I see references to people writing EMR output to HDFS, but I haven't been able to find an example of how it's done. On top of that, the EMR documentation seems to say that the --output parameter for an EMR streaming job must be an S3 bucket.

When I actually try to run a script (in this case, a Python streaming job using mrjob), it throws an "Invalid S3 URI" error.

Here's the command:

python my_script.py -r emr \
 --emr-job-flow-id=j-JOBID --conf-path=./mrjob.conf --no-output \
 --output hdfs:///my-output \
 hdfs:///my-input-directory/my-files*.gz
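
For reference, the mrjob.conf passed via --conf-path is only runner configuration; a hypothetical minimal sketch for the EMR runner, with placeholder values, might look like the following (option names vary between mrjob versions, so this is illustrative only):

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    aws_region: us-east-1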

And the traceback...

Traceback (most recent call last):
  File "pipes/sampler.py", line 28, in <module>
    SamplerJob.run()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 206, in run_job
    with self.make_runner() as runner:
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 524, in make_runner
    return super(MRJob, self).make_runner()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 161, in make_runner
    return EMRJobRunner(**self.emr_job_runner_kwargs())
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 585, in __init__
    self._output_dir = self._check_and_fix_s3_dir(self._output_dir)
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 776, in _check_and_fix_s3_dir
    raise ValueError('Invalid S3 URI: %r' % s3_uri)
ValueError: Invalid S3 URI: 'hdfs:///input/sample'

How can I write the output of an EMR streaming job to HDFS? Is it even possible?

Abe
  • This is an old issue but probably still relevant. Looking at the mrjob sources, EMRJobRunner only accepts S3 buckets as the output destination. Since you are using a "long lived" cluster, there *may* be a solution using HadoopJobRunner instead (`-r hadoop`), as sketched below. I wasn't able to get a working solution, though... – Sylvain Leroux Mar 03 '16 at 14:09
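
A rough, untested sketch of that HadoopJobRunner approach, assuming the script and mrjob are available on the EMR master node (where Hadoop is installed and the input already sits in HDFS); the paths below are illustrative only:

# run from the master node so mrjob can use the local hadoop installation
python my_script.py -r hadoop \
 --output-dir hdfs:///my-output \
 hdfs:///my-input-directory/my-files*.gz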

3 Answers


I am not sure how it can be done using mrjob, but with Hadoop and streaming jobs written in Java, we do it as follows:

  1. Launch the cluster
  2. Copy the data from S3 to the cluster's HDFS using s3distcp
  3. Execute step 1 of our job with the HDFS path as input
  4. Execute step 2 of our job with the same input as above, and so on

Using the EMR CLI, we do it as follows:

export jobflow=$(elastic-mapreduce --create --alive --plain-output \
  --master-instance-type m1.small --slave-instance-type m1.xlarge --num-instances 21 \
  --name "Cluster Name" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--mapred-config-file,s3://myBucket/conf/custom-mapred-config-file.xml")

elastic-mapreduce -j $jobflow \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg --src --arg 's3://myBucket/input/' --arg --dest --arg 'hdfs:///input'

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step1.jar \
  --arg hdfs:///input --arg hdfs:///output-step1 --step-name "Step 1"

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step2.jar \
  --arg hdfs:///input,hdfs:///output-step1 --arg s3://myBucket/output/ --step-name "Step 2"
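
Note that the elastic-mapreduce Ruby CLI has since been superseded by the aws emr subcommands of the AWS CLI. A roughly equivalent, untested sketch follows; the release label, instance types, key name, and cluster id are placeholders, not values from the setup above:

# create a long-lived cluster (the equivalent of --create --alive)
aws emr create-cluster --name "Cluster Name" --release-label emr-4.7.0 \
  --applications Name=Hadoop --use-default-roles --ec2-attributes KeyName=myKey \
  --instance-type m4.xlarge --instance-count 21 --no-auto-terminate

# copy the input from S3 into HDFS, then run the job steps against HDFS paths
aws emr add-steps --cluster-id j-XXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name="S3DistCp",Jar=command-runner.jar,Args=["s3-dist-cp","--src","s3://myBucket/input/","--dest","hdfs:///input"] \
  Type=CUSTOM_JAR,Name="Step 1",Jar=s3://myBucket/bin/step1.jar,Args=["hdfs:///input","hdfs:///output-step1"]
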
Amar

It must be an S3 bucket because an EMR cluster does not normally persist after the job is done, so the only way to keep the output is to write it outside the cluster, and the closest such place is S3.

kgu87
  • I'm running the jobflow in "keep-alive" mode, so results can persist in HDFS between jobflow steps. The structure of my jobs requires using the same (large) data sets as an input to many steps in the flow. It would save a lot of time if the data were stored in HDFS, instead of re-downloading it from S3 at every step. – Abe May 25 '13 at 20:24
  • I see. I am no Python expert, but the code for MRJobRunner (the superclass of EMRJobRunner) seems to suggest that you do not need to specify 'hdfs://' as part of the output parameter, just the location - https://github.com/Yelp/mrjob/blob/master/mrjob/emr.py – kgu87 May 25 '13 at 20:49

Saving the output of an mrjob EMR job to HDFS is not currently possible. There is an open feature request for this at https://github.com/Yelp/mrjob/issues/887.