3

I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script name using s3://myclusterbucket/scripts/script.py. Is output not automatically uploaded to S3? How are dependencies handled? I've tried using `--py-files` pointing to a dependency zip inside the S3 bucket, but I keep getting back 'file not found'.
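For context, a common cause of 'file not found' here is passing `--py-files` after the script name, since spark-submit treats everything after the application script as arguments to the script itself. Below is a minimal sketch of the step as the `Steps` payload that boto3's EMR `add_job_flow_steps()` accepts; the `deps.zip` and output paths are hypothetical placeholders:

```python
# Sketch of an EMR "spark-submit" step with --py-files, expressed as the
# Steps payload that boto3's EMR client add_job_flow_steps() expects.
# The deps.zip and output prefix below are hypothetical placeholders.

def spark_submit_step(script_s3_uri, py_files_s3_uri, output_s3_uri):
    """Build one EMR step dict that runs spark-submit via command-runner.jar."""
    return {
        "Name": "run-script",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # --py-files is an option to spark-submit itself, so it must
                # come BEFORE the application script:
                "--py-files", py_files_s3_uri,
                script_s3_uri,
                # anything after the script is an argument to script.py;
                # the script can write its results to this S3 prefix itself:
                output_s3_uri,
            ],
        },
    }

step = spark_submit_step(
    "s3://myclusterbucket/scripts/script.py",
    "s3://myclusterbucket/scripts/deps.zip",
    "s3://myclusterbucket/output/",
)
```

Output is not uploaded automatically: the script has to write to an `s3://` path (or you copy results out afterwards).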

CBredlow
  • 2,790
  • 2
  • 28
  • 47
  • 2
  • Your question is very generic. The best way to read the data depends on how big it is, what you want to do with it, and what format it is in. The most generic way to move data to and from S3 is with the AWS command line tools (https://aws.amazon.com/cli/). With it you can copy your data with something like `aws s3 cp myfile.txt s3://mybucket/myfile.txt`. As for the output, that depends on where your script writes to. – Roberto Congiu Nov 09 '17 at 22:22
  • What OP is saying is that AWS EMR supports running spark-submit as a step and has auto terminate on completion. @RobertoCongiu so how do we use `s3 cp src dest` to move the output to s3 automatically on completion. And how do we specify the input s3 folder. Does it overwrite files in s3? Preferably without s3-dist-cp. The objective is to have a one-click automation for getting output for spark in an s3 folder. – devssh Sep 06 '18 at 08:49
  • I guess the problem could be solved by adding `s3-dist-cp` as steps before and after the `spark-submit`. Can someone elaborate on how they have done that? – devssh Sep 06 '18 at 08:56

1 Answer

2

MapReduce or Tez jobs in EMR can access S3 directly because of EMRFS (an AWS proprietary Hadoop filesystem implementation backed by S3). For example, in Apache Pig you can do `loaded_data = LOAD 's3://mybucket/myfile.txt' USING PigStorage();`.

I'm not sure about Python-based Spark jobs, but one solution is to first copy the objects from S3 to HDFS on the EMR cluster, and then process them there.

There are multiple ways of doing the copy:

You can use `s3-dist-cp` (preinstalled on EMR) to copy objects between S3 and HDFS, e.g., `s3-dist-cp --src s3://mybucket/data/ --dest hdfs:///data/`.

You can also use awscli (or `hadoop fs -copyToLocal`) to copy objects from S3 to the EMR master node's local disk (and vice versa), e.g., `aws s3 cp s3://mybucket/myobject .`
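To address the comments above about automating the copy-in/copy-out: the `s3-dist-cp` runs can themselves be submitted as EMR steps around the spark-submit step. A minimal sketch, again using the payload shape of boto3's `add_job_flow_steps()`; all the S3 and HDFS paths here are hypothetical:

```python
# Sketch: wrap the Spark job with s3-dist-cp copy-in and copy-out steps,
# as asked in the comments. Same boto3 add_job_flow_steps() payload shape;
# all S3/HDFS paths are hypothetical placeholders.

def s3_dist_cp_step(name, src, dest):
    """One EMR step that invokes s3-dist-cp via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

steps = [
    # 1. copy input from S3 into HDFS before the job runs
    s3_dist_cp_step("copy-in", "s3://mybucket/input/", "hdfs:///input/"),
    # 2. (the spark-submit step goes here: reads hdfs:///input/,
    #     writes hdfs:///output/)
    # 3. copy results back to S3 after the job finishes
    s3_dist_cp_step("copy-out", "hdfs:///output/", "s3://mybucket/output/"),
]
```

Steps run in order, so with auto-terminate enabled the cluster copies the output to S3 and then shuts down without any manual intervention.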

Yi Ou
  • 3,242
  • 1
  • 11
  • 12