3

I'm spinning up an EMR cluster and I've created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I specify the script name using s3://myclusterbucket/scripts/script.py. Is output not automatically uploaded to S3? How are dependencies handled? I've tried using `--py-files` pointing to a dependency zip inside the S3 bucket, but I keep getting back 'file not found'.
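For context, a common cause of 'file not found' here is passing `--py-files` after the script name, since spark-submit treats everything after the application script as arguments to the script itself. Below is a minimal sketch of the step as the `Steps` payload that boto3's EMR `add_job_flow_steps()` accepts; the `deps.zip` and output paths are hypothetical placeholders:

```python
# Sketch of an EMR "spark-submit" step with --py-files, expressed as the
# Steps payload that boto3's EMR client add_job_flow_steps() expects.
# The deps.zip and output prefix below are hypothetical placeholders.

def spark_submit_step(script_s3_uri, py_files_s3_uri, output_s3_uri):
    """Build one EMR step dict that runs spark-submit via command-runner.jar."""
    return {
        "Name": "run-script",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # --py-files is an option to spark-submit itself, so it must
                # come BEFORE the application script:
                "--py-files", py_files_s3_uri,
                script_s3_uri,
                # anything after the script is an argument to script.py;
                # the script can write its results to this S3 prefix itself:
                output_s3_uri,
            ],
        },
    }

step = spark_submit_step(
    "s3://myclusterbucket/scripts/script.py",
    "s3://myclusterbucket/scripts/deps.zip",
    "s3://myclusterbucket/output/",
)
```

Output is not uploaded automatically: the script has to write to an `s3://` path (or you copy results out afterwards).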

CBredlow
  • 2,790
  • 2
  • 28
  • 47
  • 2
  • Your question is very generic. The best way to read the data depends on how big it is, what you want to do with it, and what format it is in. The most generic way to move data to and from S3 is with the AWS command line tools (https://aws.amazon.com/cli/). With it you can copy your data with something like `aws s3 cp myfile.txt s3://mybucket/myfile.txt`. As for the output, that depends on where your script writes to. – Roberto Congiu Nov 09 '17 at 22:22
  • What OP is saying is that AWS EMR supports running spark-submit as a step and has auto terminate on completion. @RobertoCongiu so how do we use `s3 cp src dest` to move the output to s3 automatically on completion. And how do we specify the input s3 folder. Does it overwrite files in s3? Preferably without s3-dist-cp. The objective is to have a one-click automation for getting output for spark in an s3 folder. – devssh Sep 06 '18 at 08:49
  • I guess the problem could be solved by adding `s3-dist-cp` as steps before and after the `spark-submit`. Can someone elaborate on how they have done that? – devssh Sep 06 '18 at 08:56

1 Answer

2

MapReduce or Tez jobs in EMR can access S3 directly because of EMRFS (an AWS proprietary Hadoop filesystem implementation backed by S3). For example, in Apache Pig you can do `loaded_data = LOAD 's3://mybucket/myfile.txt' USING PigStorage();`.

I'm not sure about Python-based Spark jobs, but one solution is to first copy the objects from S3 to HDFS on the EMR cluster, and then process them there.

There are multiple ways of doing the copy:

You can use `s3-dist-cp` (preinstalled on EMR) to copy objects between S3 and HDFS, e.g., `s3-dist-cp --src s3://mybucket/data/ --dest hdfs:///data/`.

You can also use awscli (or `hadoop fs -copyToLocal`) to copy objects from S3 to the EMR master node's local disk (and vice versa), e.g., `aws s3 cp s3://mybucket/myobject .`
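To address the comments above about automating the copy-in/copy-out: the `s3-dist-cp` runs can themselves be submitted as EMR steps around the spark-submit step. A minimal sketch, again using the payload shape of boto3's `add_job_flow_steps()`; all the S3 and HDFS paths here are hypothetical:

```python
# Sketch: wrap the Spark job with s3-dist-cp copy-in and copy-out steps,
# as asked in the comments. Same boto3 add_job_flow_steps() payload shape;
# all S3/HDFS paths are hypothetical placeholders.

def s3_dist_cp_step(name, src, dest):
    """One EMR step that invokes s3-dist-cp via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

steps = [
    # 1. copy input from S3 into HDFS before the job runs
    s3_dist_cp_step("copy-in", "s3://mybucket/input/", "hdfs:///input/"),
    # 2. (the spark-submit step goes here: reads hdfs:///input/,
    #     writes hdfs:///output/)
    # 3. copy results back to S3 after the job finishes
    s3_dist_cp_step("copy-out", "hdfs:///output/", "s3://mybucket/output/"),
]
```

Steps run in order, so with auto-terminate enabled the cluster copies the output to S3 and then shuts down without any manual intervention.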

Yi Ou
  • 3,242
  • 1
  • 11
  • 12