
I have run into some difficulty trying to run a Spark MLlib example through pyspark on an AWS EMR cluster, reading a file from S3. It is clearly a permissions issue (403 error), but at this point I could use a hand figuring out where to look. Here is my setup:

  1. Set up IAM: I created a user "dmoccia" under IAM Users.
  2. I created a "machineLearning" group under IAM and added "dmoccia" to it.
  3. I added the following managed policies to the "machineLearning" group: AmazonEC2FullAccess, AmazonS3FullAccess, AmazonElasticMapReduceFullAccess, AmazonElasticMapReduceRole, AmazonElasticMapReduceForEC2Role.
  4. I then created a bucket "SparkTestingXYZ" and dropped an example data file from Spark MLlib into it (I was planning to test with the simple linear regression model from the docs).
  5. I created the default roles for EMR from the AWS CLI:

    $ aws emr create-default-roles 
    

     I followed this up by attaching the same policies to the two roles it created (EMR_DefaultRole and EMR_EC2_DefaultRole): AmazonEC2FullAccess, AmazonS3FullAccess, AmazonElasticMapReduceFullAccess, AmazonElasticMapReduceRole, AmazonElasticMapReduceForEC2Role.
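
     In case the exact commands matter, this is roughly how I scripted the attachments (a sketch; I am only showing one policy ARN here, and note the two service-role policies live under the service-role/ path):

        $ # attach a managed policy to the group from step 3
        $ aws iam attach-group-policy --group-name machineLearning \
            --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
        $ # attach the same policy to the EC2 role from step 5
        $ aws iam attach-role-policy --role-name EMR_EC2_DefaultRole \
            --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess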

  6. I then went ahead and fired up a cluster via the AWS CLI:

    $ aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark --ec2-attributes KeyName=AWS_Spark_Test --instance-type m3.xlarge --instance-count 3 --use-default-roles
    
  7. I then SSHed into the master, fired up pyspark, and tried the following code:

     from pyspark.sql import SparkSession
     spark = SparkSession\
        .builder\
        .appName("PythonSQL")\
        .config("spark.some.config.option", "some-value")\
        .getOrCreate()
     training = spark.read.format("libsvm").load("s3://s3.amazonaws.com/SparkTestingXYZ/sample_linear_regression_data.txt")
    

which in turn produces a long stack trace; here is the relevant portion:

 16/08/12 17:44:19 WARN DataSource: Error while looking for metadata directory.
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 147, in load
     return self._df(self._jreader.load(path))
   File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
     return f(*a, **kw)
   File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o49.load.
 : java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: XXXXXXXXXXX), S3 Extended Request ID: OnI4tyk8Uun6JbD0iVnEi+kvjDfSK2U=
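
For what it's worth, here is the kind of sanity check that can be run from the master node to see whether the instance profile itself can reach the bucket (a sketch, assuming the AWS CLI on the master picks up the instance profile credentials):

    $ # list the bucket with whatever credentials the instance resolves
    $ aws s3 ls s3://SparkTestingXYZ/
    $ # and try copying the file down directly
    $ aws s3 cp s3://SparkTestingXYZ/sample_linear_regression_data.txt /tmp/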

From here I even tried explicitly naming the roles when launching the cluster:

 $ aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=AWS_Spark_Test --instance-type m3.xlarge --instance-count 3 
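
I also wanted to confirm the instance profile actually made it onto the nodes; as I understand it, the EC2 instance metadata endpoint should report the attached role (again, just a sketch):

    $ # which role, if any, is attached to this instance
    $ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
    $ # the temporary credentials for that role (use the name returned above)
    $ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/EMR_EC2_DefaultRole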

At this point I am not sure which permission I need to set, or whether I am missing a policy entirely. If someone could point me in the right direction I would really appreciate it. I figure once I can read the file I can dial back some of the full-access policies, but for now I just want the read to work. One last thing: I did create both the S3 bucket and the cluster in the same region (US Standard / us-east-1). Thank you if you made it this far!
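
For reference, this is roughly the scoped-down policy I would like to end up with once reads work (a sketch; the policy name s3ReadSparkTesting is just a placeholder I made up):

    $ cat > s3-read-policy.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::SparkTestingXYZ",
            "arn:aws:s3:::SparkTestingXYZ/*"
          ]
        }
      ]
    }
    EOF
    $ aws iam put-role-policy --role-name EMR_EC2_DefaultRole \
        --policy-name s3ReadSparkTesting --policy-document file://s3-read-policy.json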

Dennis
  • Quick update: I was able to access a public S3 bucket (https://s3.amazonaws.com/elasticmapreduce/samples/wordcount/wordSplitter.py), so I assume the permissions are OK and I somehow set up my bucket incorrectly. I tried making it public, but still could not read from it. – Dennis Aug 15 '16 at 22:34
  • I was able to get the file via curl once the S3 bucket was made public, but still had no success when attempting to pull it into Python. – Dennis Aug 15 '16 at 22:49
  • I have solved a good portion of this issue; I plan on posting the full solution once I have done more testing using "Steps" in EMR. I would like to avoid SSHing into the cluster in order to run code. – Dennis Aug 17 '16 at 18:54
  • I solved this issue using the s3a protocol + ProfileCredentialsProvider (http://stackoverflow.com/a/43910762/1879686). – rdllopes May 11 '17 at 08:59
  • Thanks for commenting on this, @rdllopes. – Dennis May 12 '17 at 11:37
