I have run into some difficulty trying to run a Spark MLlib example through pyspark on an AWS EMR cluster, using a file stored on S3. It is clearly a permissions issue (403 error), but at this point I could use a hand figuring out where to look. Here is my setup:
- IAM setup: I created a user "dmoccia" under IAM Users
- I created a "machineLearning" group under IAM and added "dmoccia" to it
- I attached the following managed policies to the "machineLearning" group (roughly as sketched after this list): AmazonEC2FullAccess, AmazonS3FullAccess, AmazonElasticMapReduceFullAccess, AmazonElasticMapReduceRole, AmazonElasticMapReduceForEC2Role
- I then created a bucket "SparkTestingXYZ" and dropped an example data file from Spark MLlib into it (I was planning on testing with the simple linear regression model from the docs)
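For reference, attaching those policies from the CLI would look roughly like this (policy ARN shown for illustration; the same call repeated for each policy in the list):
$ aws iam attach-group-policy --group-name machineLearning --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess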
I created the default roles for EMR from the AWS CLI:
$ aws emr create-default-roles
I followed this up by attaching the following policies to the two roles that were created (EMR_DefaultRole and EMR_EC2_DefaultRole): AmazonEC2FullAccess, AmazonS3FullAccess, AmazonElasticMapReduceFullAccess, AmazonElasticMapReduceRole, AmazonElasticMapReduceForEC2Role
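A single one of those role attachments would look roughly like this (role name from above, policy ARN for illustration; repeated for each policy and both roles):
$ aws iam attach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess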
I then went ahead and fired up a cluster via the AWS CLI:
$ aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark --ec2-attributes KeyName=AWS_Spark_Test --instance-type m3.xlarge --instance-count 3 --use-default-roles
I then SSHed into the master node and fired up pyspark.
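For completeness, the connection was along these lines (key file name assumed to match the KeyName above, master public DNS elided; hadoop is the default EMR SSH user):
$ ssh -i AWS_Spark_Test.pem hadoop@<master-public-dns>
$ pyspark
In the pyspark shell I then tried the following code: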
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonSQL") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

training = spark.read.format("libsvm").load("s3://s3.amazonaws.com/SparkTestingXYZ/sample_linear_regression_data.txt")
which in turn produces the stack trace below (trimmed to the relevant part):
16/08/12 17:44:19 WARN DataSource: Error while looking for metadata directory.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 147, in load
return self._df(self._jreader.load(path))
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError:
An error occurred while calling o49.load.: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Excepti on:
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: XXXXXXXXXXX), S3 Extended Request ID: OnI4tyk8Uun6JbD0iVnEi+kvjDfSK2U=
From here I even tried specifying the roles explicitly when launching the cluster:
$ aws emr create-cluster --name "Spark cluster" --release-label emr-5.0.0 --applications Name=Spark --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=AWS_Spark_Test --instance-type m3.xlarge --instance-count 3
At this point I am not sure which permission I need to set, or whether I am missing a policy entirely; if someone could point me in the right direction I would really appreciate it. I figure that once I can read the file, I can dial back some of the full-access policies, but for now I just want the read to work. One last thing: I did fire up both the S3 bucket and the cluster in the same region (US Standard / us-east-1). Thank you if you made it this far!
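In case it matters, checking the bucket region from the CLI would look roughly like this (bucket name as above):
$ aws s3api get-bucket-location --bucket SparkTestingXYZ
For a bucket in US Standard / us-east-1 this returns a null LocationConstraint.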