1

I want to access S3 from Spark without configuring any secret and access keys; I want to access it by configuring an IAM role, so I followed the steps given in s3-spark.

But it is still not working from my EC2 instance (which is running standalone Spark).

It works when I test it with the AWS CLI:

[ec2-user@ip-172-31-17-146 bin]$ aws s3 ls s3://testmys3/
2019-01-16 17:32:38        130 e.json

but it does not work when I try the following:

scala> val df = spark.read.json("s3a://testmys3/*")

I am getting the error below:

19/01/16 18:23:06 WARN FileStreamSink: Error while looking for metadata directory.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: E295957C21AFAC37, AWS Error Code: null, AWS Error Message: Bad Request
  at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
  at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
scoder
  • I don't have any expertise in spark but what happens if you specify the file itself like s3a://testmys3/e.json – sudo Jan 16 '19 at 20:25
  • Why are you not using EMR for this? – Thiago Baldim Jan 16 '19 at 23:49
  • @Thiago: maybe he's not using EMR because he wants his own build of Spark not whatever closed source fork the EMR team offer, or using a spark release provided by the ASF or someone else. Or he wants to use the S3A connector which is now moving ahead of EMR's closed-source s3 connector and comes with stackoverflow and apache JIRA support? – stevel Jan 17 '19 at 12:10

3 Answers

4

This config worked:

    ./spark-shell \
        --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
        --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
        --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
        --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
        --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
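If you would rather set this in code than on the spark-shell command line, the same options can be applied to the session's Hadoop configuration. A minimal Scala sketch, assuming the same bucket, endpoint, and instance-profile setup as in the question; note the V4 system property still needs to reach executor JVMs via spark.executor.extraJavaOptions:

    import org.apache.spark.sql.SparkSession

    object S3AReadWithInstanceProfile {
      def main(args: Array[String]): Unit = {
        // Driver-side only; executors still need
        // spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
        System.setProperty("com.amazonaws.services.s3.enableV4", "true")

        val spark = SparkSession.builder()
          .appName("s3a-iam-role-read")
          .getOrCreate()

        // Same settings as the --conf flags above, applied programmatically.
        val hc = spark.sparkContext.hadoopConfiguration
        hc.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
        hc.set("fs.s3a.aws.credentials.provider",
          "com.amazonaws.auth.InstanceProfileCredentialsProvider")

        // Bucket from the question; replace with your own.
        val df = spark.read.json("s3a://testmys3/*")
        df.show()
      }
    }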

Shubham Jain
  • Thank you, this helped me out. Small note: I was able to get this to work with a KMS-encrypted S3 bucket by using the AWS packages from Hadoop 2.9.2: aws-java-sdk-bundle-1.11.199.jar and hadoop-aws-2.9.2.jar – vertigokidd May 07 '19 at 17:37
  • Thanks, this worked. I have made some changes here: [link](https://stackoverflow.com/a/62260235/11768156) – Pranjal Gharat Jun 08 '20 at 10:44
1

"400 Bad Request" is fairly unhelpful, and not only does S3 not provide much, the S3A connector doesn't date print much related to auth either. There's a big section on troubleshooting the error

The fact that it got as far as making a request means it has some credentials; the far end just doesn't like them.

Possibilities

  • your IAM role doesn't have the permissions for s3:ListBucket. See IAM role permissions for working with s3a
  • your bucket name is wrong
  • There are some settings in fs.s3a or the AWS_ env vars which take priority over the IAM role, and they are wrong.

You should automatically have IAM auth as an authentication mechanism with the S3A connector; it's the one checked last, after config and env vars.

  1. Have a look at what is set in fs.s3a.aws.credentials.provider: it must be unset or contain the option com.amazonaws.auth.InstanceProfileCredentialsProvider (a quick spark-shell check is sketched below).
  2. Assuming you also have hadoop on the command line, grab storediag and run:

    hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag s3a://testmys3/

It should dump what it is up to regarding authentication.
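If you don't have the cloudstore JAR to hand, a quick check from spark-shell (just a sketch, not a replacement for storediag) is to print the auth-related settings and environment variables the session actually sees; the property and variable names below are the ones discussed above:

    // Print the S3A auth settings seen by this session; secret values are not echoed.
    val conf = spark.sparkContext.hadoopConfiguration
    Seq("fs.s3a.aws.credentials.provider", "fs.s3a.endpoint",
        "fs.s3a.access.key", "fs.s3a.secret.key").foreach { k =>
      val v = Option(conf.get(k))
      val shown = if (k.endsWith(".key")) v.map(_ => "<set>").getOrElse("<unset>")
                  else v.getOrElse("<unset>")
      println(s"$k = $shown")
    }
    // Env vars are also checked before the instance profile.
    Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY").foreach { k =>
      println(s"$k is " + (if (sys.env.contains(k)) "set" else "not set"))
    }

If any of the key settings or env vars turn out to be set, they win over the IAM role, which matches the third possibility above.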

Update

As the original poster commented, it was due to v4 authentication being required by the specific S3 endpoint. This can be enabled in the 2.7.x version of the S3A client, but only via Java system properties. For 2.8+ there are fs.s3a. options you can set instead.

stevel
  • thanks, it worked with below config $./spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-2.amazonaws.com --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true – scoder Jan 17 '19 at 17:38
  • I get it. V4 endpoint like frankfurt, london, korea. As you've found, it takes effort in the 2.7.x hadoop release to do this. There's explicit support in 2.8+ – stevel Jan 18 '19 at 09:39
0
  • Step 1: Configure the framework Spark runs on (e.g. YARN) in core-site.xml, then restart YARN:

    fs.s3a.aws.credentials.provider = com.cloudera.com.amazonaws.auth.InstanceProfileCredentialsProvider
    fs.s3a.endpoint = s3-ap-northeast-2.amazonaws.com
    fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem

  • Step 2: Test from spark-shell as follows:

    val rdd = sc.textFile("s3a://path/file")
    rdd.count()
    rdd.take(10).foreach(println)

It works for me.
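For a quick test before (or instead of) editing core-site.xml, the same three properties can also be set from spark-shell. A minimal sketch; the shaded provider class name is the one from this answer and appears to be specific to Cloudera builds of hadoop-aws, while on stock Apache Hadoop the usual class is com.amazonaws.auth.InstanceProfileCredentialsProvider:

    // Apply the same properties programmatically, then re-run the read above.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.aws.credentials.provider",
      "com.cloudera.com.amazonaws.auth.InstanceProfileCredentialsProvider")
    hadoopConf.set("fs.s3a.endpoint", "s3-ap-northeast-2.amazonaws.com")
    hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val rdd = sc.textFile("s3a://path/file")
    rdd.take(10).foreach(println)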