
I have installed Spark 2.0 on EC2, and I am using Spark SQL with Scala to retrieve records from DB2 that I want to write to S3. I am passing the access keys to the Spark context. Following is my code:

val df = sqlContext.read.format("jdbc").options(Map("url" -> , "user" -> username, "password" -> password, "dbtable" -> tablename, "driver" -> "com.ibm.db2.jcc.DB2Driver")).option("query", "SELECT * from tablename limit 10").load()
df.write.save("s3n://data-analytics/spark-db2/data.csv")

It throws the following exception:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>1E77C38FA2DB34DA</RequestId><HostId>V4O9sdlbHwfXNFtoQ+Y1XYiPvIL2nTs2PIye5JBqiskMW60yDhHhnBoCHPDxLnTPFuzyKGh1gvM=</HostId></Error>
Caused by: org.jets3t.service.S3ServiceException: Service Error Message.
  at org.jets3t.service.S3Service.putObject(S3Service.java:2358)
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeEmptyFile(Jets3tNativeFileSystemStore.java:162)

What exactly is the problem here, given that I am also passing the access keys to the SparkContext? Is there any other way to write to S3?

Akki
  • From the Access Denied message, it may be that the user does not have sufficient privileges: http://docs.aws.amazon.com/redshift/latest/dg/s3serviceexception-error.html – giaosudau Sep 01 '16 at 14:40

3 Answers


After you get your keys, this is how to write out to S3 with s3n in Scala/Spark 2.

spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "[access key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "[secret key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

df.write
.mode("overwrite")
.parquet("s3n://bucket/folder/parquet/myFile")

This is how to do it with s3a, which is preferred.

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "[access key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "[secret key]")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df.write
.mode("overwrite")
.parquet("s3a://bucket/folder/parquet/myFile")

See this post to understand the differences between s3, s3n, and s3a.

Tony Fraser
  • With pySpark on EMR, I get `'SparkContext' object has no attribute 'hadoopConfiguration'` – wordsforthewise Feb 20 '20 at 19:41
  • That's Scala code and the original question was Scala, but it's similar in concept in PySpark: you get the context object and then set the keys. It's discussed for PySpark here: https://stackoverflow.com/questions/32155617/connect-to-s3-data-from-pyspark – Tony Fraser Feb 21 '20 at 20:07
  • Right, I found the difference afterwards. I was using PySpark, and the syntax is more like `sc._jsc.hadoopConfiguration().set()` – wordsforthewise Feb 21 '20 at 23:50

When you create an EC2 instance or an EMR cluster on AWS, you have the option during the creation process to attach an IAM role to that instance or cluster.

By default, an EC2 instance is not allowed to connect to S3; you need to create a role and attach it to the instance first.

The purpose of attaching an IAM role is that the role can be granted permissions to use other AWS services without storing credentials on the instance itself. Given the access denied error, I assume the instance doesn't have an IAM role attached with sufficient permissions to write to S3 (see the sketch after the steps below).

Here's how you create a new IAM role:

  • Navigate to the AWS Identity and Access Management (IAM) page.
  • Click on Roles and create a new one.
  • Search for S3 in the search bar, then select the AmazonS3FullAccess managed policy (or a more restrictive S3 policy if you prefer).
  • Add permissions for whatever other services you want the role to have, too.
  • Save it.

For a regular single EC2 instance, when you create a new instance:

  • On the instance-creation step where you choose the VPC and subnet, there is a select box for the IAM role; click it and choose your newly created role.
  • Continue and create your instance as you did before. That instance now has permission to write to S3. Voila!

For an EMR cluster:

  • Create your EMR cluster, then navigate to the page showing your new cluster's details. Find the area on the right that says EMR Role, look up that role in IAM, and edit it by adding the S3 full-access permissions.
  • Save your changes.
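
Once a role with S3 permissions is attached, the write itself no longer needs explicit keys. Here is a minimal Scala sketch, assuming `spark` is your SparkSession, `df` is the DataFrame to write, the bucket/path names are placeholders, and the hadoop-aws s3a connector is on the classpath; on some Hadoop versions you may need to point s3a at the instance-profile credentials provider explicitly:

// No access/secret keys are set here; credentials come from the attached IAM role.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider")

df.write
.mode("overwrite")
.parquet("s3a://your-bucket/spark-db2/output")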
Kristian

You may try this:

df.write.mode("append").format("csv").save("path/to/s3/bucket");
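
If the s3a credentials (or an IAM role) are already configured as in the answers above, a slightly fuller sketch with a placeholder bucket might look like this:

// Append CSV output under an S3 prefix; the header option writes column names into each part file.
df.write
.mode("append")
.option("header", "true")
.csv("s3a://your-bucket/spark-db2/csv-output")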
hitttt