PySpark AWS credentials

Question

I'm trying to run a PySpark script that works fine when I run it on my local machine. The issue is that I want to fetch the input files from S3.

No matter what I try, though, I can't find where to set the access key ID and secret. I found some answers that deal with specific files (e.g. "Locally reading S3 files through Spark (or better: pyspark)"), but I want to set the credentials for the whole SparkContext, since I reuse the SQL context all over my code.

So the question is: how do I set the AWS access key and secret in Spark?

P.S. I tried the $SPARK_HOME/conf/hdfs-site.xml and environment variable options. Neither worked.

Thank you

score 14 · Accepted Answer · answered Oct 26 '17 at 19:03

For PySpark, we can set the credentials as shown below:

  sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
  sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
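
To apply the same idea once for the whole context, the two calls can be wrapped in a small helper. This is only a sketch: the function name set_s3_credentials and its scheme parameter are hypothetical, and it relies on the private _jsc attribute just as the snippet above does. The s3a property names are the ones reported to work in another answer in this thread.

```python
# Property names per connector: the s3n names come from the snippet above,
# the s3a names are the ones used with the s3a:// scheme.
_KEY_NAMES = {
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
}


def set_s3_credentials(sc, access_key, secret_key, scheme="s3n"):
    """Set S3 credentials on the Hadoop configuration of an existing SparkContext.

    Hypothetical helper: `sc` is a SparkContext, e.g. SparkContext.getOrCreate().
    Returns the two property names that were set.
    """
    if scheme not in _KEY_NAMES:
        raise ValueError("unsupported scheme: %r" % scheme)
    access_name, secret_name = _KEY_NAMES[scheme]
    hadoop_conf = sc._jsc.hadoopConfiguration()  # private attribute, may change
    hadoop_conf.set(access_name, access_key)
    hadoop_conf.set(secret_name, secret_key)
    return access_name, secret_name
```

It would be called once, before the first read from S3, for example set_s3_credentials(sc, AWS_ACCESS_KEY, AWS_SECRET_KEY).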

Sahil Desai

Thanks, this seems to do it – Roee N Oct 29 '17 at 07:44
Just for future people looking for this, keep in mind that sc is the SparkContext: sc = SparkContext.getOrCreate(conf) – Roee N Oct 29 '17 at 12:38

score 11 · Answer 2 · answered Jan 07 '19 at 19:12

Setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in spark-defaults.conf before establishing a spark session is a nice way to do it.

I also had success with Spark 2.3.2 and a pyspark shell by setting these dynamically from within a Spark session:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

After that, I was able to read from and write to S3 using s3a:

documents = spark.sparkContext.textFile('s3a://bucket_name/key')

The keys "fs.s3a.access.key" and "fs.s3a.secret.key" worked for me where using "fs.s3a.awsAccessKeyId" and "fs.s3a.awsSecretAccessKey" did not. — Lucian Thorr, Mar 13 '19 at 14:23
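
For reference, the same properties can also be supplied while the session is being built. The sketch below assumes the keys live in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (an assumption, not something stated above) and that the hadoop-aws jar is already on the classpath. The names s3a_options and build_session are hypothetical.

```python
import os


def s3a_options(env=os.environ):
    """Return the spark.hadoop.* options for S3A, read from environment variables.

    Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set; raises KeyError if not.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def build_session(app_name="app_name"):
    """Create (or reuse) a SparkSession with the S3A credentials applied."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_options().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Note that getOrCreate() returns an already running session if there is one, so this is meant to run before the first session is created.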

score 5 · Answer 3 · answered Jun 04 '20 at 02:30

I'm not sure if this was true at the time, but as of PySpark 2.4.5 you don't need to access the private _jsc object to set Hadoop properties. You can set Hadoop properties using SparkConf.set(). For example:

import pyspark
conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        .setMaster(SPARK_MASTER)
        .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
        .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)

sc = pyspark.SparkContext(conf=conf)

See https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
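
The linked page describes the rule behind this: a property named spark.hadoop.<name> is passed to the Hadoop configuration as <name>. A small pure-Python illustration of that mapping follows. The function hadoop_properties is hypothetical and not part of PySpark, since Spark does this internally.

```python
PREFIX = "spark.hadoop."


def hadoop_properties(spark_conf_pairs):
    """Mimic how spark.hadoop.* entries become plain Hadoop properties."""
    return {
        key[len(PREFIX):]: value
        for key, value in spark_conf_pairs
        if key.startswith(PREFIX)
    }


pairs = [
    ("spark.app.name", "app_name"),
    ("spark.hadoop.fs.s3a.access.key", "AWS_ACCESS_KEY"),
    ("spark.hadoop.fs.s3a.secret.key", "AWS_SECRET_KEY"),
]
print(hadoop_properties(pairs))
# prints {'fs.s3a.access.key': 'AWS_ACCESS_KEY', 'fs.s3a.secret.key': 'AWS_SECRET_KEY'}
```

With a real SparkConf, conf.getAll() returns pairs in this shape.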

score 0 · Answer 4 · answered Oct 26 '17 at 15:27

You can see a couple of suggestions here: http://www.infoobjects.com/2016/02/27/different-ways-of-setting-aws-credentials-in-spark/

I usually do the 3rd one (set hadoopConfig on the SparkContext), as I want the credentials to be parameters within my code, so that I can run it from any machine.

For example:

JavaSparkContext javaSparkContext = new JavaSparkContext();
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "");
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey","");

AlexM

I was looking for the answer in PySpark... thanks though, it looks like a correct answer – Roee N Oct 29 '17 at 07:45
Sorry about that, I'm so used to working in Java, I completely forgot you asked for Pyspark :) – AlexM Oct 29 '17 at 08:18

score 0 · Answer 5 · answered Nov 13 '18 at 12:29

The approach of adding AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY to hdfs-site.xml should work in principle. Just make sure you run pyspark or spark-submit with the required jars, as follows:

spark-submit --master "local[*]" \
    --driver-class-path /usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
    --jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
    repl-sql-s3-schema-change.py


pyspark --jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar
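
If the script is started with plain python rather than spark-submit, the jar list can in principle be supplied through the spark.jars property instead of --jars. A hedged sketch, reusing the jar paths from the commands above: jars_option and build_context are hypothetical names, and the hadoop-aws version is assumed to match the Hadoop build your Spark uses.

```python
JARS = [
    "/usr/src/app/lib/hadoop-aws-2.6.0.jar",
    "/usr/src/app/lib/aws-java-sdk-1.11.443.jar",
    "/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar",
]


def jars_option(jars):
    """Join jar paths into the comma-separated value that spark.jars expects."""
    return ",".join(jars)


def build_context(jars=JARS, master="local[*]", app_name="app_name"):
    """Create a SparkContext with the extra jars.

    Intended to run before any other context exists in the process.
    """
    import pyspark  # imported lazily; needs pyspark installed

    conf = (
        pyspark.SparkConf()
            .setAppName(app_name)
            .setMaster(master)
            .set("spark.jars", jars_option(jars))
    )
    return pyspark.SparkContext(conf=conf)
```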

score 0 · Answer 6 · answered Jan 08 '19 at 21:57

Setting them in core-site.xml should work, provided the directory containing that file is on the classpath.
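
For illustration, a Hadoop configuration file is a <configuration> element holding <property> entries, each with <name> and <value> children. The sketch below renders such a fragment with the standard library. The s3a property names are taken from another answer in this thread, render_core_site is a hypothetical name, and the values are placeholders rather than real keys.

```python
import xml.etree.ElementTree as ET


def render_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = render_core_site({
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",  # placeholder value
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",  # placeholder value
})
print(xml_text)
```

The resulting file would then go in a directory on the classpath, commonly $SPARK_HOME/conf or the directory that HADOOP_CONF_DIR points to.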

stevel
