There are three properties in my spark-defaults.conf
that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
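For reference, this is roughly how those entries look in spark-defaults.conf today (placeholder values, not real keys):

spark.driver.maxResultSize        5g
spark.hadoop.fs.s3a.access.key    <access>
spark.hadoop.fs.s3a.secret.key    <secret>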
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>'))

spark = (SparkSession.builder
         .config(conf=conf)
         .getOrCreate())
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission-denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
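For comparison, this is roughly the environment-variable setup that does work for me (a minimal sketch; the bucket and path are placeholders):

from pyspark.sql import SparkSession

# Credentials exported in the shell before launching, e.g.
#   export AWS_ACCESS_KEY_ID=<access>
#   export AWS_SECRET_ACCESS_KEY=<secret>
spark = (SparkSession.builder
         .master(spark_master)
         .appName(app_name)
         .getOrCreate())

# Placeholder path; this read succeeds when the credentials come from the environment
df = spark.read.csv('s3a://<bucket>/<path>/file.csv')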
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking that you can set them dynamically via spark.hadoop.*, since that is the name you use to set these properties in spark-defaults.conf, and since you don't get an error directly when you try to set them this way.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf and not dynamically in pyspark.
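(For completeness, the approach those other Q&As usually describe is to set the properties on the live Hadoop configuration after the session is created, roughly as in the sketch below; _jsc is an internal PySpark handle, so treat this as illustrative rather than an officially supported API.)

# Sketch: set the Hadoop-level properties directly on the Hadoop configuration.
# Note the property names drop the spark.hadoop. prefix; _jsc is internal API.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')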