
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:

  • spark.driver.maxResultSize
  • spark.hadoop.fs.s3a.access.key
  • spark.hadoop.fs.s3a.secret.key

Here's my attempt to do so:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>')
        )

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()

print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))

spark.stop()

Here's the output I get:

5g
<access>
<secret>

However when I try to read a csv file on S3 using this configuration, I get a permissions denied error.
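Concretely, the read looks roughly like this (the bucket and path are placeholders):

# Placeholder bucket/path; this is the call that fails with the
# permissions error when the credentials are set via spark.hadoop.*.
df = spark.read.csv('s3a://my-bucket/path/data.csv', header=True)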

If I set the credentials via environment variables, I am able to read the file.
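For the record, one sketch of the environment-variable route, assuming local mode and the standard AWS variable names, and that they are set before the SparkSession (and hence its JVM) is created:

import os

# Standard AWS credential environment variables (placeholder values).
# They must be in the environment before the SparkSession's JVM is
# launched so the S3A credential provider chain can pick them up.
os.environ['AWS_ACCESS_KEY_ID'] = '<access>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret>'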

Why doesn't Hadoop respect the credentials specified this way?

Update:

I am aware of other Q&As relating to setting Hadoop properties in pyspark.

Here I am trying to record for posterity how easily you can be fooled into thinking you can set them dynamically via spark.hadoop.*: that is the prefix you use for these properties in spark-defaults.conf, and you get no error when you set them this way at runtime.

Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't mention that this only works if you set it statically in spark-defaults.conf, not dynamically in pyspark.


1 Answer


It turns out that you can't specify Hadoop properties via:

spark.conf.set('spark.hadoop.<property>', <value>)

but you must instead use:

spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)

I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
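For completeness, a minimal sketch of the working pyspark version (the credentials and the bucket/path below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the properties on the underlying Hadoop configuration,
# dropping the 'spark.hadoop.' prefix (placeholder values).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

# S3A now picks up the credentials (placeholder path).
df = spark.read.csv('s3a://my-bucket/path/data.csv', header=True)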

  • sparkContext.hadoopConfiguration().set(key,value) Check the integration test for http://bytepadding.com/big-data/spark/combineparquetfileinputformat/ – KrazyGautam Mar 13 '17 at 00:02
  • @KrazyGautam – this is for `pyspark` not scala/java. – proinsias Mar 13 '17 at 19:20
  • You can actually set the hadoop related properties via spark conf. At least in 2.2.1 version that I tested in my scala code. So it goes like this: `SparkSession.builder() .master("local[*]") .appName(s"theApplicationName") .config("spark.sql.shuffle.partitions", 1) .config("spark.hadoop.fs.defaultFS", "file:///") .getOrCreate() ` – soMuchToLearnAndShare Dec 11 '19 at 11:47
  • @MinnieShi – this is for pyspark, not scala. – proinsias Dec 13 '19 at 01:30
  • @proinsias Thanks but I couldn't set `fs.defaultFS`. It gives me error `An error occurred while calling z:org.apache.hadoop.fs.FileSystem.get. : java.lang.NullPointerException`. Is it because it's unsupported property which is not suggested in the list of Spark Configuration page? – sngjuk Nov 23 '22 at 14:08