
The accepted answer to this question:

sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')

stopped working due to the removal of the _jsc attribute, as pointed out in a 2019 comment on that answer.

How does one set Hadoop configuration from PySpark in versions where the _jsc attribute has been removed?


1 Answer
  1. As documented (and seen in the code), any config item in the Spark config prefixed with spark.hadoop. is copied over to the Hadoop config with the prefix stripped. So to set my.mapreduce.setting in the Hadoop conf, set spark.hadoop.my.mapreduce.setting in the Spark conf. From the docs:

The better choice is to use spark hadoop properties in the form of spark.hadoop.*, and use spark hive properties in the form of spark.hive.*. For example, adding configuration “spark.hadoop.abc.def=xyz” represents adding hadoop property “abc.def=xyz”, and adding configuration “spark.hive.abc=xyz” represents adding hive property “hive.abc=xyz”. They can be considered as same as normal spark properties...
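
For example, a minimal PySpark sketch (the setting name is the one from the question, the value is arbitrary) that sets the property while building the session and reads it back without touching _jsc:

# python
from pyspark.sql import SparkSession

# the spark.hadoop. prefix is stripped when the value is copied
# into the underlying Hadoop configuration
spark = (
    SparkSession.builder
    .appName("hadoop-conf-example")
    .config("spark.hadoop.my.mapreduce.setting", "someVal")
    .getOrCreate()
)

# read it back through the Spark conf
print(spark.conf.get("spark.hadoop.my.mapreduce.setting"))  # someVal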

  2. To set a Spark config <name> to <value>:
# python
spark.conf.set(<name>, <value>)
// scala
spark.conf.set(<name>, <value>)
# R
library(SparkR)
sparkR.session()
sparkR.session(sparkConfig = list(<name> = "<value>"))
-- SQL
SET <name> = <value>;

On the command line, with spark-submit, pyspark, or spark-shell:

--conf "<name>=<value>"
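
For instance (a hedged sketch; spark.sql.shuffle.partitions is just a stand-in for <name>), the same setting can be applied at runtime or at submit time:

# python -- at runtime, inside an existing session
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64

# the equivalent at submit time:
#   spark-submit --conf "spark.sql.shuffle.partitions=64" my_app.py
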
  3. Precedence. As documented:

Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
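
A hedged sketch of that precedence in PySpark (the command-line value shown in the comment is hypothetical):

# python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# a value set directly on SparkConf ...
conf = SparkConf().set("spark.hadoop.my.mapreduce.setting", "fromSparkConf")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# ... wins even if the job was launched with
#   spark-submit --conf "spark.hadoop.my.mapreduce.setting=fromCommandLine" ...
print(spark.conf.get("spark.hadoop.my.mapreduce.setting"))  # fromSparkConf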


