43

The Scala version of SparkContext has the property

sc.hadoopConfiguration

I have successfully used it to set Hadoop properties (in Scala), e.g.

sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")

However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values in the Hadoop Configuration used by the PySpark context?

Alper t. Turker
WestCoastProjects

3 Answers

87
sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')

should work
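
In context, the one-liner looks like this (a sketch assuming a classic SparkContext where the internal `_jsc` attribute still exists; per the comments below, it may be absent in newer releases):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Reach through the Py4J gateway into the underlying Java SparkContext
# and set the value directly on its Hadoop Configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('my.mapreduce.setting', 'someVal')

# The value is now visible to any Hadoop input/output format the job uses.
print(hadoop_conf.get('my.mapreduce.setting'))  # someVal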

Dmytro Popovych
  • This solution also applies to anyone trying to get their AWS AccessKeyId/SecretAccessKey accepted when using s3n:// addresses. `sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId','')` – Lucian Thorr Nov 14 '16 at 18:00
  • As of today, it seems `_jsc` is no longer available. – Z.Wei Jul 25 '19 at 15:09
  • I've filed [SPARK-33436](https://issues.apache.org/jira/browse/SPARK-33436) to track adding `hadoopConfiguration` directly to the PySpark API, so that using `._jsc` is no longer necessary. – Nick Chammas Nov 12 '20 at 19:12
  • I couldn't set `fs.defaultFS` with this solution. It gives me the error `An error occurred while calling z:org.apache.hadoop.fs.FileSystem.get. : java.lang.NullPointerException`. Is it because it's an unsupported property to set? – sngjuk Nov 23 '22 at 14:14
  • [Added an answer](https://stackoverflow.com/a/75751442/496289) for newer versions of Spark, after `_jsc` was removed. – Kashyap Mar 16 '23 at 01:34
5

You can set any Hadoop property when submitting the job by prefixing it with `spark.hadoop.` in a `--conf` parameter; Spark strips the prefix and copies the rest into the Hadoop Configuration.

--conf "spark.hadoop.fs.mapr.trace=debug"

Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
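
The same mechanism works programmatically: anything set on a SparkConf with the `spark.hadoop.` prefix is copied (prefix stripped) into the Hadoop Configuration when the context starts. A minimal sketch, reusing the illustrative property name from the question:

from pyspark import SparkConf, SparkContext

# "spark.hadoop.my.mapreduce.setting" becomes "my.mapreduce.setting"
# in the Hadoop Configuration of the resulting context.
conf = SparkConf().set('spark.hadoop.my.mapreduce.setting', 'someVal')
sc = SparkContext(conf=conf)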

Alper t. Turker
Harikrishnan Ck
4

I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:

fileLines = sc.newAPIHadoopFile(
    'dev/*',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
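
Note that, as far as I can tell, a conf passed this way applies only to that particular read; it does not change the context-wide Hadoop Configuration.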
Alper t. Turker
WestCoastProjects