43

The Scala version of SparkContext has the property

sc.hadoopConfiguration

I have successfully used it to set Hadoop properties (in Scala), e.g.

sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")

However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values in the Hadoop Configuration used by the PySpark context?

Alper t. Turker
WestCoastProjects

3 Answers

87
sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')

should work
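
In context, the one-liner looks like this (a sketch assuming a classic SparkContext where the internal `_jsc` attribute still exists; per the comments below, it may be absent in newer releases):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Reach through the Py4J gateway into the underlying Java SparkContext
# and set the value directly on its Hadoop Configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('my.mapreduce.setting', 'someVal')

# The value is now visible to any Hadoop input/output format the job uses.
print(hadoop_conf.get('my.mapreduce.setting'))  # someVal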

Dmytro Popovych
  • This solution also applies to anyone trying to get their AWS AccessKeyId/SecretAccessKey accepted when using s3n:// addresses. `sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId','')` – Lucian Thorr Nov 14 '16 at 18:00
  • As of today, it seems `_jsc` is no longer available. – Z.Wei Jul 25 '19 at 15:09
  • I've filed [SPARK-33436](https://issues.apache.org/jira/browse/SPARK-33436) to track adding `hadoopConfiguration` directly to the PySpark API, so that using `._jsc` is no longer necessary. – Nick Chammas Nov 12 '20 at 19:12
  • I couldn't set `fs.defaultFS` with this solution. It gives me the error `An error occurred while calling z:org.apache.hadoop.fs.FileSystem.get. : java.lang.NullPointerException`. Is it because it's an unsupported property to set? – sngjuk Nov 23 '22 at 14:14
  • [Added an answer](https://stackoverflow.com/a/75751442/496289) for newer versions of Spark, after `_jsc` was removed. – Kashyap Mar 16 '23 at 01:34
5

You can set any Hadoop property when submitting the job by prefixing it with `spark.hadoop.` in a `--conf` parameter; Spark strips the prefix and copies the rest into the Hadoop Configuration.

--conf "spark.hadoop.fs.mapr.trace=debug"

Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
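
The same mechanism works programmatically: anything set on a SparkConf with the `spark.hadoop.` prefix is copied (prefix stripped) into the Hadoop Configuration when the context starts. A minimal sketch, reusing the illustrative property name from the question:

from pyspark import SparkConf, SparkContext

# "spark.hadoop.my.mapreduce.setting" becomes "my.mapreduce.setting"
# in the Hadoop Configuration of the resulting context.
conf = SparkConf().set('spark.hadoop.my.mapreduce.setting', 'someVal')
sc = SparkContext(conf=conf)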

Alper t. Turker
Harikrishnan Ck
4

I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:

fileLines = sc.newAPIHadoopFile(
    'dev/*',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
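
Note that, as far as I can tell, a conf passed this way applies only to that particular read; it does not change the context-wide Hadoop Configuration.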
Alper t. Turker
WestCoastProjects