
I need to externalize the Spark configs in our job.conf files so that they are read from an external location at runtime and only have to be modified in that one place.

Configs such as spark.executor.memory, spark.executor.cores, spark.executor.instances, spark.sql.adaptive.enabled, and spark.sql.legacy.timeParserPolicy would be stored in this file.

I am very new to this and am finding very limited resources on the web about handling this process. I've seen a couple of YouTube videos about using a Scala file to handle it. Any assistance would be greatly appreciated.

I have attempted to emulate the Scala examples I have seen online, but I don't know how to call the resulting file from Spark (or even whether the Scala is correct to begin with).
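The gist of what I have pieced together so far is something like the following (the path and object name are just placeholders on my end):

import java.io.FileInputStream
import java.util.Properties

object ExternalSparkConf {
  // Load key=value pairs such as spark.executor.memory=4g from an external file
  def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }
}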

Art Booth

1 Answer


TL;DR:

  • You can put your config in $SPARK_HOME/conf/spark-defaults.conf,
  • or, if you're submitting your jobs explicitly with spark-submit, you can pass individual properties on the command line using --conf (example below).
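
For example, a submit command along these lines (the jar name, main class, and values are placeholders):

spark-submit \
  --class com.example.MyJob \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  --conf spark.executor.instances=10 \
  --conf spark.sql.adaptive.enabled=true \
  my-job.jar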

Spark configuration docs leave a bit to be desired.

As described in the Dynamically Loading Spark Properties section:

bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:

spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
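
spark-submit also accepts a --properties-file option, so the file doesn't have to live under $SPARK_HOME at all; you can keep it in one external location and point every job at it (the path below is a placeholder):

spark-submit --properties-file /etc/myteam/job.conf --class com.example.MyJob my-job.jar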

The official documentation doesn't explicitly mention the location of this file except in passing, in this paragraph related to Hadoop config.

An IBM doc spells it out more explicitly.

Also FYI: How to set hadoop configuration values from pyspark
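
And if you'd rather load the file from your own Scala code instead of relying on spark-submit, a minimal sketch (the path and keys are assumptions, not anything official) would look like this:

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession

object JobRunner {
  def main(args: Array[String]): Unit = {
    // Read key=value pairs from the single external location
    val props = new Properties()
    val in = new FileInputStream("/etc/myteam/job.conf") // placeholder path
    try props.load(in) finally in.close()

    // Apply every property to the session builder before the session is created.
    // Caveat: spark.driver.* settings can't be changed here in client mode,
    // because the driver JVM is already running; pass those via spark-submit.
    var builder = SparkSession.builder().appName("my-job")
    val names = props.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      builder = builder.config(key, props.getProperty(key))
    }
    val spark = builder.getOrCreate()

    // ... job logic goes here ...

    spark.stop()
  }
}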

Kashyap