
I have PySpark code running in a Glue job. The job takes an argument called 'update_mode'. I want to set a different Spark configuration depending on whether update_mode is full_overwrite or upsert. Specifically, I want to switch the config spark.sql.sources.partitionOverwriteMode between static and dynamic. I tried creating two Spark sessions and using the respective spark object, but it doesn't behave as expected. The other option I can think of is creating two separate jobs with different configurations.

Any other ideas to do it in the same job?

Jatin

1 Answer


Never worked with Glue, so not sure how you're submitting your jobs. But as described here, you can configure properties per job by passing --conf spark.sql.sources.partitionOverwriteMode=... on the CLI.

In some cases, you may want to avoid hard-coding certain configurations in a SparkConf. For instance, Spark lets you simply create a conf and set spark.*, spark.hadoop.*, and spark.hive.* properties on it:

import org.apache.spark.{SparkConf, SparkContext}

// spark.hadoop.* properties are forwarded into the underlying Hadoop configuration
val conf = new SparkConf().set("spark.hadoop.abc.def", "xyz")
val sc = new SparkContext(conf)

Also, you can supply or override configurations when submitting the job:

./bin/spark-submit \
     --name "My app" \
     --master local[4] \
     --conf spark.eventLog.enabled=false \
     --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
     --conf spark.hadoop.abc.def=xyz \
     --conf spark.hive.abc=xyz \
     myApp.jar

Or, more dynamically, set them from code at runtime as described here; your code can then read the desired value from wherever you choose (a job argument, in your case).
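Since your job is PySpark, here is a minimal sketch of that runtime approach. spark.sql.sources.partitionOverwriteMode is a runtime-settable SQL config, so you can flip it on the one existing session based on the update_mode argument, with no second session or second job. The argument parsing below is hypothetical (plain argparse); in a Glue job you would presumably read update_mode via awsglue.utils.getResolvedOptions instead, and mapping full_overwrite to static and upsert to dynamic is only a guess at your intent.

import argparse

from pyspark.sql import SparkSession

# Hypothetical argument handling; in a real Glue job you would probably read
# 'update_mode' with awsglue.utils.getResolvedOptions(sys.argv, ["update_mode"]).
parser = argparse.ArgumentParser()
parser.add_argument("--update_mode", choices=["full_overwrite", "upsert"], required=True)
args, _ = parser.parse_known_args()

spark = SparkSession.builder.appName("my-job").getOrCreate()

# partitionOverwriteMode is a runtime SQL config, so it can be switched on the
# existing session. The full_overwrite -> static, upsert -> dynamic mapping is
# an assumption about what you want.
overwrite_mode = "static" if args.update_mode == "full_overwrite" else "dynamic"
spark.conf.set("spark.sql.sources.partitionOverwriteMode", overwrite_mode)

Because it's a session-level SQL conf, the setting applies to any partitioned writes you do afterwards in the same run, which is exactly the per-run switch you're after.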

Kashyap