
I have PySpark code running in a Glue job. The job takes an argument called 'update_mode'. I want to set a different Spark configuration depending on whether update_mode is full_overwrite or upsert. Specifically, I want to switch the config spark.sql.sources.partitionOverwriteMode between static and dynamic. I tried creating two Spark sessions and using the respective spark object, but it doesn't behave as expected. The other option I can think of is creating two separate jobs with different configurations.

Any other ideas to do it in the same job?

Jatin

1 Answer


Never worked with Glue, so not sure how you're submitting your jobs. But as described here, you can configure properties per job by passing --conf spark.sql.sources.partitionOverwriteMode=... on the CLI.

In some cases, you may want to avoid hard-coding certain configurations in a SparkConf. For instance, Spark lets you simply create a conf and set spark.*, spark.hadoop.*, and spark.hive.* properties on it:

import org.apache.spark.{SparkConf, SparkContext}

// spark.hadoop.* properties are forwarded into the underlying Hadoop configuration
val conf = new SparkConf().set("spark.hadoop.abc.def", "xyz")
val sc = new SparkContext(conf)

Also, you can supply or override configurations when submitting the job:

./bin/spark-submit \
     --name "My app" \
     --master local[4] \
     --conf spark.eventLog.enabled=false \
     --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
     --conf spark.hadoop.abc.def=xyz \
     --conf spark.hive.abc=xyz \
     myApp.jar

Or, more dynamically, set them from code at runtime as described here; your code can then read the desired value from wherever you choose (a job argument, in your case).
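Since your job is PySpark, here is a minimal sketch of that runtime approach. spark.sql.sources.partitionOverwriteMode is a runtime-settable SQL config, so you can flip it on the one existing session based on the update_mode argument, with no second session or second job. The argument parsing below is hypothetical (plain argparse); in a Glue job you would presumably read update_mode via awsglue.utils.getResolvedOptions instead, and mapping full_overwrite to static and upsert to dynamic is only a guess at your intent.

import argparse

from pyspark.sql import SparkSession

# Hypothetical argument handling; in a real Glue job you would probably read
# 'update_mode' with awsglue.utils.getResolvedOptions(sys.argv, ["update_mode"]).
parser = argparse.ArgumentParser()
parser.add_argument("--update_mode", choices=["full_overwrite", "upsert"], required=True)
args, _ = parser.parse_known_args()

spark = SparkSession.builder.appName("my-job").getOrCreate()

# partitionOverwriteMode is a runtime SQL config, so it can be switched on the
# existing session. The full_overwrite -> static, upsert -> dynamic mapping is
# an assumption about what you want.
overwrite_mode = "static" if args.update_mode == "full_overwrite" else "dynamic"
spark.conf.set("spark.sql.sources.partitionOverwriteMode", overwrite_mode)

Because it's a session-level SQL conf, the setting applies to any partitioned writes you do afterwards in the same run, which is exactly the per-run switch you're after.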

Kashyap