50

I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources.

 spark = SparkSession.builder \
     .master("ip") \
     .enableHiveSupport() \
     .getOrCreate()

 spark.conf.set("spark.executor.memory", '8g')
 spark.conf.set('spark.executor.cores', '3')
 spark.conf.set('spark.cores.max', '3')
 spark.conf.set("spark.driver.memory",'8g')
 sc = spark.sparkContext

It works fine when I put the configuration in spark-submit:

spark-submit --master ip --executor-cores 3 --driver-memory 10G code.py
Harish
  • What is the resource manager? Spark Standalone or YARN? – mrsrinivas Jan 27 '17 at 03:22
  • Another way with 2.0 is `conf = SparkConf().set("spark.executor.cores", "3"); spark = SparkSession.builder.master("ip").config(conf=conf).enableHiveSupport().getOrCreate()` – mrsrinivas Jan 27 '17 at 04:10
  • Sorry, I tried both with no luck. Can you try once? I just updated my Spark to a 2.2.0 snapshot to overcome the 64KB code size issue (SPARK-16845). – Harish Jan 27 '17 at 04:29

6 Answers

59

You aren't actually overwriting anything with this code. Just so you can see for yourself, try the following.

As soon as you start the pyspark shell, type:

sc.getConf().getAll()

This will show you all of the current config settings. Then try your code and do it again. Nothing changes.
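
A quick way to see this (a small sketch for the pyspark shell, where `spark` and `sc` already exist; it just compares the context's configuration before and after a runtime `set`):

before = sorted(sc.getConf().getAll())
spark.conf.set('spark.executor.memory', '8g')   # attempted override after startup
after = sorted(sc.getConf().getAll())
print(before == after)                          # True: the SparkContext configuration is unchanged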

What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:

import pyspark

conf = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'),
                                   ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
sc.stop()
sc = pyspark.SparkContext(conf=conf)

Then you can check yourself just like above with:

sc.getConf().getAll()

This should reflect the configuration you wanted.
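
If you want a `SparkSession` (as in the question) rather than a bare `SparkContext`, the same conf can be passed to the builder after stopping the running context; this is essentially what the comments below arrive at (a sketch, reusing `conf` from above):

from pyspark.sql import SparkSession

sc.stop()                                               # stop whatever context is currently running
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext                                 # reflects the conf passed to the builder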

Grr
  • In Spark 2.1.0/2.2.0 we can define `sc = pyspark.SparkContext` like this. There is no option to pass the parameter. – Harish Feb 03 '17 at 20:53
  • Are you saying it's not possible to pass it in? The docs still have it listed as an argument, see [here](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext). – Grr Feb 03 '17 at 20:56
  • See here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql. I am not clear what the entry point is now. – Harish Feb 03 '17 at 21:15
  • If you are referring to [this](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession.sparkContext) line, that refers to checking the existing Spark context object. So, for example, when you start pyspark the SparkContext already exists as `sc`. Typing `sc` is essentially equal to typing `SparkSession.sparkContext` and returns the current context object. My understanding is that you want to create a context with a different configuration. – Grr Feb 03 '17 at 21:19
  • I have made small changes and it worked. Thank you. `spark = SparkSession.builder.config(conf=conf1).getOrCreate()`; `sc = spark.sparkContext`, where conf1 is what you defined above (conf = **). – Harish Feb 06 '17 at 04:20
  • It is a bug, [refer](https://issues.apache.org/jira/browse/SPARK-19307) – Harish Apr 18 '17 at 01:49
  • I am finding that it ignores the SparkContext settings, PYTHONPATH for example. Any known issues with SparkSession not picking these up? I can see the settings there. – mathtick Feb 27 '18 at 20:19
  • @Grr can you please update the answer so it works with 2.3.1? – PolarBear10 Jul 30 '18 at 08:16
  • Can't we set up the config before Spark starts? – Mithril Mar 04 '19 at 01:43
  • @mithril Absolutely, and I would always do so. But the question was about changing the config after the fact. – Grr Mar 04 '19 at 06:49
44

Update the configuration in Spark 2.3.1

To change the default Spark configuration, you can follow these steps:

Import the required classes

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

Get the default configurations

spark.sparkContext._conf.getAll()

Update the default configurations

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()
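
To confirm that the recreated session actually carries the new values, read them back from the context (a quick check using the values from the `setAll` call above; note the comment below that `spark.driver.memory` is not effectively changed this way for an already-running driver):

print(spark.sparkContext.getConf().get('spark.executor.memory'))   # '4g'
print(spark.sparkContext.getConf().get('spark.app.name'))          # 'Spark Updated Conf'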
bob
  • Also works with 2.2.0. Thanks for providing this answer. – dmn Feb 19 '19 at 16:47
  • I have used `spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ...])` without recreating the session (meaning I did not use your last two steps). When I look at the config afterwards (`spark.sparkContext._conf.getAll()`) I also see the set parameters. However, I'm not sure whether I actually overwrote anything or just saved the config within the `sparkConf` object. – Markus Jun 07 '19 at 08:59
  • @Markus: you can check the configurations in the Spark UI. – bob Jun 07 '19 at 09:46
  • You are using the variable 'spark' in conf and then using the 'conf' variable in spark, lol. How can I change the Spark configuration once I have started the session?? – imran Feb 04 '20 at 16:03
  • @imran: you need to stop the session and start it again. In the code, the spark variable is reassigned to a new Spark session using the updated conf. – bob Feb 05 '20 at 05:19
  • Could anyone make this work when launching spark-submit in cluster mode? I am only able to stop and restart the sparkContext when running in client mode... – Javier de la Iglesia May 25 '20 at 06:29
  • @Markus, you overwrote an entry in the spark.sparkContext._conf object, however that did not affect the real properties of your spark object. The real properties of your SparkSession object are the ones you pass to the object's constructor. – Michał Jabłoński Jul 07 '20 at 14:17
  • This will **not** update `spark.driver.memory`. See this answer for further details: https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1/62799033#62799033 – Michał Jabłoński Jul 08 '20 at 16:20
  • @bob After reconfiguring my spark `conf` as you suggested, when I execute the SparkSession builder, I get an error saying `Promise already completed`. Any idea how to overcome that? – thentangler Sep 26 '20 at 18:15
  • Doesn't work for me: it just broke my PySpark session and every cell now returns an `AttributeError: 'NoneType' object has no attribute 'setJobGroup'` error :( – Mikhail_Sam May 31 '21 at 10:56
4

Setting 'spark.driver.host' to 'localhost' in the config works for me:

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
Vivek
4

You could also set the configuration when you start pyspark, just as with spark-submit:

pyspark --conf property=value

Here is one example

-bash-4.2$ pyspark
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'true'
>>> exit()


-bash-4.2$ pyspark --conf spark.eventLog.enabled=false
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'false'
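
The same approach covers the resource settings from the question; `--conf` can be repeated for each property (a sketch, with the question's values used as examples):

pyspark --conf spark.executor.memory=8g --conf spark.executor.cores=3 --conf spark.cores.max=3 --conf spark.driver.memory=8g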
user3282611
1

I had a different requirement: I had to check whether executor and driver memory size parameters were being passed in, and if so, replace the config with changes to only the executor and driver settings. Below are the steps:

  1. Import the libraries
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
  2. Define Spark and get the default configuration
spark = (SparkSession.builder
        .master("yarn")
        .appName("experiment") 
        .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
        .getOrCreate())

conf = spark.sparkContext._conf.getAll()
  3. Check whether the executor and driver sizes exist (I give pseudo code with one conditional check here; you can create the remaining cases yourself), then use the configuration given in the params, or fall back to the default configuration. Here `executor_mem` and `driver_mem` are the memory values passed in as parameters (see the sketch after the code below).
if executor_mem is not None and driver_mem  is not None:
    conf = spark.sparkContext._conf.setAll([('spark.executor.memory',executor_mem),('spark.driver.memory',driver_mem)])
    spark.sparkContext.stop()
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
else:
    spark = spark
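
In the snippet above, `executor_mem` and `driver_mem` are assumed to come from your job's parameters; one hypothetical way to supply them is via command-line arguments:

import sys

# Hypothetical parameter source: e.g. `spark-submit job.py 8g 8g`;
# missing arguments fall back to None so the default configuration is kept.
executor_mem = sys.argv[1] if len(sys.argv) > 1 else None
driver_mem = sys.argv[2] if len(sys.argv) > 2 else None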

Don't forget to stop the Spark context; this makes sure the executor and driver memory sizes change to the values you passed in as params. Hope this helps!

Hari_pb
0

I know this is a little old post and already has an accepted answer, but I just wanted to post working code for the same.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()

default_conf = spark.sparkContext._conf.getAll()
print(default_conf)

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.app.name', 'Spark Updated Conf'),
                                        ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'),
                                        ('spark.driver.memory','4g')])

spark.sparkContext.stop()

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config(conf=conf) \
    .getOrCreate()


default_conf = spark.sparkContext._conf.get("spark.cores.max")
print("updated configs", default_conf)
Suman Banerjee