79

I have been using PySpark with IPython lately on my server with 24 CPUs and 32 GB RAM. It runs on only one machine. In my process, I want to collect a huge amount of data, as shown in the code below:

train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                      .filter(lambda x: x[-1] != [])
                      .flatMap(lambda (x, text, tags): [(tag, (x, text)) for tag in tags])
                      .groupByKey()
                      .mapValues(list))

When I do

training_data = train_dataRDD.collectAsMap()

It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses the connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.

It looks like the heap space is too small. How can I set it to a bigger limit?

EDIT:

Things that I tried before running:

sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')

I changed the Spark options as per the documentation here (if you do Ctrl+F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

It says that I can avoid OOMs by setting the spark.executor.memory option. I did that, but it does not seem to be working.
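
For reference, my understanding is that the configuration has to be built before the SparkContext is created; calling set() on sc._conf once the driver JVM is already up cannot change its heap. A rough sketch of that pattern (untested on my setup, and per the docs spark.driver.memory in client mode may still need to go through the --driver-memory flag or the properties file rather than SparkConf):

from pyspark import SparkConf, SparkContext

# Build the configuration before creating the context; the driver JVM
# is launched with whatever heap size is known at that point.
conf = (SparkConf()
        .setMaster("local[*]")
        .set("spark.driver.memory", "32g"))

sc = SparkContext(conf=conf)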

pg2455
  • Check this question http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space – Bruno Caceiro Sep 01 '15 at 16:52
  • @bcaceiro: I see a lot of Spark options being set in the post. I don't use Scala; I am using IPython. Do you know if I can set those options from within the shell? – pg2455 Sep 01 '15 at 17:12
  • @bcaceiro: Updated the question with the suggestion from the post that you directed me to. It seems like there is some problem with the JVM. – pg2455 Sep 01 '15 at 18:59

4 Answers

93

After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
# Uncomment spark.driver.memory and change it according to your use case. I changed it to the line below.
spark.driver.memory 15g
# press Esc, then type :wq! to save and exit vim

Close your existing Spark application and rerun it. You will not encounter this error again. :)
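
If you want to confirm that the new value was actually picked up after relaunching, a quick check from the PySpark shell should do (a small sketch, assuming the usual sc context variable):

# Should print '15g' once the relaunched driver has read spark-defaults.conf
print(sc.getConf().get("spark.driver.memory"))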

pg2455
  • 4
    Can you change this conf value from the actual script (i.e. `set('spark.driver.memory','15g')`)? – swdev Dec 21 '15 at 07:31
  • 1
    I tried doing it but was not successful. I think it needs to be restarted with the new global parameters. – pg2455 Dec 21 '15 at 15:48
  • 14
    From docs: spark.driver.memory "Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g). Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file." – Răzvan Flavius Panda May 27 '16 at 15:15
  • I was running the Spark code using SBT run from IDEA SBT Console, the fix for me was to add `-Xmx4096M -d64` to the java VM parameters that get passed on the SBT Console launch. This is under `Other settings` -> `SBT`. – Răzvan Flavius Panda May 31 '16 at 18:24
  • Spark keeps evolving. So you might have to look into its documentation and find out the configuration parameters that correlate to the memory allocation. – pg2455 Nov 30 '16 at 20:19
  • I had to create the `$SPARK_HOME/conf/spark-defaults.conf` file but it worked either way. Also, I did not need to restart Spark or anything, just relaunched my python application and the setting was immediately applied. – Manu CJ Jun 15 '23 at 08:35
53

If you're looking for a way to set this from within the script or a Jupyter notebook, you can do:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
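
One caveat, as far as I can tell: this only takes effect if it runs before any driver JVM exists in the kernel, i.e. before the first SparkSession or SparkContext has been created; getOrCreate() on an already-running session keeps the old heap, so a kernel restart may be needed. To check the effective value (assuming the spark variable from above):

# Expect '15g'; if an existing session was reused, the old value shows up instead.
print(spark.sparkContext.getConf().get("spark.driver.memory"))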
louis_guitton
2

I had the same problem with pyspark (installed with brew). In my case it was installed at the path /usr/local/Cellar/apache-spark.

The only configuration file I had was apache-spark/2.4.0/libexec/python/test_coverage/conf/spark-defaults.conf.

As suggested here, I created the file spark-defaults.conf at the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.

roschach
2

I got the same error, and I just assigned memory to Spark while creating the session:

spark = SparkSession.builder.master("local[10]").config("spark.driver.memory", "10g").getOrCreate()

or

SparkSession.builder.appName('test').config("spark.driver.memory", "10g").getOrCreate()
Prakhar Gupta