50

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so that should not be a problem since my file is only 300MB. However, when I try to convert the Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried to fix this by changing the spark-config file, but I am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
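For reference, a minimal sketch of the flow described above (the file path, delimiter, and column names are placeholders, not taken from the question):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# hypothetical ~300MB input file
rdd = sc.textFile("data.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, ["col1", "col2"])

# toPandas() collects every row to the driver, which is what trips
# spark.driver.maxResultSize
pdf = df.toPandas()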

ahajib
  • 12,838
  • 29
  • 79
  • 120

7 Answers

58

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create new config
conf = (SparkConf()
    .set("spark.driver.maxResultSize", "2g"))

# Create new context
sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
zero323
  • 322,348
  • 103
  • 959
  • 935
28

From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
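For example, when launching the shell or submitting an application (the script name below is only a placeholder):

pyspark --conf spark.driver.maxResultSize=3g

spark-submit --conf spark.driver.maxResultSize=3g my_app.py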

Dolan Antenucci
  • 15,432
  • 17
  • 74
  • 100
  • this works best for me as in-session SparkContext restart requires authentication again – alisa Aug 02 '19 at 22:28
12

Tuning spark.driver.maxResultSize is a good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change from time to time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:

  • Do not turn on spark.sql.parquet.binaryAsString; string objects take more space.
  • Use spark.rdd.compress to compress RDDs when you collect them. (A PySpark sketch of these two settings follows the Scala snippet below.)
  • Try to collect it using pagination. (code in Scala, from another answer Scala: How to get a range of rows in a dataframe)

    var remaining = df            // df must be reassignable, so work on a copy
    var count = remaining.count()
    val limit = 50
    while (count > 0) {
      val df1 = remaining.limit(limit)
      df1.show()                  // prints the next `limit` rows: 50, then the next 50, and so on
      remaining = remaining.except(df1)
      count = count - limit
    }
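A rough PySpark sketch of the first two bullets, assuming the settings are applied when the context is created (this is not part of the original answer):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# compress serialized RDD partitions (saves memory at the cost of extra CPU)
conf = SparkConf().set("spark.rdd.compress", "true")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)
# keep Parquet binary columns as binary instead of strings (this is the default)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "false")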

Community
  • 1
  • 1
Iraj Hedayati
  • 1,478
  • 17
  • 23
9

It looks like you are collecting the RDD, so all of the data will be pulled to the driver node, which is why you are facing this issue. Avoid collecting the data if it is not required; if it is necessary, then increase spark.driver.maxResultSize. There are two ways of defining this variable:

1 - Create the Spark config by setting this variable in code:
conf.set("spark.driver.maxResultSize", "3g")
2 - Or set this variable in the spark-defaults.conf file in Spark's conf folder, e.g. spark.driver.maxResultSize 3g, and restart Spark.

Zia Kiyani
  • 812
  • 5
  • 21
  • I did set the variable in the config file and restarted the spark but still getting the same error. – ahajib Jun 25 '15 at 20:52
  • It worked for me, but that should be temporary solution like you mentioned ;) thank you any way – imanis_tn Sep 08 '15 at 12:25
  • The first one won't work in `client mode` because the JVM would have been started already. – makansij Jul 15 '16 at 04:45
  • It works bro, Please test first then rate the answer down. See the reply to this question that is marked as answer. Setting the property in code. – Zia Kiyani Jul 18 '16 at 11:24
3

When starting the job or the shell, you can use

--conf spark.driver.maxResultSize="0"

to remove the limit entirely (0 means unlimited). Keep in mind that with no limit, a very large collect can still run the driver out of memory.

Mike
  • 197
  • 1
  • 2
  • 15
2

There is also a Spark bug https://issues.apache.org/jira/browse/SPARK-12837 that gives the same error

 serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize

even though you may not be pulling data to the driver explicitly.

SPARK-12837 addresses a Spark bug in which, prior to Spark 2, accumulators and broadcast variables were pulled to the driver unnecessarily, causing this problem.

Tagar
  • 13,911
  • 6
  • 95
  • 110
0

You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:

pyspark  --conf "spark.driver.maxResultSize=2g"

This allows 2 GB for spark.driver.maxResultSize.

korahtm
  • 1
  • 1