I've noticed strange behavior when running a PySpark application with Spark 2.0. In the first step of my script that involves a reduceByKey (and thus a shuffle), I observe that the amount of shuffle write is roughly in line with my expectations, but that much more data is spilled than I had expected. I tried to avoid these spills by increasing the memory assigned per executor to up to 8x the original amount, but I see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the Executors tab of the Spark web UI).
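
For reference, the stage in question is a plain keyed aggregation along the lines of the sketch below (the input path and key extraction are placeholders, not my actual code):

    # Rough sketch of the stage: the reduceByKey forces the shuffle
    # in which the spills show up (path and parsing are hypothetical).
    pairs = sc.textFile("hdfs:///data/input") \
              .map(lambda line: (line.split(",")[0], 1))

    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.count()  # action that triggers the shuffle stage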

I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application. Since that did not help, I now suspect that some hard limit is forcing the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among the HDFS parameters? Otherwise, what could be done to avoid the spills besides increasing executor memory?
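
Concretely, the kind of configuration I have been tuning looks roughly like this (the values are illustrative, not my exact settings):

    from pyspark import SparkConf, SparkContext

    # Illustrative settings varied while observing the spills:
    conf = (SparkConf()
            .setAppName("reduce-by-key-job")               # hypothetical app name
            .set("spark.executor.memory", "8g")            # up to 8x the original amount
            .set("spark.shuffle.memoryFraction", "0.5"))   # parameter from the linked question

    sc = SparkContext(conf=conf)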

Many thanks, R

[Screenshots attached: tasks view inside the reduce job showing spills; Executors tab showing low memory use; job DAG]

1 Answer

Spilling behavior in PySpark is controlled using spark.python.worker.memory:

Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.

which is by default set to 512m. Moreover, PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native JVM counterpart.
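
For example (the values below are only an illustration, not a recommendation), the setting can be raised when the SparkContext is created, or passed to spark-submit via --conf spark.python.worker.memory=2g:

    from pyspark import SparkConf, SparkContext

    # Illustrative values: give each Python worker more room for aggregation
    # before it spills to disk. Note that this memory is separate from the
    # JVM executor heap controlled by spark.executor.memory.
    conf = (SparkConf()
            .setAppName("pyspark-bigger-python-worker-memory")
            .set("spark.python.worker.memory", "2g")
            .set("spark.executor.memory", "4g"))

    sc = SparkContext(conf=conf)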
