I've noticed strange behavior when running a pyspark application with spark 2.0. In the first step in my script involving a reduceByKey (and thus shuffle) operation, I observe that the amount the shuffle writes is roughly in line with my expectations, but that much more spills occur than I had expected. I tried to avoid these spills by increasing the amount of memory assigned per executor up to 8x the original amount, but see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the executors tab in the spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application . This leads me to believe that some hard limit is leading to the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R