
I'm running a Spark job on EMR with Spark 1.6, and as shown below there is enough memory available on the executors.

[Screenshot: Spark UI Storage tab showing available executor memory]

Even though there is quite a lot of memory available, I see the shuffle spilling to disk as shown below. What I'm attempting is a join of three datasets using the DataFrame API.
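For context, a three-way join in the Spark 1.6 DataFrame API looks roughly like this (a minimal sketch; the DataFrame names `df1`/`df2`/`df3` and the join columns `id`/`key` are placeholders, not the actual datasets from the job):

```scala
// Hypothetical sketch of the kind of three-way DataFrame join described
// in the question. Requires a running SparkContext/SQLContext.
val joined = df1
  .join(df2, df1("id") === df2("id"))    // inner join by default
  .join(df3, df1("key") === df3("key"))

// Each join triggers a shuffle: rows are repartitioned by the join key
// across executors, and it is this shuffle phase that can spill to disk.
joined.count()
```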

[Screenshot: task metrics showing shuffle spill to disk]

I did look at the documentation and also experimented with `spark.memory.fraction` and `spark.memory.storageFraction`, but that does not seem to help.
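For reference, the unified-memory settings mentioned above can be passed at submit time like this (the values and the jar name are illustrative placeholders, not tuned recommendations):

```shell
# Illustrative spark-submit flags for Spark 1.6's unified memory manager.
# 0.75 and 0.5 are the Spark 1.6 defaults, shown here only as placeholders.
spark-submit \
  --conf spark.memory.fraction=0.75 \
  --conf spark.memory.storageFraction=0.5 \
  my_job.jar
```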

Any help would be greatly appreciated. Thanks.

Yash Krishnan
Deepak
  • In Spark, when there is a shuffle phase, the shuffle files (the output of the map phase) are written to disk only. Have a look at this [question](http://stackoverflow.com/questions/35479876/why-spark-map-phase-output-is-written-to-local-disk) – nagendra Mar 04 '16 at 13:13
  • @nagendra: That would be spot on if I were on Spark < 1.6. On Spark 1.6, configs such as **spark.shuffle.memoryFraction** are deprecated, and users are encouraged to use only **spark.memory.fraction** and **spark.memory.storageFraction** (http://spark.apache.org/docs/latest/configuration.html). I'm trying to understand how to solve this on Spark 1.6 without going into legacy mode. – Deepak Mar 05 '16 at 06:14
  • Check the different caching options for the RDDs. The default persist is `MEMORY_AND_DISK_SER`: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence – raschild Mar 06 '16 at 00:06
  • I have cached the RDDs. I don't think that is the issue here. – Deepak Mar 22 '16 at 05:19
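The explicit persist call the comments refer to would look like this in Spark 1.6 (a sketch; `df` is a placeholder for one of the joined DataFrames):

```scala
import org.apache.spark.storage.StorageLevel

// Choose a storage level explicitly instead of relying on the default.
// MEMORY_ONLY never writes cached partitions to disk; partitions that
// don't fit are simply recomputed when needed.
df.persist(StorageLevel.MEMORY_ONLY)
```

Note that this only controls where *cached* partitions live; shuffle files written during the map phase of a join go to local disk regardless of the storage level, as nagendra's comment points out.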

0 Answers