
I'm running a Spark job on EMR with Spark 1.6, and as shown below there is enough memory available on the executors.

[Screenshot: Spark UI Storage tab showing available executor memory]

Even though there is quite a lot of memory available, I see the shuffle spilling to disk as shown below. What I'm attempting is a join of three datasets using the DataFrame API.
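For context, a three-way join in the Spark 1.6 DataFrame API looks roughly like this (a minimal sketch; the DataFrame names `df1`/`df2`/`df3` and the join columns `id`/`key` are placeholders, not the actual datasets from the job):

```scala
// Hypothetical sketch of the kind of three-way DataFrame join described
// in the question. Requires a running SparkContext/SQLContext.
val joined = df1
  .join(df2, df1("id") === df2("id"))    // inner join by default
  .join(df3, df1("key") === df3("key"))

// Each join triggers a shuffle: rows are repartitioned by the join key
// across executors, and it is this shuffle phase that can spill to disk.
joined.count()
```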

[Screenshot: task metrics showing shuffle spill to disk]

I did look at the documentation and also experimented with `spark.memory.fraction` and `spark.memory.storageFraction`, but that does not seem to help.
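For reference, the unified-memory settings mentioned above can be passed at submit time like this (the values and the jar name are illustrative placeholders, not tuned recommendations):

```shell
# Illustrative spark-submit flags for Spark 1.6's unified memory manager.
# 0.75 and 0.5 are the Spark 1.6 defaults, shown here only as placeholders.
spark-submit \
  --conf spark.memory.fraction=0.75 \
  --conf spark.memory.storageFraction=0.5 \
  my_job.jar
```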

Any help would be greatly appreciated. Thanks.

Yash Krishnan
Deepak
  • In Spark, when there is a shuffle phase, the shuffle files (the output of the map phase) are written to disk only. Have a look at this [question](http://stackoverflow.com/questions/35479876/why-spark-map-phase-output-is-written-to-local-disk) – nagendra Mar 04 '16 at 13:13
  • @nagendra: That would be spot on if I were on Spark < 1.6. On Spark 1.6, configs such as **spark.shuffle.memoryFraction** are deprecated, and users are encouraged to use only **spark.memory.fraction** and **spark.memory.storageFraction** (http://spark.apache.org/docs/latest/configuration.html). I'm trying to understand how to solve this on Spark 1.6 without going into legacy mode. – Deepak Mar 05 '16 at 06:14
  • Check the different caching options for the RDDs. The default persist is `MEMORY_AND_DISK_SER`: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence – raschild Mar 06 '16 at 00:06
  • I have cached the RDDs. I don't think that is the issue here. – Deepak Mar 22 '16 at 05:19
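The explicit persist call the comments refer to would look like this in Spark 1.6 (a sketch; `df` is a placeholder for one of the joined DataFrames):

```scala
import org.apache.spark.storage.StorageLevel

// Choose a storage level explicitly instead of relying on the default.
// MEMORY_ONLY never writes cached partitions to disk; partitions that
// don't fit are simply recomputed when needed.
df.persist(StorageLevel.MEMORY_ONLY)
```

Note that this only controls where *cached* partitions live; shuffle files written during the map phase of a join go to local disk regardless of the storage level, as nagendra's comment points out.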

0 Answers