As per Spark

"Shuffle Write" is actually meant as the sum of all written serialized data on all executors before transmitting (normally at the end of a stage)

My question is: where does the shuffle write happen? Does Spark write all of the data to be shuffled to local disk only? Does it write all of it to RAM only? Or, depending on how much RAM is available, does it write part of the shuffle data to disk and part to RAM?

Please explain

Surender Raja
  • both, in-memory and disk. Here we go: https://de.slideshare.net/colorant/spark-shuffle-introduction - or if you're eager, in the original Spark paper Chapter 5: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf – UninformedUser Apr 17 '20 at 14:09
  • See my Bountied Answer and add an upvote. I could redo it here but that is not the spirit of SO. https://stackoverflow.com/questions/58699907/spark-disk-i-o-on-stage-boundaries-explanation – thebluephantom Apr 17 '20 at 14:53
  • Based on your answer, I conclude that map outputs are written to local disk only, even when the data that needs to be shuffled is small – Surender Raja Apr 17 '20 at 15:04
  • Indeed that is the paradigm. – thebluephantom Apr 18 '20 at 10:04

1 Answer

Spark caching keeps data in memory where it can, and with a storage level such as MEMORY_AND_DISK any data that does not fit in memory spills to disk. Shuffle data is the intermediate output of the map side. Spark buffers this output in memory while it is being written and spills to disk when there is not enough memory; as the comments on the question point out, the final map output is then written to files on the executor's local disk, even when the amount of data is small. Spark stores this data in a serialized format, which is the form in which it is later transferred to the reduce side.
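The buffer, spill, and final write steps can be sketched with a toy model in plain Python. This is not Spark's implementation: `toy_shuffle_write`, `read_partition`, the `max_buffered` limit, and the `shuffle.data` file name are all hypothetical names chosen for illustration. The sketch assumes the behaviour described in the comments on the question, namely that the map output always ends up in a file on local disk, with memory used only as a bounded buffer along the way.

```python
import os
import pickle
import tempfile


def toy_shuffle_write(records, num_partitions, max_buffered, out_dir):
    """Toy model of a map-side shuffle write (hypothetical, not Spark code).

    Returns (data_path, offsets, number_of_spills).
    """
    buffer = []       # bounded in-memory buffer of (partition_id, key, value)
    spill_files = []  # temporary files written when the buffer fills up

    def spill():
        # Serialize the full buffer to a temporary file and free the memory.
        fd, path = tempfile.mkstemp(dir=out_dir, suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(buffer, f)
        spill_files.append(path)
        buffer.clear()

    for key, value in records:
        # Hash partitioning: decides which reducer will read this record.
        buffer.append((hash(key) % num_partitions, key, value))
        if len(buffer) >= max_buffered:
            spill()

    # Gather spilled runs (oldest first) and whatever is still in memory.
    # A real implementation would stream the runs instead of loading them.
    by_partition = [[] for _ in range(num_partitions)]
    for path in spill_files:
        with open(path, "rb") as f:
            for pid, key, value in pickle.load(f):
                by_partition[pid].append((key, value))
        os.remove(path)
    for pid, key, value in buffer:
        by_partition[pid].append((key, value))

    # The final output is always one serialized file on local disk,
    # plus byte offsets marking where each partition's block starts.
    data_path = os.path.join(out_dir, "shuffle.data")
    offsets = [0]
    with open(data_path, "wb") as f:
        for part in by_partition:
            f.write(pickle.dumps(part))
            offsets.append(f.tell())
    return data_path, offsets, len(spill_files)


def read_partition(data_path, offsets, pid):
    """Reduce side: read and deserialize only one partition's block."""
    with open(data_path, "rb") as f:
        f.seek(offsets[pid])
        return pickle.loads(f.read(offsets[pid + 1] - offsets[pid]))
```

Calling `toy_shuffle_write` with a small `max_buffered` forces spills, while a large value keeps everything in the buffer until the end; in both cases the result is a file on disk, which is the point the comments make about small shuffles.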

code.gsoni