
Does Spark shuffle write all intermediate data to disk, or only that which will not fit in memory ("spill")?

In particular, if the intermediate data is small, will anything be written to disk, or will the shuffle be performed entirely using memory without writing anything to disk?

I've checked the docs and related StackOverflow questions, but they weren't clear on this precise question.

Denziloe

1 Answer


Short answer: yes, shuffle output is written to disk even when it is small. Memory management is better in Spark 3.0, which uses unified memory management.

Map Phase:

  • During the map phase, each map task writes its shuffle output to the executor's local disk instead of sending it directly to the reducers. This happens regardless of how small the output is.
  • With the sort-based shuffle, each map task produces one data file plus an index file; these files are spread across the directories configured in spark.local.dir to distribute the I/O load.
  • If a map task's in-memory buffer grows beyond the execution memory available to it, the task spills to disk in multiple spill files, which are then merged into the final shuffle output file.
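The map-side write can be sketched with a toy model in plain Python. This is not Spark's code: `partition_for`, `write_map_output`, `read_partition`, and the `map_<id>.data` file name are hypothetical names made up for illustration, and the real implementation adds serialization, compression, and spilling. It only shows that one data file plus a list of byte offsets is written to disk even when the data would easily fit in memory.

```python
import os
import pickle
import tempfile
import zlib


def partition_for(key, num_partitions):
    # Deterministic toy partitioner (the built-in hash() is salted for strings).
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def write_map_output(records, num_partitions, out_dir, map_id):
    """Toy map task: bucket records by reduce partition, then write ONE data
    file to local disk and return its path plus the partition byte offsets."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))

    data_path = os.path.join(out_dir, f"map_{map_id}.data")  # hypothetical name
    offsets = [0]
    with open(data_path, "wb") as f:
        for bucket in buckets:
            f.write(pickle.dumps(bucket))
            offsets.append(f.tell())  # end of this partition's byte range
    return data_path, offsets


def read_partition(data_path, offsets, partition_id):
    """Toy reduce-side read: seek to one partition's byte range and load it."""
    start, end = offsets[partition_id], offsets[partition_id + 1]
    with open(data_path, "rb") as f:
        f.seek(start)
        return pickle.loads(f.read(end - start))
```

Calling `write_map_output([("a", 1), ("b", 2)], 2, some_dir, map_id=0)` leaves a non-empty file in `some_dir` even though the input is tiny, which mirrors the point of the answer.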

Reduce Phase:

  • The reduce tasks fetch their shuffle blocks from the local disks of the executors that ran the map tasks (over the network when those executors are remote), bringing them into memory for processing.
  • The reduce tasks operate on the merged data, performing the necessary computations.
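The reduce side can be sketched in the same toy style. In this sketch, which is an assumption-laden illustration rather than Spark's implementation, in-memory dictionaries stand in for the shuffle files on the map side, and `reduce_task` is a hypothetical name. Each reduce task pulls its own block from every map output and then merges the records by key, here by summing values.

```python
from collections import defaultdict


def reduce_task(partition_id, map_outputs):
    """Toy reduce task.

    map_outputs: one dict per map task, mapping partition id -> list of
    (key, value) records. These dicts stand in for on-disk shuffle files.
    """
    totals = defaultdict(int)
    for blocks in map_outputs:  # one entry per map task
        # "Fetch" this reducer's block; a map task may have none for it.
        for key, value in blocks.get(partition_id, []):
            totals[key] += value  # merge step: aggregate by key
    return dict(totals)
```

For example, with two map outputs `{0: [("a", 1)], 1: [("b", 2)]}` and `{0: [("a", 3)]}`, reducer 0 sees both `"a"` records and combines them, while reducer 1 sees only `"b"`.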

Indrajit Swain