Shuffling in Spark works (as per my understanding) like this:
- Identify the partition each record has to go to (hash the key, then take it modulo the number of partitions)
- Serialize the data that needs to go to the same partition
- Transmit the data over the network
- The data gets deserialized and read by the executors on the other end
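The first step above can be sketched in plain Python (no Spark needed); the function name and sample records are my own, but the hash-then-modulo scheme mirrors what Spark's default `HashPartitioner` does:

```python
def target_partition(key, num_partitions):
    # Hash the key, then take it modulo the partition count.
    # Python's % already yields a non-negative result for a
    # positive modulus, so every key lands in [0, num_partitions).
    return hash(key) % num_partitions

# Group some sample key-value records into shuffle buckets,
# one bucket per target partition.
records = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]
num_partitions = 4

buckets = {}
for key, value in records:
    buckets.setdefault(target_partition(key, num_partitions), []).append((key, value))
```

Note that records with the same key (both `"apple"` tuples here) always map to the same partition, which is what makes key-based operations like `groupByKey` and `reduceByKey` work after a shuffle.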
I have a question about this:
- How is the data transmitted between the executors, even when there is enough space available in memory? Suppose each executor has 50 GiB of execution memory and the entire dataset to be shuffled is only 100 MB. Does the transfer go directly from storage memory on executor 1 to storage memory on executor 2, or are there disk writes involved as intermediate steps?