As per the documentation:
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
My understanding of shuffle is this:
- Every executor takes all the partitions it holds and hash-partitions them into 200 new partitions (this 200 can be changed). Each new partition is associated with the executor it will later be sent to. For example, for each existing partition:

  new_partition = hash(partitioning_id) % 200
  target_executor = new_partition % num_executors

  where % is the modulo operator and num_executors is the number of executors in the cluster.
- These new partitions are dumped onto the disk of the nodes hosting their initial executors. Each new partition will later be read by its target_executor.
- Target executors pick up their respective new partitions (out of the 200 generated)
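To make sure I am describing the scheme I have in mind unambiguously, here is a tiny Python sketch of the mapping above (the function name `assign` and the defaults `num_partitions=200`, `num_executors=4` are mine, and Python's built-in `hash` merely stands in for whatever hash Spark's partitioner actually uses):

```python
from collections import defaultdict

def assign(partitioning_id, num_partitions=200, num_executors=4):
    # The two-step mapping from my description:
    # key -> one of the 200 new partitions -> the executor that will fetch it.
    new_partition = hash(partitioning_id) % num_partitions
    target_executor = new_partition % num_executors
    return new_partition, target_executor

# Group some example keys by the executor that would later fetch them
# (integer keys, since hash(int) is stable in Python).
by_executor = defaultdict(list)
for key in [5, 203, 417]:
    part, executor = assign(key)
    by_executor[executor].append((key, part))
```

This is only meant to pin down the mechanism I am asking about, not to claim Spark implements it exactly this way.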
Is my understanding of the shuffle operation correct?
Can you help me place the definitions of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism described above (assuming it is correct)? For example, perhaps: "shuffle spill (disk) is the part happening in point 2 above, where the 200 partitions are dumped to the disk of their respective nodes." (I do not know whether that statement is correct; it is just an example of the kind of answer I am looking for.)