
Spark materializes its results on disk after a shuffle. While running an experiment, I saw that a Spark task read 65 MB of materialized data in 1 ms (some tasks were even shown to read it in 0 ms :)). My question is: how can Spark read data from an HDD so fast? Is it actually reading this data from a file, or from memory?

The answer by @zero323 on this Stack Overflow post states: "To disk are written shuffle files. It doesn't mean that data after the shuffle is not kept in memory." But I couldn't find any official Spark source that says Spark keeps shuffle output in memory and prefers it when the next task reads.

So: does a Spark task read shuffle output from disk or from memory? (If from memory, I would be thankful if someone could point to an official source.)

thebluephantom
AvinashK

1 Answer


Spark shuffle outputs are written to disk. You can find this in the Spark documentation, under the "Performance Impact" section of the shuffle operations discussion:

  • Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected.

  • This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently.

  • This means that long-running Spark jobs may consume a large amount of disk space.
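One plausible explanation for the timings you observed (this is an OS-level effect, not something stated in the Spark docs) is the operating system's page cache: shuffle files are written to disk paths, but a file that was just written is usually still resident in memory, so the next read is served from RAM without touching the physical disk. A minimal stdlib sketch, with no Spark involved and an illustrative 64 MB size:

```python
import os
import tempfile
import time

# Write ~64 MB to a real file on disk, then read it back immediately.
# On most operating systems the read is served from the page cache,
# which is how a "disk" read of this size can finish in milliseconds.
data = os.urandom(64 * 1024 * 1024)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

start = time.perf_counter()
with open(path, "rb") as f:
    read_back = f.read()
elapsed_ms = (time.perf_counter() - start) * 1000

assert read_back == data
print(f"read {len(read_back) // (1024 * 1024)} MB in {elapsed_ms:.1f} ms")
os.unlink(path)
```

So "written to disk" and "read from memory" are not contradictory here: the data is durably on disk, but the read path may never reach the device.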

bob
  • I am not contesting that Spark writes its stage output to files. My question is whether the next stage reads from these files or directly from the data in memory. – AvinashK Oct 29 '19 at 08:01