Spark materializes its results on disk after a shuffle. While running an experiment, I saw a Spark task read 65 MB of materialized shuffle data in 1 ms (some tasks were even shown as reading it in 0 ms :)). My question is: how can Spark read data from an HDD so fast? Is it actually reading this data from a file, or from memory?
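To put the numbers in perspective, here is a rough back-of-the-envelope check of the throughput those figures would imply. The 65 MB and 1 ms values come from my experiment; the 150 MB/s sequential HDD rate is only an assumed ballpark for a spinning disk, not something I measured.

```python
# Back-of-the-envelope: what throughput would "65 MB in 1 ms" imply?
size_mb = 65        # shuffle data read by the task (from the experiment)
read_time_ms = 1    # read time reported for the task (from the experiment)

# Convert to MB per second: 1000 ms in a second.
implied_mb_per_s = size_mb * 1000 / read_time_ms

# ASSUMPTION: ballpark sequential read rate of a spinning disk, not measured.
assumed_hdd_mb_per_s = 150

ratio = implied_mb_per_s / assumed_hdd_mb_per_s

print(f"implied throughput: {implied_mb_per_s:,.0f} MB/s")
print(f"ratio to assumed HDD rate: {ratio:,.0f}x")
```

Under that assumption, the reported read would be a few hundred times faster than the disk could deliver, which is why I doubt the data is physically coming off the HDD at read time.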
The answer by @zero323 on this Stack Overflow post states: "To disk are written shuffle files. It doesn't mean that data after the shuffle is not kept in memory."
But I couldn't find any official Spark source saying that Spark keeps shuffle output in memory and that the next task prefers to read it from there.
Is the Spark task reading shuffle output from disk or from memory? If from memory, I would be thankful if someone could point to an official source.