I know that
The output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of each individual node running a map task. This is typically a temporary directory which can be set up in the configuration by the Hadoop administrator. Once the map task completes and the data has been transferred to the Reducers, this intermediate data is cleaned up and is no longer accessible.
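For reference, this is the kind of setting I mean (a minimal sketch; I believe the property is `mapreduce.cluster.local.dir` in Hadoop 2+, previously `mapred.local.dir`, but please correct me if that's the wrong knob):

```scala
import org.apache.hadoop.conf.Configuration

object LocalDirCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Directory (or comma-separated list of directories) on the local file
    // system where intermediate map output is written before the shuffle.
    println(conf.get("mapreduce.cluster.local.dir", "<not set>"))
  }
}
```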
But I wanted to know: when exactly does a mapper store its output to its local hard disk? Is it only because the data is too large to fit in memory, so that just the chunk currently being processed stays in memory? And if the data is small enough to fit entirely in memory, is there no disk involvement at all?
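From what I have read, the map output first goes into an in-memory sort buffer and is spilled to local disk only when that buffer fills past a threshold. If that is right, the relevant knobs would be something like the following (a sketch, assuming the Hadoop 2+ property names; the values are made up for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object SpillTuning {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Size (MB) of the in-memory buffer that holds map output before it is
    // sorted and spilled to local disk (default 100, if I understand correctly).
    conf.set("mapreduce.task.io.sort.mb", "256")
    // Fraction of the buffer that may fill before a background spill starts.
    conf.set("mapreduce.task.io.sort.spill.percent", "0.90")
    val job = Job.getInstance(conf, "spill-tuning-example")
    // ... set mapper/reducer classes and input/output paths as usual ...
  }
}
```

So is it correct that with a small enough map output, everything stays in this buffer until the task ends, and the disk is never touched?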
Can we not move the data directly from the mapper to the reducer, without involving the hard disk of the mapper machine? I mean: as the data is processed in the mapper it is already in memory, so once a chunk is computed, could it be transferred straight to the reducer while the mapper moves on to the next chunk, with no disk involvement?
In Spark, it is said that there is in-memory computation. How is that different from the above? What makes Spark's in-memory computation better than MapReduce? Also, wouldn't Spark have to involve the disk as well, if the data is too huge?
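To make the Spark part of my question concrete, this is the kind of code I have in mind (a minimal sketch; the input path is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("cache-demo")
      .master("local[*]")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Keep the result in memory across actions; partitions that do not fit
    // are spilled to disk rather than recomputed (as I understand it).
    counts.persist(StorageLevel.MEMORY_AND_DISK)

    counts.count() // first action: computes the lineage and caches the result
    counts.take(5) // second action: served from the cache, no recomputation

    spark.stop()
  }
}
```

Is the in-memory advantage here mainly about reusing `counts` across actions without re-reading from disk, and does Spark's shuffle between `map` and `reduceByKey` still write to local disk the way MapReduce does?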
Please explain.