1. The default RDD eviction strategy is LRU. When memory is not sufficient for RDD caching, some partitions are evicted; if those partitions are needed again later, they are recomputed from their lineage information and cached in memory again. Cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
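Here's a minimal sketch of that behavior (the input path `data.txt`, the RDD names, and the transformations are made up for illustration). With MEMORY_ONLY, partitions that don't fit are evicted rather than spilled, and Spark replays the lineage to rebuild them on a later action:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("lru-cache-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// The chain of transformations below is the RDD's lineage;
// Spark can replay it to rebuild any evicted partition.
val base   = sc.textFile("data.txt")                     // hypothetical input file
val parsed = base.map(_.split(",")).filter(_.length > 1)

// MEMORY_ONLY: partitions that don't fit are evicted (LRU) rather
// than spilled, and recomputed from lineage when needed again.
parsed.persist(StorageLevel.MEMORY_ONLY)

parsed.count() // first action: computes the RDD and caches what fits
parsed.count() // later action: reuses cached partitions, rebuilds evicted ones
```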
2. I haven't found anything documented about the relationship between LRU eviction and the RDD StorageLevel. However, you can use a different StorageLevel to cache data that doesn't fit into memory. Among the available levels, MEMORY_AND_DISK_SER can help cut down on GC pressure and avoid expensive recomputation.
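For example (reusing the hypothetical `parsed` RDD from the sketch above), switching to MEMORY_AND_DISK_SER stores partitions as serialized bytes and spills the overflow to disk:

```scala
import org.apache.spark.storage.StorageLevel

// A storage level can't be changed while an RDD is persisted,
// so drop the old level first.
parsed.unpersist()

// Serialized storage keeps fewer, larger objects on the heap (less GC),
// and partitions that don't fit in memory are spilled to disk and read
// back later instead of being recomputed from lineage.
parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)
parsed.count()
```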
3. I don't think there will be any issue running Spark on data that is larger than the sum of all executor memory in the cluster. Many operations can stream data through, so memory usage is independent of input data size. In the few cases where a job fails because an individual partition is too large to fit in memory, the usual approach is to repartition into more partitions so that each one is smaller and, hopefully, fits.
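A sketch of that repartitioning approach, again using the hypothetical `parsed` RDD (the factor of 4 is an arbitrary starting point; tune it per job):

```scala
import org.apache.spark.storage.StorageLevel

// More partitions means smaller partitions, so each one is likelier
// to fit in memory on its own.
val finer = parsed.repartition(parsed.getNumPartitions * 4)
finer.persist(StorageLevel.MEMORY_AND_DISK)
finer.count()
```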