13

Say I have three RDD transformations called on rdd1:

val rdd2 = rdd1.f1
val rdd3 = rdd2.f2
val rdd4 = rdd3.f3

Now I want to cache rdd4, so I call rdd4.cache().

My question:

Will only the result of the action on rdd4 be cached, or will every RDD above rdd4 be cached as well? If I want to cache both rdd3 and rdd4, do I need to cache them separately?

gsamaras
EdwinGuo

1 Answer

21

The whole idea of cache() is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, only that RDD's results are kept in memory. So yes, you do need to cache them separately. Keep in mind, though, that you only need to cache an RDD if you are going to use it more than once, for example:

rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")

If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made regarding how RDDs work.
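To make that concrete, here is a minimal sketch (assuming a spark-shell session where sc is the SparkContext, with made-up transformations standing in for f1, f2, f3): caching rdd3 and rdd4 stores each of their partitions separately, while rdd1 and rdd2 are still recomputed if you evaluate them again. Note that cache() is lazy: nothing is actually stored until the first action runs.

val rdd1 = sc.parallelize(Seq(("key1", 1), ("key2", 2)))  // pair RDD so lookup works
val rdd3 = rdd1.mapValues(_ + 1).filter(_._2 > 0)         // stands in for rdd1.f1.f2
val rdd4 = rdd3.mapValues(_ * 10)                         // stands in for rdd3.f3

rdd3.cache()                  // marks only rdd3 for caching
rdd4.cache()                  // marks only rdd4 for caching

rdd4.count()                  // first action materializes and caches both rdd3 and rdd4
val v1 = rdd4.lookup("key1")  // served from the cached rdd4, nothing is recomputed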

michalis
aaronman
  • Appreciate your answer. So whenever there is a fork, you need to cache that RDD to reduce repeated computation. The only pain is having to unpersist the cached RDDs (since I have multiple forks in my RDD transformations). I will read the paper again. Thanks – EdwinGuo Sep 02 '14 at 16:35
  • @EdwinGuo don't quote me on this, but I think most people find that taking the extra time to unpersist is usually more trouble than it's worth; it's better to let the JVM handle this, as unpersisting is a very expensive operation – aaronman Sep 02 '14 at 16:39
  • ok, should I open up another question regarding that? I tried searching on unpersist, no luck. "Mark the RDD as non-persistent, and remove all blocks for it from memory and disk." from the GitHub docs does not say much – EdwinGuo Sep 02 '14 at 16:57
  • @EdwinGuo if you need to, I would search the [spark user group](http://apache-spark-user-list.1001560.n3.nabble.com) before asking – aaronman Sep 02 '14 at 17:16
  • Another approach to caching I heard about is that more recent versions of Spark may support automatic unpersisting using user-defined priorities, e.g. FIFO – samthebest Sep 03 '14 at 04:36
  • "Removing Data Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method." didn't mention too much about the cost of unpersist. Do you think use off-heap would be a better solution to cache rdd, like TackYon? – EdwinGuo Sep 05 '14 at 01:49
  • @EdwinGuo I don't have any experience with Tachyon though it does seem like an interesting technology – aaronman Sep 05 '14 at 01:53
  • The paper link is down; also, how much time before .cache expires by itself? – Raul H Nov 09 '16 at 22:30
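For reference, a minimal sketch of the manual eviction discussed in these comments (by default Spark evicts cached partitions on its own in LRU order, so an explicit unpersist is only worth it when you know an RDD will not be needed again and memory is tight); rdd3 and the action below are placeholders:

rdd3.cache()
val counts = rdd3.countByKey()    // some action that reuses the cached rdd3

rdd3.unpersist(blocking = false)  // drop the cached blocks asynchronously;
                                  // pass blocking = true to wait until they are removed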