
Let's say I have the following:

 val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) 
 val dataset3 = dataset2.map(.....)

If I do a transformation on dataset2, do I then have to persist the result, pass it on to dataset3, and unpersist the previous RDD, or not?

I am trying to figure out when to persist and unpersist RDDs. Do I have to persist every new RDD that is created?

Thanks

Nick
  • Refer to http://stackoverflow.com/questions/28981359/why-do-we-need-to-call-cache-or-persist-on-a-rdd and http://stackoverflow.com/questions/29903675/understanding-sparks-caching – Elena Viter Nov 22 '15 at 20:44
  • Not helpful... I need an answer specifically for this case. Thank you. – Nick Nov 22 '15 at 20:48
  • 4
    The RDD lineage is a graph of linked RDD objects; each node is aware of its dependencies. Caching breaks the lineage: the cached RDD stores its content, and all dependent RDDs further down the lineage tree can reuse that cached data. In your case there's no effect at all (linear lineage) - every node will be visited only once. All lazy operations (map in your case), including the persist operation, will be evaluated only at the materialization step. You need persist when you have a "tree-like" lineage or run operations on your RDD in a loop - to avoid re-evaluating the RDD (see the sketch after these comments) – Elena Viter Nov 22 '15 at 20:58
  • So if I do this: val dataset3 = dataset2.foreach(...), the result of this action is an RDD (dataset3) which can use the cached data, am I right? – Nick Nov 22 '15 at 21:19
  • 2
    Exactly. map IS NOT an action (lazy evaluation); foreach IS an action and will trigger the cache to be populated before the foreach runs. – Elena Viter Nov 22 '15 at 21:23
  • 1
    http://stackoverflow.com/questions/32636822/would-spark-unpersist-the-rdd-itself-when-it-realizes-it-wont-be-used-anymore can also be relevant. In general I'd suggest not worrying about persistence. Just write the code. Then if you need to improve the performance you can experiment with caching. It may increase or decrease performance. You need to benchmark. – Daniel Darabos Nov 22 '15 at 23:40
  • @Nick I think that Daniel's and Elena's comments are very useful and you should take them into consideration. Caching (persisting) and un-persisting are performance related and there are no secret recipes for that. You'll need to benchmark and test. But the rule of thumb is `when you want to access the data multiple times, you cache it if you have space` – eliasah Nov 23 '15 at 09:37
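
The "tree-like lineage" case from the comments can be made concrete. Below is a minimal Scala sketch, assuming an existing SparkContext `sc`; the input path and transformations are illustrative only and not from the question:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical input; assumes an existing SparkContext `sc`.
    val dataset1 = sc.textFile("hdfs:///some/input")
    val dataset2 = dataset1.map(_.toUpperCase)      // lazy transformation

    // dataset2 feeds two branches, so caching it avoids recomputing
    // the map (and the file read) twice.
    dataset2.persist(StorageLevel.MEMORY_AND_DISK)

    val total = dataset2.count()                               // action: materializes and caches dataset2
    val sample = dataset2.filter(_.startsWith("A")).collect()  // action: reuses the cached partitions

    // Once nothing downstream needs dataset2 any more, free the cache.
    dataset2.unpersist()

With a purely linear chain like the one in the question, the persist call buys nothing, because each partition is computed exactly once anyway.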

1 Answer


Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
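
As a minimal sketch of the difference between manual and automatic eviction, assuming an existing SparkContext `sc` (the input path is illustrative only):

    import org.apache.spark.storage.StorageLevel

    // Assumes an existing SparkContext `sc`; hypothetical input path.
    val rdd = sc.textFile("hdfs:///some/input").map(_.length)
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    rdd.sum()   // first action populates the cache
    rdd.max()   // later actions reuse the cached partitions

    // Without this call the cached partitions would eventually be dropped
    // by the LRU policy anyway; unpersist just frees the space right away.
    rdd.unpersist(blocking = true)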

wanbo