0

I have a question about, when a RDD is stored in memory. Lets say I have this code:

val dataset = originalDataset
  .flatMap(data => modifyDatasetFormat(data, mappingsInMap))
  .persist(StorageLevel.MEMORY_AND_DISK)    

So far I have a RDD which is stored in memory of each worker node.

Questions: If I do another transformation or action to this RDD, will this persistence stop exist and I should create another one or it doesn't have anything to do with it?

If I change partitions in this RDD (e.x hash partitions) will this persistence stop exist and I should create another one or it doesn't have anything to do with it?

Thanks

Alberto Bonsanto
  • 17,556
  • 10
  • 64
  • 93
Nick
  • 2,818
  • 5
  • 42
  • 60

1 Answers1

2

If i do another transformation or action to this rdd, will this persistance stop exist

No.

If i change partitions in this rdd (e.x hash partitions) will this persistance stop exist

No.

Transformations (including repartitions) can't change existing RDD, and in particular they can't unpersist it. Of course

  1. The result of the transformation won't itself be persisted;

  2. (Incorrect, as pointed out in Jem Tucker's comment) You need to transform the persisted RDD, not the one you called persist on. I.e.

    val dataset1 = originalDataset.
      flatMap(data => modifyDatasetFormat(data, mappingsInMap))
    dataset1.persist(StorageLevel.MEMORY_AND_DISK)
    val dataset3 = dataset1.map(...)
    

    will recalculate dataset1. Instead you need

    val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
    val dataset3 = dataset2.map(...)
    
Alexey Romanov
  • 167,066
  • 35
  • 309
  • 487
  • So in other words....for every new rdd that is created i have to persist it and unpersist the old one.Is that right? – Nick Nov 22 '15 at 18:00
  • I believe your point 2 is incorrect, when you call persist on an rdd the return value is the 'this' reference, so both your examples are in fact effectively the same. – Jem Tucker Nov 22 '15 at 18:46
  • Lets say you have this: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.foreach(...). If you do a transformation on the dataset2 then you have to persist it and pass it to dataset3. Am i correct? – Nick Nov 22 '15 at 19:14
  • You should only need to persist datasets that are used multiple times – Jem Tucker Nov 22 '15 at 21:04
  • @Nick No. If you had to do this every time, why wouldn't Spark do it automatically? The point of persisting an RDD is to avoid recalculating it; if it's only used for one action or transformation, it wouldn't be recalculated anyway. – Alexey Romanov Nov 22 '15 at 22:15
  • @JemTucker Yes, you are right. I misremembered something there. – Alexey Romanov Nov 22 '15 at 22:24