I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism caches only that part, and the rest is computed on demand.
Unfortunately, the following experiment seems to indicate otherwise:
val acc = new LongAccumulator()
TestSC.register(acc)

// count every element that is actually computed by the upstream map
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()

// request only the first 2 elements of each of the 16 partitions
val sliced = rdd
  .mapPartitions { itr =>
    itr.slice(0, 2)
  }

sliced.count()

// if caching were partial, only 16 * 2 = 32 elements should have been computed
assert(acc.value == 32)
Running it yields the following assertion failure:
100 did not equal 32
ScalaTestFailureLocation:
Expected :32
Actual :100
Turns out the entire RDD was computed, instead of only the first 2 items in each partition. This is very inefficient in some cases, e.g. when you need to determine quickly whether the RDD is empty. Ideally, the caching manager should allow the cache buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen? (Preferably by reusing the existing memory & disk caching mechanism.)
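The closest workaround I can think of so far is to persist the cheap derived RDD instead of the expensive parent, so that only the elements I actually need get computed and cached. This is just a sketch, reusing the same TestSC as above; acc2 / rdd2 / sliced2 are hypothetical names:

val acc2 = new LongAccumulator()
TestSC.register(acc2)

val rdd2 = TestSC.parallelize(1 to 100, 16).map { v =>
  acc2 add 1
  v
}

// persist the slice rather than the parent, so caching only
// materializes 16 * 2 = 32 elements instead of all 100
val sliced2 = rdd2.mapPartitions(_.slice(0, 2))
sliced2.persist()

sliced2.count()   // computes and caches 2 elements per partition
assert(acc2.value == 32)

sliced2.count()   // served from the cache, acc2 stays at 32

This obviously only helps when the subset is known up front; it doesn't give the general incremental cache I'm asking about.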
Thanks a lot for your opinion.
UPDATE 1: It appears that Spark already has 2 classes:
- ExternalAppendOnlyMap
- ExternalAppendOnlyUnsafeRowArray
that support more granular caching of many values. Even better, they don't rely on StorageLevel; instead they make their own decisions about which storage device to use. I'm surprised, however, that they are not available as options for RDD/Dataset caching directly, and are only used for co-group/join/streamOps or accumulators.
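For reference, this is roughly the pattern Spark's own Aggregator follows when it builds an ExternalAppendOnlyMap for combineByKey. It is a sketch based on my reading of the source: the constructor's optional parameters differ between versions, and the collection has to be created inside a running task (it picks up the active TaskContext), so treat it as illustrative only:

import org.apache.spark.util.collection.ExternalAppendOnlyMap

// sketch only: a spillable map built inside a task (e.g. within mapPartitions);
// it decides on its own when to spill sorted chunks to local disk, based on
// task memory pressure, not on any StorageLevel
def combineValuesByKey[K, V, C](
    iter: Iterator[Product2[K, V]],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
  val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)
  combiners.insertAll(iter)   // may spill to disk while inserting
  combiners.iterator          // lazily merges in-memory and spilled data
}

That spill-on-memory-pressure behaviour is exactly the kind of decision I would like the RDD cache to be able to make per element rather than per whole partition.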