I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism caches only that part, and the rest is computed on demand.
Unfortunately, the following experiment seems to indicate otherwise:
val acc = new LongAccumulator()
TestSC.register(acc)

// count every element that is actually computed by the upstream map
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()

// request only the first 2 elements of each of the 16 partitions
val sliced = rdd
  .mapPartitions { itr =>
    itr.slice(0, 2)
  }

sliced.count()

// if caching were partial, only 16 * 2 = 32 elements should have been computed
assert(acc.value == 32)
Running it yields the following assertion failure:
100 did not equal 32
ScalaTestFailureLocation:
Expected :32
Actual :100
Turns out the entire RDD was computed, instead of only the first 2 items in each partition. This is very inefficient in some cases, e.g. when you need to determine quickly whether the RDD is empty. Ideally, the caching manager should allow the cache buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen? (Preferably by reusing the existing memory & disk caching mechanism.)
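The closest workaround I can think of so far is to persist the cheap derived RDD instead of the expensive parent, so that only the elements I actually need get computed and cached. This is just a sketch, reusing the same TestSC as above; acc2 / rdd2 / sliced2 are hypothetical names:

val acc2 = new LongAccumulator()
TestSC.register(acc2)

val rdd2 = TestSC.parallelize(1 to 100, 16).map { v =>
  acc2 add 1
  v
}

// persist the slice rather than the parent, so caching only
// materializes 16 * 2 = 32 elements instead of all 100
val sliced2 = rdd2.mapPartitions(_.slice(0, 2))
sliced2.persist()

sliced2.count()   // computes and caches 2 elements per partition
assert(acc2.value == 32)

sliced2.count()   // served from the cache, acc2 stays at 32

This obviously only helps when the subset is known up front; it doesn't give the general incremental cache I'm asking about.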
Thanks a lot for your opinion.
UPDATE 1: It appears that Spark already has 2 classes:
- ExternalAppendOnlyMap
- ExternalAppendOnlyUnsafeRowArray
that support more granular caching of many values. Even better, they don't rely on StorageLevel; instead they make their own decisions about which storage device to use. I'm surprised, however, that they are not available as options for RDD/Dataset caching directly, and are only used for co-group/join/streamOps or accumulators.
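For reference, this is roughly the pattern Spark's own Aggregator follows when it builds an ExternalAppendOnlyMap for combineByKey. It is a sketch based on my reading of the source: the constructor's optional parameters differ between versions, and the collection has to be created inside a running task (it picks up the active TaskContext), so treat it as illustrative only:

import org.apache.spark.util.collection.ExternalAppendOnlyMap

// sketch only: a spillable map built inside a task (e.g. within mapPartitions);
// it decides on its own when to spill sorted chunks to local disk, based on
// task memory pressure, not on any StorageLevel
def combineValuesByKey[K, V, C](
    iter: Iterator[Product2[K, V]],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
  val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)
  combiners.insertAll(iter)   // may spill to disk while inserting
  combiners.iterator          // lazily merges in-memory and spilled data
}

That spill-on-memory-pressure behaviour is exactly the kind of decision I would like the RDD cache to be able to make per element rather than per whole partition.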