```python
# no actual caching at the end of this statement
rdd1 = spark.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()

# still no actual caching, because Spark is lazy and won't evaluate anything
# until an action is triggered
rdd2 = rdd1.map(mysecondfunc)

# caching happens when this action runs: the result of rdd1 is cached
# in the memory of each worker node
n = rdd1.count()
```
So, to answer your question: if we have performed cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed. The only place Spark can cache anything is on the worker nodes, never on the driver node.
cache() can only be applied to an RDD (refer), and since an RDD lives only in the worker nodes' memory (Resilient Distributed Datasets!), its results are cached in the respective worker nodes' memory. Once you apply an action like count(), which brings the result back to the driver, it is no longer an RDD; it is merely the result of the computation the worker nodes performed on the RDD in their respective memories.
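As a minimal sketch of that distinction (assuming the same hypothetical myfunc and myfile.json as in the example at the top), the action returns a plain Python value on the driver, while the RDD object itself stays distributed and is what actually gets cached:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

def myfunc(row):
    return row          # stand-in for the hypothetical myfunc used above

# cache() only marks rdd1 for caching; nothing is materialized yet
rdd1 = spark.read.json('myfile.json').rdd.map(lambda row: myfunc(row)).cache()

n = rdd1.count()        # action: each executor computes and caches its partitions

print(type(n))                 # <class 'int'> -- a plain value on the driver, not an RDD
print(rdd1.is_cached)          # True -- the distributed RDD is what is cached
print(rdd1.getStorageLevel())  # the MEMORY_ONLY level that cache() implies for RDDs
```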
Since cache() in the example at the top was called on rdd1, which still lives across multiple worker nodes, the caching happens only in the worker nodes' memory.
In that example, when we later run another map operation followed by an action on rdd1, Spark won't read the JSON file again, because rdd1 was already cached.
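A rough way to observe that reuse (just a sketch, not a benchmark; it assumes the cached rdd1 and the hypothetical mysecondfunc from above, and a file large enough for the difference to show): the first action pays for reading and parsing the JSON, while subsequent actions read the already-cached partitions from executor memory:

```python
import time

start = time.time()
rdd1.count()                        # first action: reads the JSON, computes, and caches
print("first action:  %.2fs" % (time.time() - start))

start = time.time()
rdd1.map(mysecondfunc).count()      # later action: partitions are served from the cache
print("second action: %.2fs" % (time.time() - start))
```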
FYI, I am using the word memory based on the assumption that the caching level is set to MEMORY_ONLY (the default for RDD.cache()). Of course, if that level is changed to something else, Spark will cache to memory or to disk, depending on the setting.
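For completeness, here is a sketch of choosing a different level explicitly with persist() (for RDDs, cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY)); the file name is the same hypothetical one as above:

```python
from pyspark import StorageLevel

# Partitions that don't fit in executor memory are spilled to local disk
# instead of being recomputed (MEMORY_ONLY would recompute evicted partitions).
rdd3 = spark.read.json('myfile.json').rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd3.count()   # action: partitions are now cached in memory and/or on disk
```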