
We recently started caching RDDs that are reused multiple times, even if those RDDs don't take long to compute.

According to the docs, Spark will automatically evict unused cached data using an LRU strategy.

So is there any drawback to over-caching RDDs? I was thinking that having all that deserialized data in memory could put more pressure on the GC, but is this something we should worry about?
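
For context, this is roughly the pattern we follow (a minimal sketch; the input path and transformations are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-example"))

    // Hypothetical input; any RDD that is reused works the same way.
    val events = sc.textFile("hdfs:///data/events")
      .filter(_.nonEmpty)
      .cache() // marks the RDD for in-memory storage (default MEMORY_ONLY)

    // The RDD is materialized on the first action and reused by later ones.
    val total = events.count()
    val errors = events.filter(_.contains("ERROR")).count()

    println(s"$errors errors out of $total events")
    sc.stop()
  }
}
```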


1 Answer


The main drawback of caching a large number of RDDs is (obviously) that it uses memory. If the cache is limited in size, the LRU strategy doesn't necessarily mean that the least valuable items are evicted. If you cache everything without regard to its value, you may find that expensive-to-recompute but infrequently accessed items get evicted just when you don't want them to be.
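
If GC pressure is also a concern, one option is to be selective instead of caching everything deserialized: persist only the RDDs that are genuinely expensive to recompute, use a serialized storage level, and unpersist explicitly rather than waiting for LRU eviction. A rough sketch (the input path and the `parse` step are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SelectiveCaching {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("selective-caching"))

    // Persist only the expensive-to-recompute RDD, and store it serialized so
    // the executors keep fewer long-lived objects for the GC to track.
    val expensive = sc.textFile("hdfs:///data/raw")
      .map(parse)                                   // assume parse is costly
      .persist(StorageLevel.MEMORY_ONLY_SER)

    val a = expensive.filter(_.contains("A")).count()
    val b = expensive.filter(_.contains("B")).count()

    // Release the cached blocks as soon as the RDD is no longer needed,
    // instead of leaving the decision to the LRU policy.
    expensive.unpersist()

    println(s"a=$a b=$b")
    sc.stop()
  }

  // Placeholder standing in for an expensive transformation.
  private def parse(line: String): String = line.trim
}
```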
