Third-party caches like Redis, EhCache, etc. won't help much in managing a changing cache within Spark, especially when the cache belongs to a machine-learning / active-learning system. For example, k-means clustering has to keep learning the cluster centroids and update them after each iteration. If you are dealing with a cache of this nature, there are two main steps to consider:
- Broadcast the data so that each executor has a local cached copy, and
- Iteratively keep refining this broadcast cache, as sketched below.
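To make those two steps concrete, here is a minimal k-means-flavoured sketch in Scala (the points and initial centroids are made up for illustration). It also exposes the catch discussed next: the broadcast itself is read-only, so each refinement means collecting the new centroids on the driver, destroying the old broadcast, and broadcasting a fresh one.

```scala
import org.apache.spark.sql.SparkSession

object IterativeBroadcast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("iterative-broadcast").getOrCreate()
    val sc = spark.sparkContext

    // Toy 2-D points; in practice this is your real dataset
    val points = sc.parallelize(Seq((1.0, 2.0), (1.5, 1.8), (8.0, 9.0), (8.5, 9.5)))

    var centroids = Array((1.0, 1.0), (9.0, 9.0)) // made-up initial guesses
    for (_ <- 1 to 10) {
      // Step 1: broadcast the current cache so each executor has a local copy
      val bc = sc.broadcast(centroids)

      // Step 2: refine. The broadcast itself is read-only, so the refined
      // values have to be collected on the driver and broadcast afresh.
      centroids = points
        .map { p =>
          val nearest = bc.value.minBy(c =>
            math.pow(p._1 - c._1, 2) + math.pow(p._2 - c._2, 2))
          (nearest, p)
        }
        .groupByKey() // simplified: empty clusters just disappear
        .map { case (_, ps) =>
          val n = ps.size
          (ps.map(_._1).sum / n, ps.map(_._2).sum / n)
        }
        .collect()

      bc.destroy() // release the stale copy before re-broadcasting
    }
    println(centroids.mkString(", "))
    spark.stop()
  }
}
```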
But because a broadcast variable is final (read-only) and cannot be updated in place, I took an alternative route where the cached data is not broadcast at all: learn a local corpus per partition and then aggregate the partition-wise corpora into one.
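Roughly, that alternative looks like the sketch below. A toy term-count map stands in for whatever corpus your model actually learns per partition, and the merge step is whatever combine operation makes sense for your state.

```scala
import org.apache.spark.sql.SparkSession

object PartitionWiseCorpus {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partition-corpus").getOrCreate()
    val sc = spark.sparkContext

    val docs = sc.parallelize(
      Seq("spark cache", "spark broadcast", "cache update"), numSlices = 3)

    // Learn a local corpus per partition (here: simple term counts as a
    // stand-in for whatever state your model accumulates)
    val partitionCorpora = docs.mapPartitions { it =>
      val local = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
      it.foreach(_.split("\\s+").foreach(t => local(t) += 1))
      Iterator(local.toMap)
    }

    // Aggregate the partition-wise corpora into one global corpus
    val corpus = partitionCorpora.reduce { (a, b) =>
      (a.keySet ++ b.keySet)
        .map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L)))
        .toMap
    }

    println(corpus)
    spark.stop()
  }
}
```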
Please refer to part 1 of my blog, which talks about keeping track of a changing cache, where I've demoed how to aggregate the partition-wise corpora. Hope it helps.