I have a PySpark job that reads from a persistent store (HDFS) and builds a Spark DataFrame in memory; I believe this is called caching.
What I need is this: every night the PySpark job should run and refresh the cache, so that other PySpark scripts can read directly from the cache without going back to the persistent store.
I understand one can use Redis for this, but what are some other options? Would Kafka work?