
I have a PySpark script that reads from a persistent store (HDFS) and creates a Spark DataFrame in memory. I believe this is called caching.

What I need is this: every night the PySpark job should run and refresh the cache, so that other PySpark scripts can read directly from the cache without going to the persistent store.
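
Roughly what the nightly job does today, as a minimal sketch (the HDFS path and app name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-refresh").getOrCreate()

# Read from the persistent store (HDFS); the path is a placeholder.
df = spark.read.parquet("hdfs:///data/source_table")

# Mark the DataFrame for in-memory caching; the cache is materialized
# on the first action and lives inside this application's executors.
df.cache()
df.count()  # force materialization of the cache

# Note: the cache is scoped to this application; a separate PySpark
# script cannot read it unless both share the same long-running session.
```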

I understand one can use Redis to do this, but what are some other options? Kafka?

  • Kafka (or a message queue) is not really a proper replacement for a cache, as you'd need to read all messages to rebuild any state... HDFS allows you to allocate data in a RAM disk, too, so there's no need for extra infrastructure, but HBase or Kudu would be a good middle ground – OneCricketeer May 03 '21 at 13:44
  • Thank you. Just trying to understand: in what situation is Redis appropriate? When the underlying database is a regular (non-distributed) RDBMS? – Victor May 03 '21 at 20:41
  • I don't use Redis, but it stores key-value pairs, not full dataframes. Fact is, you can use any persistent storage you want, but if it's going to be a database, that can sometimes be just as slow as an HDFS disk read. What do you have against using Spark's built-in checkpoints, broadcast variables, etc.? – OneCricketeer May 04 '21 at 11:50
  • Did not know of broadcast variables, thanks for suggesting. I did not understand how a checkpoint can help here, though. The goal is to reduce disk I/O; that is how the idea of a cache came to me. – Victor May 04 '21 at 20:32
  • Both the persist and cache functions are built into Spark, but disk I/O is somewhat optional, and neither uses any database or queue (see the sketch after this thread): https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk – OneCricketeer May 04 '21 at 22:57
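
For reference, a minimal sketch of the built-in Spark options mentioned in the comments above (cache/persist, checkpoint, and broadcast variables). All paths and names are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-persist-checkpoint").getOrCreate()

# Checkpoints need a directory on reliable storage; this path is a placeholder.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

df = spark.read.parquet("hdfs:///data/source_table")  # placeholder path

# cache() defaults to persist(StorageLevel.MEMORY_AND_DISK): blocks live in
# this application's executors and disappear when the application stops.
cached = df.cache()

# persist() lets you choose the storage level explicitly, e.g. memory only so
# nothing spills to local disk. A fresh DataFrame is used here because a
# storage level cannot be changed once it has been assigned.
mem_only = df.select("*").persist(StorageLevel.MEMORY_ONLY)

# checkpoint() truncates the lineage and writes the data to the checkpoint
# directory; it survives the application, but it is a disk write, not a cache.
checkpointed = df.checkpoint()

# A broadcast variable ships a small read-only lookup to every executor once,
# avoiding repeated reads of the same reference data.
lookup = spark.sparkContext.broadcast({"EU": "Europe", "NA": "North America"})
print(lookup.value["EU"])
```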
