
I have a PySpark script that reads from a persistent store (HDFS) and creates a Spark DataFrame in memory. I believe this is called caching.

What I need is this: every night the PySpark job should run and refresh the cache, so that other PySpark scripts can read directly from the cache without going to the persistent store.
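
Roughly what the nightly job does today, as a minimal sketch (the HDFS path and app name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-refresh").getOrCreate()

# Read from the persistent store (HDFS); the path is a placeholder.
df = spark.read.parquet("hdfs:///data/source_table")

# Mark the DataFrame for in-memory caching; the cache is materialized
# on the first action and lives inside this application's executors.
df.cache()
df.count()  # force materialization of the cache

# Note: the cache is scoped to this application; a separate PySpark
# script cannot read it unless both share the same long-running session.
```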

I understand one can use Redis to do this, but what are some other options? Kafka?

  • Kafka (or a message queue) is not really a proper replacement for a cache, as you'd need to read all messages to rebuild any state... HDFS allows you to allocate data in a RAM disk, too, so there's no need for extra infrastructure, but HBase or Kudu would be a good middle ground – OneCricketeer May 03 '21 at 13:44
  • Thank you. Just trying to understand: in what situation is Redis appropriate? When the underlying database is a regular (non-distributed) RDBMS? – Victor May 03 '21 at 20:41
  • I don't use Redis, but it stores key-value pairs, not full dataframes. Fact is, you can use any persistent storage you want, but if it's going to be a database, that can sometimes be just as slow as an HDFS disk read. What do you have against using Spark's built-in checkpoints, broadcast variables, etc.? – OneCricketeer May 04 '21 at 11:50
  • Did not know of broadcast variables, thanks for suggesting. I did not understand how a checkpoint can help here, though. The goal is to reduce disk I/O; that is how the idea of a cache came to me. – Victor May 04 '21 at 20:32
  • Both the persist and cache functions are built into Spark, but disk I/O is somewhat optional, and neither uses any database or queue (see the sketch after this thread): https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk – OneCricketeer May 04 '21 at 22:57
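
For reference, a minimal sketch of the built-in Spark options mentioned in the comments above (cache/persist, checkpoint, and broadcast variables). All paths and names are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-persist-checkpoint").getOrCreate()

# Checkpoints need a directory on reliable storage; this path is a placeholder.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

df = spark.read.parquet("hdfs:///data/source_table")  # placeholder path

# cache() defaults to persist(StorageLevel.MEMORY_AND_DISK): blocks live in
# this application's executors and disappear when the application stops.
cached = df.cache()

# persist() lets you choose the storage level explicitly, e.g. memory only so
# nothing spills to local disk. A fresh DataFrame is used here because a
# storage level cannot be changed once it has been assigned.
mem_only = df.select("*").persist(StorageLevel.MEMORY_ONLY)

# checkpoint() truncates the lineage and writes the data to the checkpoint
# directory; it survives the application, but it is a disk write, not a cache.
checkpointed = df.checkpoint()

# A broadcast variable ships a small read-only lookup to every executor once,
# avoiding repeated reads of the same reference data.
lookup = spark.sparkContext.broadcast({"EU": "Europe", "NA": "North America"})
print(lookup.value["EU"])
```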
