
I am retrieving a large dataset from BigQuery, performing some transformations on it, and storing the result in a pandas DataFrame.

Since this data doesn't need to be retrieved from the database every time, I want to cache it so I can avoid hitting the database and repeating the same transformations.

Let's assume that the data is bigger than 512 MB, which is the maximum size of a string value in Redis.

I am considering using Redis Cluster to distribute this caching process, but I don't know how I should store the data.

There are two ways that I can think of for this purpose:

  1. Based on this thread and this one, we can compress the DataFrame with zlib and store it under a single key (see the first sketch after this list). In this case, if the data is greater than 512 MB, does Redis automatically split it across the cluster nodes?
  2. Storing each row of the DataFrame as its own key in Redis (see the second sketch after this list). In this case, I am not sure how to read the data back as a pandas DataFrame.
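
For option 1, a minimal sketch of what I have in mind, assuming a plain redis-py client against a single node at localhost:6379 (the key name `df_cache` and the TTL are arbitrary). Note that the whole compressed blob would still live under one key, i.e. on one cluster node, and would still be subject to the 512 MB string limit:

    import pickle
    import zlib
    from typing import Optional

    import pandas as pd
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def cache_dataframe(key: str, df: pd.DataFrame, ttl_seconds: int = 3600) -> None:
        # Serialize the whole DataFrame, compress it, and store it under one key.
        blob = zlib.compress(pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL))
        r.set(key, blob, ex=ttl_seconds)

    def load_dataframe(key: str) -> Optional[pd.DataFrame]:
        blob = r.get(key)
        if blob is None:
            return None  # cache miss
        return pickle.loads(zlib.decompress(blob))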
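For option 2, a rough sketch of one key per row, with the original index stored separately so the DataFrame can be rebuilt in order. The key scheme `prefix:{row_id}` and the `prefix:index` key are made up for illustration, and the example runs against a single node; in an actual cluster, a plain MGET across different hash slots would need hash tags or a cluster-aware client that splits the call:

    import pickle

    import pandas as pd
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def cache_rows(prefix: str, df: pd.DataFrame) -> None:
        pipe = r.pipeline()
        for idx, row in df.iterrows():
            # One key per row; in a cluster each key may land on a different node.
            pipe.set(f"{prefix}:{idx}", pickle.dumps(row.to_dict()))
        # Remember the original index order so the DataFrame can be rebuilt.
        pipe.set(f"{prefix}:index", pickle.dumps(list(df.index)))
        pipe.execute()

    def load_rows(prefix: str) -> pd.DataFrame:
        index = pickle.loads(r.get(f"{prefix}:index"))
        # Batch the reads instead of issuing millions of single GETs.
        blobs = r.mget([f"{prefix}:{idx}" for idx in index])
        records = [pickle.loads(b) for b in blobs]
        return pd.DataFrame(records, index=index)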
Amirsalar
  • The second option seems much cleaner. How do you plan to retrieve the frames? By fetching all the keys each time? Do you want to filter by column? – Guy Korland Jul 25 '21 at 10:31
  • As for point 1: Redis Cluster does not split a big value from one key among all members. Redis Cluster distributes the keys (along with their values) among cluster members. More about this: https://redis.io/topics/cluster-tutorial. – usuario Jul 25 '21 at 14:22
  • @GuyKorland I want to store all the information in each row (not filtering anything). So I think if my DataFrame has 5 million records, I will be storing 5 million keys, each being a row id (or index). The challenge now is how to retrieve all that data, in the same order, back into a DataFrame. – Amirsalar Jul 26 '21 at 01:10
  • Perhaps Redis Streams is the right thing for you https://redis.io/topics/streams-intro – Guy Korland Jul 27 '21 at 07:10

0 Answers