I would like to insert PySpark DataFrame content into Redis in an efficient way. I'm trying a couple of methods, but none of them gives the expected results.
Converting the df to JSON takes about 30 seconds. The goal is to SET the JSON payload into a Redis cluster for consumption.
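To be concrete about what that payload looks like, here is a tiny illustrative sketch (the DataFrame and column names here are made up, not my real data):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

rows = df.toJSON().collect()  # list of per-row JSON strings, e.g. '{"id":1,"val":"a"}'
payload = json.dumps(rows)    # one JSON array string, suitable for a single SET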
I'm also trying the spark-redis library (https://github.com/RedisLabs/spark-redis/blob/master/doc/python.md), so that the results are written to Redis by all the worker nodes, to see if that makes a significant difference. Even this approach takes roughly the same amount of time to get the results into Redis.
I'm looking for expert suggestions on how to clear this bottleneck and see if I can bring it down to less than 5 seconds. Thanks.
I'm using an EMR cluster with 1+4 nodes, each with 16 cores and 64 GB of memory.
import json  # assuming df is the PySpark DataFrame, key1 the Redis key, and redis a connected redis-py client

# Approach 1: collect everything to the driver, then SET one JSON blob
js = json.dumps(df.toJSON().collect())  # takes 29 seconds
redis.set(key1, js)  # takes 1 second

# Approach 2: write from the worker nodes in parallel via spark-redis
df.write.format("org.apache.spark.sql.redis").option("table", key1).mode('append').save()  # takes 28 seconds
In the first approach, converting the df to JSON takes 29 seconds and setting it into Redis takes 1 second. In the second approach, the worker nodes insert the df content directly into Redis, but that still takes about 28 seconds.
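For context, the spark-redis write assumes the Redis connection is configured on the SparkSession (and that the spark-redis jar is on the classpath via --jars or --packages). Mine is set up roughly like this; the host and port below are placeholders, not my actual endpoint:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("redis-write") \
    .config("spark.redis.host", "my-redis-host") \
    .config("spark.redis.port", "6379") \
    .getOrCreate()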