
I wrote the bash script below to process Redis keys and values. I have around 45-50 million keys in my Redis, and I want to retrieve all the values and do some processing. The script takes about 1 hour to process 1 million keys, so processing all 50 million would take around 50 hours, which I want to avoid. I'm new to redis-cli - can someone please help me optimize the script below, or suggest a better approach? I'd be really grateful.

My Redis key-value pattern:

Keys - 123.item.media
Values - 93839,abc,98,829 | 38282,yiw,282,282 | 8922,dux,382,993 |

Keys - 234.item.media
Values - 2122,eww,92,211 | 8332,uei,902,872 | 9039,uns,892,782 |

Keys - 839.item.media
Values - 7822,nkp,77,002 | 7821,mko,999,822 |

In the script below I'm iterating over all my keys and counting how many records each key holds. For example, the key 123.item.media has 3 records and 839.item.media has two.

So for the above keys and values, the output should be: Total Count: 8

I'm doing the same for all 50 million keys, which is taking too much time.

My code:

#!/bin/sh
cursor=-1
keys=""
recordCount=0
while [ $cursor -ne 0 ]; do
    # First iteration: start the SCAN at cursor 0
    if [ $cursor -eq -1 ]; then
        cursor=0
    fi
    reply=`redis-cli SCAN $cursor MATCH "*" COUNT 100`
    # The first token of the reply is the next cursor, the rest are keys
    cursor=`expr "$reply" : '\([0-9]*[0-9 ]\)'`
    keys=${reply#[0-9]*[[:space:]]}
    for i in $keys; do
        value=$(redis-cli GET $i)
        # Count the '|'-separated records in the value
        temCount=`echo $value | awk -F\| '{print NF}'`
        recordCount=`expr ${temCount} + ${recordCount}`
    done
done

echo "Total Count: " $recordCount

Appreciate your help in advance!

learn java
  • Consider doing this in another language because you are creating new processes and connections for every command - try C, C++, Python, Perl, PHP or somesuch. – Mark Setchell Oct 28 '17 at 21:24

2 Answers


You are forking too many times in the loop, even for simple things like arithmetic that can be accomplished with Bash builtins. When such things run in a loop executed a few million times, they slow everything down. For example (builtin equivalents are sketched after this list):

  • cursor=$(expr "$reply" : '\([0-9]*[0-9 ]\)')
  • temCount=$(echo $value | awk -F\| '{print NF}')
  • recordCount=$(expr ${temCount} + ${recordCount})
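A minimal sketch of those two steps using builtins only (this assumes bash rather than the script's #!/bin/sh, since it uses arrays and a here-string):

# Split the value on '|' without spawning awk; note that read drops
# the empty field after a trailing '|', whereas awk's NF counts it.
IFS='|' read -ra fields <<< "$value"
temCount=${#fields[@]}

# Arithmetic expansion instead of forking expr:
recordCount=$(( recordCount + temCount ))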

I am not a redis expert. Based on my cursory understanding of redis-cli, you could do this:

redis-cli --scan | sort -u > all.keys
while read -r key; do
  value=$(redis-cli get "$key")
  # do your processing
done < all.keys

If this doesn't speed things up, the next idea would be to split the all.keys file into chunks of a few thousand lines and run a parallel loop for each subset of the keys. If that still isn't fast enough, I recommend exploring the MGET command and changing the loop so that the values are retrieved in batches rather than one by one.
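For instance, a rough sketch of the MGET variant (assuming your keys contain no whitespace, and keeping your awk field-counting logic):

redis-cli --scan > all.keys
# xargs builds one MGET call per 1000 keys; redis-cli prints one value
# per line, and awk totals the '|'-separated fields across all lines.
xargs -n 1000 redis-cli MGET < all.keys |
awk -F'|' '{ total += NF } END { print "Total Count: " total }'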

Also, Bash may not be the best choice for this. I am sure there are better ways to do this in Python or Ruby.

codeforester

A lot of your time is being wasted on 50 million network calls for 50 million keys, one per execution of this line:

value=$(redis-cli GET $i)

To do bulk querying, you can batch the GET commands into groups of, say, 1000, and do a bulk query using the --pipe option.

  --pipe             Transfer raw Redis protocol from stdin to server.
  --pipe-timeout <n> In --pipe mode, abort with error if after sending all data
                     no reply is received within <n> seconds.

An example of mass insertion is given in the official Redis documentation; you can derive bulk reads along similar lines.
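As a rough sketch of that idea, reusing the all.keys file from the other answer (since --pipe reports only a transfer summary rather than the individual replies, this sketch instead feeds newline-separated commands to redis-cli on stdin, which executes them and prints one reply per line; all.cmds and batch. are just placeholder names):

# Turn every key into a GET command, then split into batches of 1000.
sed 's/^/GET /' all.keys > all.cmds
split -l 1000 all.cmds batch.

total=0
for f in batch.*; do
    # One redis-cli invocation per 1000 commands instead of one per key.
    count=$(redis-cli < "$f" | awk -F'|' '{ n += NF } END { print n+0 }')
    total=$(( total + count ))
done
echo "Total Count: $total"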

This will surely give you the required boost and cut your script from 50 hours down to a couple of hours. You can tweak your batch size between 1000, 10000, and 100000 to see what works best for your value data size.

DhruvPathak