I have about 100 million simple key-value pairs (it's legacy data that never needs updating, and the keys are random strings), and I want to store them in Redis for querying.

My thought was to use the first four characters of each key as a hash key and store the pairs in hash types, so there would be about a million hash keys in Redis, each holding on average about 100 sub-keys.
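Roughly what I'm doing now (a minimal sketch, assuming the redis-py client; the "kv:" prefix is just illustrative):

    import redis

    r = redis.Redis()  # assumes a Redis instance on localhost

    def store(key, value):
        # the first four characters of the key pick the Redis hash;
        # the full key becomes a field inside that hash
        r.hset("kv:" + key[:4], key, value)

    def lookup(key):
        # apply the same rule to find the right hash when reading back
        return r.hget("kv:" + key[:4], key)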

But things didn't go as planned. For some reason, some hash keys ended up with only one sub-key while others have more than 500,000 sub-keys, which Redis may not encode in memory very efficiently.

So I'd like to know: is there a simple, understandable algorithm that can divide my 100 million strings evenly into 100 thousand buckets (ints)? When I look up a string, I should be able to find its bucket by applying the same algorithm.

Thanks!

1 Answer


Using only a small portion of the string to compute the hash can be a problem, because your strings could, for example, all share the same prefix.

There is a description of string hash functions that take the entire string into account at http://www.javamex.com/tutorials/collections/hash_function_technical_2.shtml and in "Good Hash Function for Strings" (they actually give two different descriptions of the same function).

One way to look at this is that it regards the characters of a string as the coefficients A, B, C, D, ... of a polynomial of the form A + Bx + Cx^2 + Dx^3 + ..., where in this case x is 31 and the arithmetic is modulo 2^32. If x is well chosen, this is a scheme with which there is a lot of experience, and some maths applies that gives it good properties.

Even better is to do the arithmetic modulo the size of the hash table, and to choose the size of the hash table to be a prime. Since your data is static, it is worth trying a few different primes of around your preferred table size and a few different values of x, and picking the combination that gives you the most evenly populated table.
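Here is a sketch of that scheme in Python (illustrative only: x = 31 is the usual multiplier, the primes near 100,000 are just candidates, and the sample keys are placeholders for a large sample of your real data):

    from collections import Counter

    def bucket_of(s, table_size, x=31):
        # polynomial hash of the whole string, reduced modulo the table size
        h = 0
        for ch in s:
            h = (h * x + ord(ch)) % table_size
        return h

    def max_bucket_load(keys, table_size, x):
        # size of the fullest bucket; smaller means a more even table
        counts = Counter(bucket_of(k, table_size, x) for k in keys)
        return max(counts.values())

    # Since the data is static, test a few primes near the preferred
    # table size and a few multipliers, and keep the flattest result.
    sample_keys = ["k3x9ab", "q81mzp", "a77f0c"]  # substitute real keys
    for table_size in (99991, 100003, 100019):  # primes near 100,000
        for x in (31, 37, 41):
            print(table_size, x, max_bucket_load(sample_keys, table_size, x))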
