Description
I have a rather large set of unique (string, string, string) tuples (around 40 million, but it can get bigger). For each tuple I compute a single unsigned int value. I would like to store those values somewhere so they can be reused after they are generated, even after the application goes down, so in-memory storage is out of the question, and unfortunately so are databases.
At first I stored them in a file as tuples (string, string, string, value), but reading in 40 million records takes time (and I need the values almost instantly).
I decided to first compute the hash value of each (string, string, string) tuple, then normalize it to [0, n] (where n is the number of values), and store only the values in a binary file, sorted by the normalized hash value. After that I can simply mmap() this file and read a value back with mmap[normalize(hash(string, string, string))].
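For context, a minimal sketch of what the lookup side of this scheme looks like (values.bin is a hypothetical file name, hash() and normalize() are the helpers described further down in the question, and error handling is omitted):

#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Forward declarations of the helpers described further down in the question.
unsigned int hash(const std::string& s1, const std::string& s2, const std::string& s3);
std::size_t normalize(unsigned int h);

// Map the pre-sorted binary file of unsigned ints and index straight into it.
unsigned int lookup(const std::string& s1, const std::string& s2, const std::string& s3) {
    int fd = open("values.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const unsigned int* values = static_cast<const unsigned int*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    unsigned int result = values[normalize(hash(s1, s2, s3))];
    munmap((void*) values, st.st_size);
    close(fd);
    return result;
}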
My hash function is straightforward but fast, and it works in my case (I didn't notice any collisions):
std::string concatenatedString = s1 + "." + s2 + "." + s3;
unsigned int hash = 31;
for (std::size_t i = 0; i < concatenatedString.length(); i++) {
    // cast through unsigned char so bytes >= 0x80 don't sign-extend
    hash = hash * 101 + (unsigned char) concatenatedString[i];
}
Normalization is just as straightforward:
((long) n * hash) / max_value
n - the upper bound of my normalized range (around 40 million); I use n rather than (n - lower_bound) because lower_bound = 0
max_value - the maximum value of the original range (UINT_MAX in my case; min_value = 0, so I do not include it in the equation)
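As a function (with a hypothetical constant N standing in for n), that scaling looks roughly like this; the unsigned long long intermediate is what keeps N * hash from overflowing on platforms where long is only 32 bits:

#include <climits>
#include <cstddef>

const std::size_t N = 40000000;  // upper bound of the normalized range (number of values)

// Scale a 32-bit hash from [0, UINT_MAX] down to [0, N].
std::size_t normalize(unsigned int h) {
    return (std::size_t) (((unsigned long long) N * h) / UINT_MAX);
}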
Problem
My hash function doesn't produce uniformly distributed values over the range 0 to 4,294,967,295 (unsigned int), and I don't see how it even could. Because of this, after normalization I get quite a few collisions, which result in data loss (values being overwritten under the same array index).
Are there any clever ways to do what I want to do but without those collisions?
I am fully aware that some collisions may occur. The thing is, with my approach they occur far too often. My hash range is about 100 times larger than the number of elements, so I'm guessing there is a way to do this, but I haven't figured out how yet.
Solution
In the end I changed my hash to MurmurHash, changed my normalization to a simple "modulo newRange", and changed the format of the file (I now store all the data: (string, string, string, value)). The file is pretty big now, but thanks to that I was able to implement a simple collision detection mechanism (double hashing).
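For anyone who ends up with a similar problem, here is a rough sketch of the lookup with collision handling. It assumes the MurmurHash3_x86_32 function from the smhasher reference implementation, plus hypothetical names like Record and TABLE_SIZE and a fixed-size key field; it is an illustration of the idea, not my exact code:

#include <cstdint>
#include <cstring>
#include <string>
#include "MurmurHash3.h"  // MurmurHash3_x86_32 from the smhasher reference implementation

// Hypothetical table size, larger than the element count
// (choosing a prime helps the probe sequence reach every slot).
const std::size_t TABLE_SIZE = 80000000;

// Hypothetical fixed-size record stored in the mmap'ed file.
struct Record {
    char key[96];        // "s1.s2.s3", zero-padded (assumes a bounded key length)
    unsigned int value;
    bool used;           // false for empty slots
};

uint32_t murmur(const std::string& key, uint32_t seed) {
    uint32_t out;
    MurmurHash3_x86_32(key.data(), (int) key.size(), seed, &out);
    return out;
}

// Double hashing: the probe step comes from a second hash (different seed),
// and the stored key is compared on every probe, so a slot collision is
// detected instead of silently returning the wrong value.
const Record* find(const Record* table, const std::string& key) {
    std::size_t idx  = murmur(key, 0) % TABLE_SIZE;
    std::size_t step = (murmur(key, 0x9747b28c) % (TABLE_SIZE - 1)) + 1;  // never 0
    for (std::size_t i = 0; i < TABLE_SIZE; i++) {
        const Record& r = table[idx];
        if (!r.used) return nullptr;                          // empty slot: key not stored
        if (std::strcmp(r.key, key.c_str()) == 0) return &r;  // keys match: real hit
        idx = (idx + step) % TABLE_SIZE;                      // different key in this slot: keep probing
    }
    return nullptr;
}

The same probe sequence is used when writing the file, so a lookup either lands on the matching record or on an empty slot; storing the full key is what makes the comparison (and thus the collision detection) possible.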