
Description

I have a rather large set of unique (string, string, string) tuples (around 40 million, but it can get bigger). For each tuple I compute a single unsigned int value. I would like to store those values somewhere so that, once generated, they can be reused (even after the application goes down, so in-memory storage is out of the question; unfortunately, so are databases).

At first I stored them in a file as (string, string, string, value) tuples, but reading in 40 million records takes time (and I need it almost instantly).

I decided to first compute the hash value of each (string, string, string) tuple, then normalize it to [0, n] (where n is the number of values), and store only the values in a binary file, sorted by the normalized hash value. After that I can simply mmap() this file and get the values with mmap[normalize(hash(string, string, string))].
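
For illustration, a minimal sketch of the lookup side on POSIX (the function name is hypothetical, and it assumes the file holds exactly n contiguous uint32_t values indexed by the normalized hash):

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Maps the precomputed binary file read-only and returns a pointer to
// its records, or nullptr on failure.
const uint32_t* mapValueFile(const char* path, std::size_t n) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, n * sizeof(uint32_t), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<const uint32_t*>(p);
}

// A lookup is then: values[normalize(hash(s1, s2, s3))]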

My hash function is pretty straightforward, but it is fast and works in my case (I didn't notice any collisions):

std::string concatenatedString = s1 + "." + s2 + "." + s3;
unsigned int hash = 31;
for (std::size_t i = 0; i < concatenatedString.length(); i++) {
  hash = hash * 101 + (unsigned int) concatenatedString[i];
}

Same with normalization (straightforward):

((long) n * hash) / max_value

n - the upper bound of my normalized range (so around 40 million; I take n rather than (n - lower_bound) because lower_bound = 0)

max_value - the maximum value of the old range (UINT_MAX in my case; min_value = 0, so I do not include it in the equation)
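
In code, the scaling step might look like this (a sketch; note the 64-bit intermediate, which matters for the reasons discussed in the answers below):

#include <climits>
#include <cstdint>

// Scales a 32-bit hash into [0, n]. The product is computed in 64 bits
// so it cannot wrap before the division.
uint32_t normalize(uint32_t hash, uint32_t n) {
    return static_cast<uint32_t>(static_cast<uint64_t>(n) * hash / UINT_MAX);
}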

Problem

My hash function doesn't produce uniformly distributed values in the range 0 to 4,294,967,295 (unsigned int), and I don't see how it even could. Because of this, after normalization I get quite a few collisions, which result in data loss (values being overwritten under the same array index).

Are there any clever ways to do what I want to do but without those collisions?

I am fully aware some collisions might occur. The thing is, with my approach they tend to occur far too often. My hashing range is 100 times bigger than the number of my elements, so I'm guessing there might be a way to do this, but I haven't figured out how yet.

Solution

In the end I changed my hash to MurmurHash, changed my normalization method to a simple "modulo newRange", and changed the format of the file: I now store all the data (string, string, string, value). The file is pretty big now, but thanks to that I was able to implement a simple collision-detection mechanism (double hashing).
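
A minimal sketch of what such a collision-checking lookup could look like (illustrative only: the record layout is hypothetical, and the two hash values are assumed to come from MurmurHash with different seeds):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One slot per record: the full key is stored so a lookup can verify it
// found the right tuple instead of silently reading a colliding one.
struct Record {
    std::string key;    // s1 + "." + s2 + "." + s3
    uint32_t value = 0;
    bool used = false;
};

// Double hashing: probe i inspects slot (h1 + i * step) mod m. Returns
// the slot holding the key, the first empty slot, or m if the table is
// full.
std::size_t findSlot(const std::vector<Record>& table, const std::string& key,
                     uint32_t h1, uint32_t h2) {
    std::size_t m = table.size();
    std::size_t step = (h2 % m) | 1;  // an odd step visits every slot when m is a power of two
    for (std::size_t i = 0; i < m; ++i) {
        std::size_t idx = (h1 + i * step) % m;
        if (!table[idx].used || table[idx].key == key) return idx;
    }
    return m;
}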

Mateusz Dymczyk
  • Is sqlite also out of the question? It is an embedded sql engine - no servers needed. – sdkljhdf hda Feb 20 '13 at 06:49
  • Do you want a collision-free string to int hash function? No such thing exists. – n. m. could be an AI Feb 20 '13 at 06:52
  • [SQLite](http://www.sqlite.org/) as mentioned by lego, or one of the modern "nosql" databases, could probably be used much simpler and faster than you inventing your own file format and searching/sorting. – Some programmer dude Feb 20 '13 at 06:53
  • @n.m.: I know that a collision-free solution is not possible. The thing is, with my approach collisions are far too frequent. On a small set of 100 values, after normalization I got only 60 values back (because of index overlapping)... – Mateusz Dymczyk Feb 20 '13 at 07:02
  • @JoachimPileborg: yeah I will keep on nagging my superior I guess, but I don't think it will work :) – Mateusz Dymczyk Feb 20 '13 at 07:02
  • You could use a better hash function (show your code btw), but you will still have collisions for about 1% of the values. – n. m. could be an AI Feb 20 '13 at 07:06
  • @n.m.: the code for my hash function is in the question (I actually found it a while back, I think here on SO). The "." used to concatenate the strings is there so that, for instance, "AB" "CD" "E" doesn't produce the same string as "A" "BC" "DE". – Mateusz Dymczyk Feb 20 '13 at 07:09
  • Oh I see. This factor of 101 smells fishy. Try using a prime number between 2^31 and 2^32 instead. Or just use CRC32 as your function. – n. m. could be an AI Feb 20 '13 at 08:05
  • I do not understand what your goal is: `mmap[normalize(hash(string, string, string))]` => you use the 3 strings to compute the hash value, normalize it, and use that to retrieve the hash value... that you just computed? What do you want to reuse the values for? Can you not just recompute them on the fly? – Matthieu M. Feb 20 '13 at 08:13
  • no, the mmapped file contains some values I computed beforehand using that tuple (string, string, string). The computation takes time and external data, so I want to do it once and then just read it. – Mateusz Dymczyk Feb 20 '13 at 08:17

4 Answers


I'm actually surprised that you are not getting collisions before you normalize the range of hash values. It looks like you are using an unnormalized range of [0, 2^32). Looking at the Birthday Problem chart here, the probability of a collision with 4*10^7 elements should be higher than 75%. In any case, normalizing the hash outputs to a range equal to the size of the element set practically guarantees a non-trivial number of collisions. Unless you're willing to use a counter for your hash values, I don't see how you'll be able to avoid that.

EDIT: Saw your edit. Even with a range 100 times the number of elements (which is about 4*10^9) you are still likely to get a lot of collisions. As indicated above, the probability of one or more collisions is well over 75%.
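
As a rough sanity check (my own numbers, using the standard birthday approximation: for n keys in m slots, the expected number of colliding pairs is about n^2 / 2m):

#include <cstdio>

int main() {
    double n = 4e7;  // number of tuples
    double slots[] = { 4e7, 4e9, 4294967296.0 };
    const char* label[] = { "n slots   ", "100n slots", "2^32 slots" };
    // Expected colliding pairs ~ n^2 / (2m).
    for (int i = 0; i < 3; ++i)
        std::printf("%s -> ~%.0f expected collisions\n",
                    label[i], n * n / (2 * slots[i]));
}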

There are two things I would suggest:

Choose a Different Hash Function

As you noted, while your hash function is fast, it will not distribute values randomly in the range [0, 2^32). There are several hash functions out there that are both fast and do a better job of distributing hash values across the hash function's range. One that I've used in the past is MurmurHash.

Use a Larger Range

Using a larger range should reduce the risk of collisions. Looking at the chart here again, it looks like 64 bits should be enough to bring the risk of collision below 10^-4. The MurmurHash64A and MurmurHash64B variants will be useful in this case.
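
For reference, the 64-bit variant in Austin Appleby's original MurmurHash2 source has this shape; the wrapper below is a sketch (the implementation itself has to be compiled in separately):

#include <cstdint>
#include <string>

// Declaration as it appears in the MurmurHash2 source.
uint64_t MurmurHash64A(const void* key, int len, uint64_t seed);

uint64_t hashTuple(const std::string& s1, const std::string& s2,
                   const std::string& s3) {
    // Keep the "." separators so ("AB", "CD", "E") and ("A", "BC", "DE")
    // produce different keys, as in the question.
    std::string key = s1 + "." + s2 + "." + s3;
    return MurmurHash64A(key.data(), static_cast<int>(key.size()), 0 /* seed */);
}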

Chris Hayden
  • Note: there is some controversy about MurmurHash and CityHash being potentially *unsafe* (i.e., susceptible to adversarial inputs); SipHash was created specifically to address this issue, so it might be worth considering as well. – Matthieu M. Feb 20 '13 at 08:10
  • MurmurHash was never intended to be cryptographically secure (i.e. prevent discovery of preimages with matching images) if that's what you mean. Thanks for pointing that out, though. – Chris Hayden Feb 20 '13 at 08:29
  • Well security doesn't concern my case. Thanks for the answer, I'm trying it out right now! – Mateusz Dymczyk Feb 20 '13 at 08:35
  • @ChrisHayden: neither is SipHash; however, with adversarial inputs you can provoke hash collisions with MurmurHash and CityHash, and this led to DoS attacks against a variety of web frameworks (such as Ruby on Rails) where the attacker would send a huge number of specifically crafted HTTP parameters that would all end up with the same hash value and thus trigger the worst-case behavior of the hash maps used to store those parameters. SipHash is crafted to resist this kind of attack. – Matthieu M. Feb 20 '13 at 08:52
  • That's interesting. I didn't know that MurmurHash was being used as the de facto hashing algorithm for hash tables in some languages. I can see why collisions would be a serious issue in that case. – Chris Hayden Feb 20 '13 at 12:27

It is not always possible to normalize hashes to unique [0..n] values.

I can suggest two approaches:

  1. Sort your file and use binary search instead of a map (O(log N) lookups); see the sketch below.
  2. Create a second file with an index and implement a hash table over the range [0..5n] (5n may be replaced by any other number greater than n).
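
A sketch of option 1, assuming the file is an array of fixed-size (hash, value) records sorted by hash (the record layout is mine):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// One fixed-size record in the sorted file.
struct Entry {
    uint32_t hash;
    uint32_t value;
};

// Binary search (O(log N)) over the mmap'd, hash-sorted records.
// Returns false if no record with this hash exists.
bool lookup(const Entry* entries, std::size_t n, uint32_t h, uint32_t& out) {
    const Entry* end = entries + n;
    const Entry* it = std::lower_bound(entries, end, h,
        [](const Entry& e, uint32_t key) { return e.hash < key; });
    if (it == end || it->hash != h) return false;
    out = it->value;
    return true;
}
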
Толя

You're saying that you're using this for normalization:

((unsigned int) n * hash) / max_value

and you say that max_value is UINT_MAX:

“max_value - maximum value of the old set (UINT_MAX”

And hash is declared as unsigned int.

Well, you know, then the above can only produce the values 0 and 1, which guarantees collisions.
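
A tiny demonstration with made-up numbers: in all-32-bit arithmetic the product wraps modulo 2^32 before the division, so the result collapses, while a 64-bit intermediate lands in [0, n] as intended.

#include <climits>
#include <cstdio>

int main() {
    unsigned int n    = 40000000u;    // size of the normalized range
    unsigned int hash = 3000000000u;  // some hash value
    // n * hash wraps modulo 2^32; dividing the wrapped product by
    // UINT_MAX yields 0 (or 1 in the single case where it equals
    // UINT_MAX exactly).
    unsigned int bad  = (unsigned int) n * hash / UINT_MAX;
    // Widened first: the full product survives and the quotient falls
    // in [0, n].
    unsigned int good = (unsigned int) ((unsigned long long) n * hash / UINT_MAX);
    std::printf("32-bit arithmetic: %u, 64-bit arithmetic: %u\n", bad, good);
}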

Are you aware of the difference between integer and floating point division in C++?

If not, then I suggest getting a C++ textbook.


By the way, casts like "(unsigned int) blah" are a sure way to create bugs. They tell the compiler to shut up, to not tell you about possible problems, because, you're telling it, you know better. But you don't.

Cheers and hth. - Alf
  • ah yeah, my mistake, I am casting there to (long), not (unsigned int), and it indeed gets me values in [0, n]. And yeah, it's been a while since I did any C++, still trying to remember stuff. +1 for the link to the textbook! – Mateusz Dymczyk Feb 20 '13 at 08:15
  • on Windows `long` and `int` are the same size, namely 32 bits. – Cheers and hth. - Alf Feb 20 '13 at 08:16

As far as I understand, you need a unique hash (which is actually impossible :)):

In Java, String.hashCode() gives you a 32-bit hash code.

If you want (say) a 64-bit hash code, you can easily implement it yourself.

If you want a cryptographic hash of a String, the Java crypto libraries include implementations of MD5, SHA-1 and so on. You'll typically need to turn the String into a byte array and then feed that to the hash generator / digest generator. For example, see @Boris Pavlović's answer.

If you want a unique hash code, you are out of luck. Hashes and hash codes are non-unique.

A Java String of length N has 65536 ^ N possible states and requires an integer with 16 * N bits to represent all possible values. If you write a hash function that produces integers in a smaller range (e.g. fewer than 16 * N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the Pigeonhole Principle, and there is a straightforward mathematical proof. (You can't fight math and win!)

Saqlain
  • Unique hashes are possible under reasonable circumstances, see [Perfect hashing](http://en.wikipedia.org/wiki/Perfect_hash_function) – MSalters Feb 20 '13 at 08:00