Fast String Hashing Algorithm with low collision rates with 32 bit integer

Question

I have lots of unrelated named things that I'd like to do quick searches against. An "aardvark" is always an "aardvark" everywhere, so hashing the string and reusing the integer would work well to speed up comparisons. The entire set of names is unknown (and changes over time). What is a fast string hashing algorithm that will generate small (32 or 16) bit values and have a low collision rate?

I'd like to see an optimized implementation specific to C/C++.

The following page has several implementations of general purpose hash functions which are "performant" and have low "collision rates": http://partow.net/programming/hashfunctions/index.html — , Oct 31 '10 at 23:08

score 35 · Answer 1 · answered Sep 22 '08 at 10:17

35

Murmur Hash is pretty nice.

answered Sep 22 '08 at 10:17

yrp

4,565
2
24
10

3

Yes, this is the current leading general purpose hash function for hash tables. It's non-crypto, of course, with a pair of obvious differential. – obecalp Feb 16 '09 at 19:21
1

Note: the new URL for MurmurHash3 is https://code.google.com/p/smhasher/ – Emil Styrke Nov 27 '13 at 14:03

score 30 · Accepted Answer · answered Sep 22 '08 at 10:08

30

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.

answered Sep 22 '08 at 10:08

Nick Johnson

100,655
16
128
198

If you're going to use FNV, stick to FNV-1a, since it has acceptable results on the avalanche test (see http://home.comcast.net/~bretm/hash/6.html). Or just use MurmurHash2, which is better in both speed and distribution (http://murmurhash.googlepages.com/). – Steven Sudit Jul 10 '09 at 03:38
7

@Steven: MurmurHash hash has only been analzyed by its author. I've used it in a few different scenarios and the newer version of FNV seems to do a better job. – Jan 25 '11 at 22:03
@sonicoder: While I'm not going to oversell MurmurHash, plain FNV is downright terrible and FNV-1a is only passable. As it happens, MurmurHash has been extensively analyzed and found useful. It's still not a cryptographic hash and there are going to be collisions no matter what, but it's still a huge improvement over any type of FNV. – Steven Sudit Jan 26 '11 at 20:50
8

@Steven Sudit: As I said, it was analyzed "only" by its author and no one else. hence the results of the "analysis" aren't really objective. – Jan 30 '11 at 12:07
6

@sonicoder: I'll speak more bluntly: no, you're mistaken. It was analyzed by a number of third parties, including academic ones. Visit Wikipedia for links. What's more important is that, not only did it do well in general, but the specific flaws that were found were addressed through the creation of MurmurHash3. – Steven Sudit Feb 03 '11 at 07:25
Also, I would appreciate it if whoever's been dittoing sonicoder would take the time to explain themselves. – Steven Sudit Feb 03 '11 at 07:29
I don't understand those bitshift values `hval += (hval<<1) + (hval<<4) + (hval<<7) + (hval<<8) + (hval<<24);` – jokoon Mar 06 '13 at 10:26
FNV hashes are slow and of mediocre quality - do some measurements and read the actual papers (which contain quite a bit of hair-raising nonsense) instead of taking the author's claims at face value and helping them perpetuate the myth that FNV hashes are fast and of good quality. Stick with a hash of proven exceptional quality like MurmurHash or pick one of the many simple hashes that beat FNV in speed and quality. About the only thing that FNV has going for it is its simplicity, if performance is not a concern (or if the cost of operations is skewed, as in interpreted languages). – DarthGizka Sep 17 '16 at 11:03
@DarthGizka, Example of this "one of the many simple hashes that beat FNV in speed and quality"?? I'm really interested. – Feb 15 '21 at 19:54
@Richard: depending on your chosen application you could try self-xoring the rotated accumulator (with a rotation that is relatively prime to the word size) or an xorshift, in any case followed at the end of hashing by a spreading multiplication. CRC has excellent quality and can be really fast if you can use the intrinsics. When things get trickier I'm certain that Bob Jenkins's [extensive and thorough research into hashes](https://burtleburtle.net/bob/hash/index.html) has covered exactly what you need, somewhere. Prime moduli can be fast if you multiply w/ inverse instead of dividing. – DarthGizka Feb 15 '21 at 20:42

score 19 · Answer 3 · answered Dec 16 '08 at 22:25

There's also a nice article at eternallyconfuzzled.com.

Jenkins' One-at-a-Time hash for strings should look something like this:

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        hash += *s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

score 17 · Answer 4 · edited May 23 '17 at 12:02

17

For a fixed string-set use gperf.

If your string-set changes you have to pick one hash function. That topic has been discussed before:

What's the best hashing algorithm to use on a stl string when using hash_map?

edited May 23 '17 at 12:02

Community

1
1

answered Sep 22 '08 at 10:13

Nils Pipenbrinck

83,631
31
151
221

A perfect hash is a very elegant solution, when available. – Steven Sudit Jan 27 '11 at 18:23

Carl Seleborg · Answer 5 · 2008-09-22T13:08:52.407

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

Slow construction (needs lookup and possibly memory allocation)
Requires global data and synchronization in the case of concurrent threads
Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

Cheers,

Carl

Antonio Morales · Answer 6 · 2013-09-29T12:27:08.900

It is never late for a good subject and I am sure people would be interested on my findings.

I needed a hash function and after reading this post and doing a bit of research on the links given here, I came up with this variation of Daniel J Bernstein's algorithm, which I used to do an interesting test:

unsigned long djb_hashl(const char *clave)
{
    unsigned long c,i,h;

    for(i=h=0;clave[i];i++)
    {
        c = toupper(clave[i]);
        h = ((h << 5) + h) ^ c;
    }
    return h;
}

This variation hashes strings ignoring the case, which suits my need of hashing users login credentials. 'clave' is 'key' in Spanish. I am sorry for the spanish but its my mother tongue and the program is written on it.

Well, I wrote a program that will generate usernames from 'test_aaaa' to 'test_zzzz', and -to make the strings longer- I added to them a random domain in this list: 'cloud-nueve.com', 'yahoo.com', 'gmail.com' and 'hotmail.com'. Therefore each of them would look like:


test_aaaa@cloud-nueve.com, test_aaab@yahoo.com, 
test_aaac@gmail.com, test_aaad@hotmail.com and so on.

Here is the output of the test -'Colision entre XXX y XXX' means 'Collision of XXX and XXX'. 'palabras' means 'words' and 'Total' is the same in both languages-.


    Buscando Colisiones...
    Colision entre 'test_phiz@hotmail.com' y 'test_juxg@cloud-nueve.com' (1DB903B7)
    Colision entre 'test_rfhh@hotmail.com' y 'test_fpgo@yahoo.com' (2F5BC088)
    Colision entre 'test_wxuj@hotmail.com' y 'test_pugy@cloud-nueve.com' (51FD09CC)
    Colision entre 'test_sctb@gmail.com' y 'test_iohw@cloud-nueve.com' (52F5480E)
    Colision entre 'test_wpgu@cloud-nueve.com' y 'test_seik@yahoo.com' (74FF72E2)
    Colision entre 'test_rfll@hotmail.com' y 'test_btgo@yahoo.com' (7FD70008)
    Colision entre 'test_wcho@cloud-nueve.com' y 'test_scfz@gmail.com' (9BD351C4)
    Colision entre 'test_swky@cloud-nueve.com' y 'test_fqpn@gmail.com' (A86953E1)
    Colision entre 'test_rftd@hotmail.com' y 'test_jlgo@yahoo.com' (BA6B0718)
    Colision entre 'test_rfpp@hotmail.com' y 'test_nxgo@yahoo.com' (D0523F88)
    Colision entre 'test_zlgo@yahoo.com' y 'test_rfdd@hotmail.com' (DEE08108)
    Total de Colisiones: 11
    Total de Palabras  : 456976

That is not bad, 11 collisions out of 456,976 (off course using the full 32 bit as table lenght).

Running the program using 5 chars, that is from 'test_aaaaa' to 'test_zzzzz', actually runs out of memory building the table. Below is the output. 'No hay memoria para insertar XXXX (insertadas XXX)' means 'There is not memory left to insert XXX (XXX inserted)'. Basically malloc() failed at that point.


    No hay memoria para insertar 'test_epjcv' (insertadas 2097701).

    Buscando Colisiones...

    ...451 'colision' strings...

    Total de Colisiones: 451
    Total de Palabras  : 2097701

Which means just 451 collisions on 2,097,701 strings. Note that in none of the occasions, there were more than 2 collisions per code. Which I confirm it is a great hash for me, as what I need is to convert the login ID to a 40 bit unique id for indexing. So I use this to convert the login credentials to a 32 bit hash and use the extra 8 bits to handle up to 255 collisions per code, which lookign at the test results would be almost impossible to generate.

Hope this is useful to someone.

EDIT:

Like the test box is AIX, I run it using LDR_CNTRL=MAXDATA=0x20000000 to give it more memory and it run longer, the results are here:

Buscando Colisiones... Total de Colisiones: 2908 Total de Palabras : 5366384

That is 2908 after 5,366,384 tries!!

VERY IMPORTANT: Compiling the program with -maix64 (so unsigned long is 64 bits), the number of collisions is 0 for all cases!!!

There seems to be a bug in your code: your `h` needs to be initialized to zero, or you might start with uninitialized memory and get non-reproducible changing hashes which would be kind of bad. — E. T., Aug 05 '22 at 10:58

score 3 · Answer 7 · answered Sep 24 '08 at 04:13

3

The Hsieh hash function is pretty good, and has some benchmarks/comparisons, as a general hash function in C. Depending on what you want (it's not completely obvious) you might want to consider something like cdb instead.

answered Sep 24 '08 at 04:13

James Antill

2,825
18
16

score 3 · Answer 8 · answered Dec 16 '08 at 21:11

3

Why don't you just use Boost libraries? Their hashing function is simple to use and most of the stuff in Boost will soon be part of the C++ standard. Some of it already is.

Boost hash is as easy as

#include <boost/functional/hash.hpp>

int main()
{
    boost::hash<std::string> string_hash;

    std::size_t h = string_hash("Hash me");
}

You can find boost at boost.org

answered Dec 16 '08 at 21:11

Bernard Igiri

1,272
2
18
33

5

Both STL and boost tr1 has extremely weak hash function for strings. – obecalp Feb 16 '09 at 19:19

score 3 · Answer 9 · answered Dec 16 '08 at 21:30

3

Bob Jenkins has many hash functions available, all of which are fast and have low collision rates.

answered Dec 16 '08 at 21:30

user7116

63,008
17
141
172

1

The hashes are quite solid, and technically interesting, but not necessarily fast. Consider that One-at-a-Time hash processes input byte by byte, where other hashes take 4 or even 8 bytes at a time. The speed differnece is substantial! – Steven Sudit Jul 23 '09 at 15:23
Bob's hashes are very fast: http://www.azillionmonkeys.com/qed/hash.html – user7116 Jul 23 '09 at 20:20

score 2 · Answer 10 · answered Sep 22 '08 at 10:06

2

Have a look at GNU gperf.

answered Sep 22 '08 at 10:06

Rob Wells

36,220
13
81
146

Yay for perfect hash generators! – C. K. Young Sep 22 '08 at 10:08
5

Perfect hashing is NOT appropriate for this application, since the set of names is unknown and changes. Therefore, gperf won't work for this. – TimB Sep 24 '08 at 04:43

score 2 · Answer 11 · edited May 23 '17 at 12:25

2

There is some good discussion in this previous question

And a nice overview of how to pick hash functions, as well as statistics about the distribution of several common ones here

edited May 23 '17 at 12:25

Community

1
1

answered Dec 09 '08 at 21:29

AShelly

34,686
15
91
152

score 2 · Answer 12 · answered Dec 16 '08 at 21:34

You can see what .NET uses on the String.GetHashCode() method using Reflector.

I would hazard a guess that Microsoft spent considerable time optimising this. They have printed in all the MSDN documentation too that it is subject to change all the time. So clearly it is on their "performance tweaking radar" ;-)

Would be pretty trivial to port to C++ too I would have thought.

score -1 · Answer 13 · answered Apr 17 '15 at 03:33

Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html

A snippet from the post:

if say we have a character set of capital English letters, then the length of the character set is 26 where A could be represented by the number 0, B by the number 1, C by the number 2 and so on till Z by the number 25. Now, whenever we want to map a string of this character set to a unique number , we perform the same conversion as we did in case of the binary format

How does that (given a hypertext browser that does display that links contents) map to `(32 or 16) bit values`, given character sets from, say, 24 to 1.111.998 code points? Base conversion is not a useful hash function. — greybeard, Apr 17 '15 at 04:01

score -3 · Answer 14 · answered Sep 22 '08 at 10:06

-3

CRC-32. There is about a trillion links on google for it.

answered Sep 22 '08 at 10:06

1800 INFORMATION

131,367
29
160
239

8

CRCs are designed for error detection and correction. Their distribution characteristics are typically not very good. – Nick Johnson Sep 22 '08 at 10:09
1

Arachnid has obviously never tried CRC32 as hashes. They work well. – Nils Pipenbrinck Sep 22 '08 at 10:11
9

"CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose." cf. http://home.comcast.net/~bretm/hash/8.html – obecalp Feb 16 '09 at 21:24

Fast String Hashing Algorithm with low collision rates with 32 bit integer

14 Answers14

Linked