What's a good hash function for English words?

Question

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.

Possible duplicate of [Good Hash Function for Strings](https://stackoverflow.com/questions/2624192/good-hash-function-for-strings) and [What is a good 64bit hash function in Java for textual strings?](https://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings) — M.J. Rayburn, Sep 02 '17 at 01:00
A good answer to this question is available on other stackexchange site: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed — Yann Droneaud, Aug 14 '18 at 14:32

leonbloy · Accepted Answer · 2020-09-25T12:35:20.237

24

To simply sum the letters is not a good strategy because a permutation gives the same result.

This one (djb2) is quite popular and works nicely with ASCII strings.

unsigned long hashstring(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

More info here.

If you need more alternatives and some perfomance measures, read here.

Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.

edited Sep 25 '20 at 12:35

answered Oct 08 '11 at 23:28

leonbloy

73,180
20
142
190

No, it's just a "seed", rather arbitrary. – leonbloy Oct 08 '11 at 23:32
1

@MikeG: that is the "seed" or starting value. This one is commonly known as the "Times 33" hash. – user7116 Oct 08 '11 at 23:32
1

@sixlettervariables where do i specifiy the table length ? what if it returns a number greater than my table ? – Mike G Oct 08 '11 at 23:35
3

It can return any valid `unsigned long` value, in theory. It's up to you to manipulate the hash to suit your constraints. – Jonathan Grynspan Oct 08 '11 at 23:43
2

@MikeG: in general, you do not specify the table size in the hash algorithm (and if you don't know about it, use an already made table...). The table may grow or shrink depending on the number of items (for good implementations), so you just compute the hash, and take the hash modulo the current size to know in which bucket to put it in. – Matthieu M. Oct 09 '11 at 09:57

score 10 · Answer 2 · answered Oct 08 '11 at 23:25

10

Maybe something like this would help you: http://www.gnu.org/s/gperf/

It generates a optimized hashing function for the input domain.

answered Oct 08 '11 at 23:25

Fionn

10,975
11
54
84

score 6 · Answer 3 · answered Oct 08 '11 at 23:29

6

If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.

http://en.wikipedia.org/wiki/MurmurHash

http://code.google.com/p/smhasher/wiki/MurmurHash3

If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.

http://www.openssl.org/docs/crypto/sha.html

answered Oct 08 '11 at 23:29

selbie

100,020
15
103
173

+1 for MurmurHash, do you know if of a comparison between CityHash and MurmurHash ? I have heard good things about both, but never saw a comprehensive comparison, just had some anecdotical facts. – Matthieu M. Oct 09 '11 at 09:59

score 3 · Answer 4 · edited Nov 24 '20 at 13:48

A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:

uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
    union { uint64_t h; uint8_t u[8]; } uu;
    int i=0; uu.h=strlen(s);
    while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
    return uu.h; //64-bit
    //return (uu.h+(uu.h>>32)); //32-bit
}

The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]
Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)

(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

This is a reasonable hash function. I suggest avoiding the unnamed union. -->> `union { uint64_t h; uint8_t u[8]; } uu;` and similar changes in the code -->> `uu.h=strlen(s);` ... `uu.u[i%8] += ...` etc — joop, Jan 05 '17 at 11:13

What's a good hash function for English words?

4 Answers4

Linked

Related