I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.
-
1Check here http://www.cse.yorku.ca/~oz/hash.html – c-smile Oct 08 '11 at 23:30
-
Possible duplicate of [Good Hash Function for Strings](https://stackoverflow.com/questions/2624192/good-hash-function-for-strings) and [What is a good 64bit hash function in Java for textual strings?](https://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings) – M.J. Rayburn Sep 02 '17 at 01:00
-
A good answer to this question is available on other stackexchange site: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – Yann Droneaud Aug 14 '18 at 14:32
4 Answers
To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some perfomance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.

- 73,180
- 20
- 142
- 190
-
-
1@MikeG: that is the "seed" or starting value. This one is commonly known as the "Times 33" hash. – user7116 Oct 08 '11 at 23:32
-
1@sixlettervariables where do i specifiy the table length ? what if it returns a number greater than my table ? – Mike G Oct 08 '11 at 23:35
-
3It can return any valid `unsigned long` value, in theory. It's up to you to manipulate the hash to suit your constraints. – Jonathan Grynspan Oct 08 '11 at 23:43
-
2@MikeG: in general, you do not specify the table size in the hash algorithm (and if you don't know about it, use an already made table...). The table may grow or shrink depending on the number of items (for good implementations), so you just compute the hash, and take the hash modulo the current size to know in which bucket to put it in. – Matthieu M. Oct 09 '11 at 09:57
Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates a optimized hashing function for the input domain.

- 10,975
- 11
- 54
- 84
If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.

- 100,020
- 15
- 103
- 173
-
+1 for MurmurHash, do you know if of a comparison between CityHash and MurmurHash ? I have heard good things about both, but never saw a comprehensive comparison, just had some anecdotical facts. – Matthieu M. Oct 09 '11 at 09:59
A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]
Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

- 12,698
- 8
- 66
- 57

- 7,069
- 9
- 54
- 80
-
1This is a reasonable hash function. I suggest avoiding the unnamed union. -->> `union { uint64_t h; uint8_t u[8]; } uu;` and similar changes in the code -->> `uu.h=strlen(s);` ... `uu.u[i%8] += ...` etc – joop Jan 05 '17 at 11:13