
I'm working on a hash table in C and I'm testing a hash function for strings.

The first function I tried adds up the ASCII codes and takes a modulo (% 100), but I got poor results with the first test data: 40 collisions for 130 words.
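For reference, this is roughly what that first attempt looks like (a minimal sketch; the function name naive_hash is just for illustration):

unsigned int naive_hash(const char *str)
{
    unsigned int sum = 0;
    while (*str)
        sum += (unsigned char)*str++;   /* add up the ASCII codes */
    return sum % 100;                   /* reduce to 100 buckets */
}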

The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a .txt file.

  • Which is the best algorithm for hashing strings?
  • And how do I determine the size of the hash table?
Gabriel Staples
lilawood
  • If your hash table has 10K entries, why would you use modulo 100? Getting 40 collisions out of 130 words isn't surprising with such a small modulus. – Carey Gregory Oct 05 '11 at 19:24
  • There are many string hash implementations available on both google and SO (read: more searching is in order). Many approaches use a "barrel shift" or "rolling" hash (possibly with "mixing" phases) -- but please pay heed to Gregory! –  Oct 05 '11 at 19:26
  • See http://burtleburtle.net/bob/hash/evahash.html and http://www.partow.net/programming/hashfunctions/ for resources about various kinds of hashing (from general to string to crypto). –  Oct 05 '11 at 19:33
  • To clarify @CareyGregory: You do realize that, as a basic mathematical truth, 130 items in 100 buckets (i.e., mod 100) must produce 30 collisions (where a collision is counted each time a second, third, etc. item is put in a bucket), correct? So you're only a little above that. – derobert Oct 05 '11 at 19:34
  • @derobert: Yes, a minimum of 30 collisions. But that's not my question. The OP states the hash table has 10K entries. So why a modulus of 100? Unless I misunderstand what lilawood meant, that would leave 9900 entries in the hash table unusable. – Carey Gregory Oct 05 '11 at 19:40
  • @CareyGregory: I think it's the 8,000-word version that is using the 10,000-entry hash, hopefully with a higher modulus. I'm guessing the 130-word version (with mod 100) is a reduced-size problem for easier debugging, etc. That clarification is aimed at lilawood, to explain why (as you commented) 40 collisions isn't surprising. – derobert Oct 05 '11 at 19:48
  • I've tried only a small part of input data to test the algorithm as @derobert said. – lilawood Oct 05 '11 at 19:54
  • @lilawood: OK, that's what I figured, but to be a better test you should use 80 words with a hash table of 100 entries. That would give you the same proportions as your live data and wouldn't force collisions. – Carey Gregory Oct 05 '11 at 21:29
  • Possible duplicate of [Good Hash Function for Strings](https://stackoverflow.com/questions/2624192/good-hash-function-for-strings) – M.J. Rayburn Sep 02 '17 at 01:01
  • the best answers for this question can be found on another stackexchange site: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – Yann Droneaud Aug 14 '18 at 14:28
  • Related: [Can CRC32 be used as a hash function?](https://stackoverflow.com/q/10953958/4561887) – Gabriel Staples Feb 22 '22 at 16:52

11 Answers


I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
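
A usage sketch, assuming the hash() function above (the TABLE_SIZE and example word are illustrative; the caller does the modulo reduction, as discussed in the comments below):

#include <stdio.h>

#define TABLE_SIZE 10000   /* table size from the question; purely illustrative */

int main(void)
{
    char word[] = "example";
    unsigned long h = hash((unsigned char *)word);  /* djb2 hash from above */
    unsigned long slot = h % TABLE_SIZE;            /* caller reduces the hash to a table index */
    printf("\"%s\" -> hash %lu -> slot %lu\n", word, h, slot);
    return 0;
}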
cnicutar
  • the page linked in the answer is very interesting. – Adrien Plisson Oct 05 '11 at 19:31
  • how the program runs out of the while loop?? =S – Daniel N. May 19 '15 at 01:30
  • @danfly09 When c is zero. The equivalent of while(c = *str++) would be (0 != (c = *str++)) – rxantos Jul 05 '15 at 19:58
  • I don't understand how this works, this is gonna give me a big number right? What do I have to do if I have a table of size 13, take the mod 13 of the number this function gives me? – Josepas May 07 '16 at 16:01
  • @Josepas the hash function should ideally return a `size_t` or other such unsigned value (such as the unsigned long in this code). The *caller* is responsible for taking modulo of the result to fit it to the hash table. The caller controls the table slot being hashed to; not the function. It just returns some unsigned number. – WhozCraig Aug 18 '16 at 06:58
  • amazing. this algorithm beat the hell out of Murmur hash, FNV variants hashes and many others! +1 – David Haim Sep 04 '16 at 14:40
  • I just posted the K&R 1st & 2nd edition hash algorithms here (https://stackoverflow.com/a/45641002/4561887) too. Although the 1st edition is purported by Dan Bernstein to be pretty terrible, I'm guessing the 2nd edition one is pretty decent. I think it's good to be aware of both of these for at least historical reasons and general awareness. – Gabriel Staples Aug 11 '17 at 17:52
  • Why isn't `c` of `unsigned char` type or `unsigned long` type. Is there any special reason for it to be of `int` type? – CopperCash Dec 25 '17 at 01:02
  • @CopperCash The reason so is because an int == char, but save headache of worrying about <<'s and >>'s, esp in unoptimized code - remember a computer allocates char as a WORD and not a singular byte (depending on architecture). Fortunately, signedness of these characters do not matter, because how C handles it on `+ c;` Check it out here https://ideone.com/4XasWc – user2262111 Apr 19 '18 at 23:49
  • https://ideone.com/AV35CR This link contains the same function using `stdint.h`, however, ideone which uses `gcc 6.3.0` and had troubles using a `char*` for `unsigned char*` input. Use the latest compiler :) – user2262111 Apr 20 '18 at 00:14
  • @user2262111 thx for the explanation. So, `long` or `unsigned long` should work as well, right? Because it is "wider" than a WORD. – CopperCash Apr 20 '18 at 12:01
  • @CopperCash a WORD is 16bits, a DWORD is your typical 32bits. longs are usually DWORD size. long is also analogous to long int. long long however have more bits to them. You probably want the standard wide variants of these. Like wchar_t for example. – user2262111 Sep 02 '18 at 17:40
  • Hi, There is a slight improvement required in your code, after h=((h<<5)+h)+c; you should also be doing , h=h%MAX, where MAX is the max range for the HASH table, else there are chances of overflow, which may not be detected. – AnotherDeveloper Mar 30 '19 at 05:32
  • Any particular reason why this takes `unsigned char*` instead of `char*`? I mean, textual strings are usually represented simply as `char*`. Am I missing something? Does the algorithm assume `unsigned char*`? – Aviv Cohn May 02 '19 at 21:41
  • the formula at the link shared says, `hash(i) = hash(i - 1) * 33 ^ str[i];` but it is actually implemented as `hash(i) = (hash(i - 1) * 33) + str[i];`...is the former just a typo? – thunderbird Jul 07 '19 at 07:29
  • @AvivCohn, if it took `char*`, then `c` would be a negative number whenever `*str++ < 0` on systems where `char` is signed. – Emile Cormier Feb 14 '20 at 02:10
  • @thunderbird It says "**another version** of this algorithm (now favored by bernstein) uses xor" [emphasis mine] – Emile Cormier Feb 14 '20 at 02:20
  • @user2262111 For me (gcc on Linux x64), `sizeof(long)` is 8 bytes, also using word to measure size isn't very portable, since different processors have different word size. All of this confusion can be avoided by using stdint.h – Shipof123 Sep 09 '20 at 22:05
  • why this uses an unsigned char and how can i use that if use char – codemonkey Jan 01 '21 at 18:02
  • @codemonkey I prefer people to use stdint.h which I touched on at the end. I require in all my programs the size I require. However, when you malloc(1 * sizeof(char)) regardless of your system size of 1 or 2 bytes per char, you get a system WORD instead. My description of WORD sizes seemed to leave out the fact that DWORD and WORD are used interchangeably because people. The true canonical term for WORD is 4 bytes on a 32bit processor and 8 bytes on a 64bit processor. Therefore your char whether perceived as 1 or 2 bytes is still 8. This is due to register girth. – user2262111 Jan 20 '21 at 21:43
  • And don't get me started on what malloc does. – user2262111 Jan 20 '21 at 21:44
  • @HumbleDeveloper It doesn't matter if it overflows. – doug65536 Jul 09 '21 at 14:13
  • Why 33 instead of 31? 33 isn't prime, 31 is. Java uses this algorithm with 31 instead of 33 for `String.hashCode`. – doug65536 Jul 09 '21 at 14:15
  • Seems like this function can't handle large strings, can it? Does it employ overflow as a feature? What are the boundaries of possible values of this function? – IC_ Sep 17 '22 at 13:42

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:

int hash(char const *input) { 
    int result = 0x55555555;

    while (*input) { 
        result ^= *input++;
        result = rol(result, 5);   /* rol() = rotate left; see the comments below for a one-line definition */
    }
    return result;
}

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
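For example, a sketch of the two reduction styles described above (the table sizes and hash value here are illustrative, not taken from the question):

#include <stdio.h>

int main(void)
{
    unsigned hash = 0xDEADBEEFu;                 /* some hash value, purely illustrative */

    unsigned pow2_size  = 16384;                 /* power-of-2 size: reduce with a bit mask */
    unsigned idx_mask   = hash & (pow2_size - 1);

    unsigned prime_size = 10007;                 /* prime size: reduce with a modulo */
    unsigned idx_mod    = hash % prime_size;

    printf("mask index = %u, modulo index = %u\n", idx_mask, idx_mod);
    return 0;
}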

Jerry Coffin
  • This isn't c, but I would be interested in your thoughts to this related answer: http://stackoverflow.com/a/31440118/3681880 – Suragch Jul 15 '15 at 21:24
  • @Suragch: Since I wrote this, quite a few processors have started to include special hardware to accelerate SHA computation, which has made it much more competitive. That said, I doubt your code is quite as safe as you think--for example, IEEE floating point numbers have two different bit patterns (0 and -0) that should produce the same hashes (they'll compare as equal to each other). – Jerry Coffin Jul 15 '15 at 22:02
  • @Jerry Coffin which library do I need for the rol() function? – thanos.a Mar 28 '20 at 15:50
  • @thanos.a: I'm not aware of it being in a library, but rolling your own only takes a line or two of code. Shift one chunk left, the other chunk right, and or them together. – Jerry Coffin Mar 29 '20 at 04:40
  • @thanos.a, you can hand-roll it like `static inline unsigned rol(unsigned r, int k) {return (r << k) | (r >> (32 - k));}` (assuming 32-bit integers). At least GCC on x86-64 does compile this down to one instruction. – vonbrand Jul 22 '21 at 00:27
  • Some critique: `result` need only be non-0; 1 is fine. Good, XORing input beats ADD. Using `+=` for `ROL` is much better than `=`. `ROL` can be `SHL`, if needed. Changing 5 to 7 reduces collisions. Two line finalizer `h += h<<17; h ^= h>>12;` will avalanche a bit, improving uniformity. With that, potentially the best minimal 32-bit hash/checksum in 8 instructions, beating comparable functions like ELFHash/BKDRHash/DJB2. Though ultimately sub-optimal on 64-bit machines. – bryc May 26 '22 at 12:17
  • @bryc: Using a left shift instead of a rotate means that if you hash a long enough string, the characters you hash first will have no effect on the final result. – Jerry Coffin May 26 '22 at 16:00
  • With your original code, or mine ([code here](https://replit.com/@bryc1/shifty-hash#main.c))? Can't seem to reproduce that issue on mine; changing the first char of a long string changes hash as expected. I'm also noticing if I switch to `ROL 7` instead of `SHL 7` in my function, it actually appears to fail more collision tests, specifically a single rolling byte test (starting from index 0, a byte of 1 moves right across 16384 null bytes, individually hashing the full array each step). – bryc May 26 '22 at 17:59

I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on a list of 466K English words to see the number of collisions for each:

Hash function      |     Collisions | Time (words) | Time (file)
=================================================================
CRC32              |    23 (0.005%) |      112 ms  |      38 ms
MurmurOAAT         |    26 (0.006%) |       86 ms  |      10 ms
FNV hash           |    32 (0.007%) |       87 ms  |       7 ms
Jenkins OAAT       |    36 (0.008%) |       90 ms  |       8 ms
DJB2 hash          |   344 (0.074%) |       87 ms  |       5 ms
K&R V2             |   356 (0.076%) |       86 ms  |       5 ms
Coffin             |   763 (0.164%) |       86 ms  |       4 ms
x17 hash           |  2242 (0.481%) |       87 ms  |       7 ms
-----------------------------------------------------------------
MurmurHash3_x86_32 |    19 (0.004%) |       90 ms  |       3 ms

I included times for both: hashing all words individually, and hashing the entire file of all English words once. I also included the more complex MurmurHash3_x86_32 in my test for reference.

Conclusion:

  • there is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64, for that matter) architecture, because it has many more collisions than similar functions (MurmurOAAT, FNV and Jenkins OAAT) while offering very similar throughput. Bernstein's DJB2 performs especially badly on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.

Test code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 2048
#define SEED    0x12345678

uint32_t DJB2_hash(const uint8_t *str)
{
    uint32_t hash = 5381;
    uint8_t c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}

uint32_t FNV(const void* key, int len, uint32_t h)
{
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    h ^= 2166136261UL;
    const uint8_t* data = (const uint8_t*)key;
    for(int i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
    // One-byte-at-a-time hash based on Murmur's mix
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    for (; *str; ++str) {
        h ^= *str;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const char *s)
{
    // Source: https://stackoverflow.com/a/45641002/5407270
    // a.k.a. Java String hashCode()
    uint32_t hashval = 0;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const char *str, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += str[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

uint32_t crc32b(const uint8_t *str) {
    // Source: https://stackoverflow.com/a/21001712
    unsigned int byte, crc, mask;
    int i = 0, j;
    crc = 0xFFFFFFFF;
    while (str[i] != 0) {
        byte = str[i];
        crc = crc ^ byte;
        for (j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
        }
        i = i + 1;
    }
    return ~crc;
}

static inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
    return x<<bits | x>>(32-bits);      // C idiom: will be optimized to a single operation
}

uint32_t Coffin_hash(char const *input) { 
    // Source: https://stackoverflow.com/a/7666668/5407270
    uint32_t result = 0x55555555;
    while (*input) { 
        result ^= *input++;
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t x17(const void * key, int len, uint32_t h)
{
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    const uint8_t * data = (const uint8_t*)key;
    for (int i = 0; i < len; ++i)
    {
        h = 17 * h + (data[i] - ' ');
    }
    return h ^ (h >> 16);
}

uint32_t apply_hash(int hash, const char* line)
{
    switch (hash) {
    case 1: return crc32b((const uint8_t*)line);
    case 2: return MurmurOAAT_32(line, SEED);
    case 3: return FNV(line, strlen(line), SEED);
    case 4: return Jenkins_one_at_a_time_hash(line, strlen(line));
    case 5: return DJB2_hash((const uint8_t*)line);
    case 6: return KR_v2_hash(line);
    case 7: return Coffin_hash(line);
    case 8: return x17(line, strlen(line), SEED);
    default: break;
    }
    return 0;
}

int main(int argc, char* argv[])
{
    // Read arguments (expects: <hash 1-8> <file>)
    if (argc < 3)
        return 1;
    const int hash_choice = atoi(argv[1]);
    char const* const fn = argv[2];

    // Read file
    FILE* f = fopen(fn, "r");
    if (!f)
        return 1;

    // Read file line by line, calculate hash
    char line[MAXLINE];
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';   // strip newline
        uint32_t hash = apply_hash(hash_choice, line);
        printf("%08x\n", hash);
    }
    fclose(f);

    return 0;
}

P.S. A more comprehensive review of the speed and quality of modern hash functions can be found in Reini Urban's (rurban) SMHasher repository. Note the "Quality problems" column in the table.

Andriy Makukha

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}
RushPL

djb2 is good

Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

One of the K&R hashes is terrible, one is probably pretty good:

  1. Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):
    unsigned long hash(unsigned char *str)
    {
        unsigned long hash = 0;
        int c;
    
        while (c = *str++)
            hash += c;
    
        return hash;
    }
    
  2. Probably a pretty decent hash algorithm, as presented in K&R 2nd edition (verified by me on pg. 144 of the book). NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulo sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return type and "hashval" unsigned long, or better yet uint32_t or uint64_t, instead of plain unsigned (int). This simple algorithm takes the order of the bytes in the string into account by doing this style of algorithm: hashvalue = new_byte + 31*hashvalue, for all bytes in the string:
    unsigned hash(char *s)
    {
        unsigned hashval;
    
        for (hashval = 0; *s != '\0'; s++)
            hashval = *s + 31*hashval;
        return hashval % HASHSIZE;
    }
    

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into consideration, so hash("ab") returns the same value as hash("ba"). This is not so with the 2nd edition hash, however, which (much better!) returns two different values for those strings.

The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

MurmurHash3 by Austin Appleby is best! It's an improvement over even the gcc C++11 std::unordered_map<> hash used above.

Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.

See also

  1. Other hash table algorithms to try out and test: http://www.cse.yorku.ca/~oz/hash.html. Hash algorithms mentioned there:
    1. djb2
    2. sdbm
    3. lose lose (K&R 1st edition)
Gabriel Staples

djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if available; it is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend using Murmur's one-byte-at-a-time version:

#include <stdint.h>

static inline uint32_t MurmurOAAT32(const char *key)
{
  uint32_t h = 3323198485ul;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
  uint64_t h = 525201411107845655ull;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e9955bd1e995ull;
    h ^= h >> 47;
  }
  return h;
}

The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, the optimal hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making your hash table smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2 - sqrt(2) = 0.58... apparently.
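
As a rough illustration of the sizing rule above (a sketch: the 8000-word count comes from the question, and rounding up to a power of two is an extra assumption so the index can later be taken with a bit mask):

#include <stdio.h>

/* Round up to the next power of two. */
static unsigned next_pow2(unsigned x)
{
    unsigned p = 1;
    while (p < x)
        p <<= 1;
    return p;
}

int main(void)
{
    unsigned expected_words = 8000;              /* from the question */
    unsigned target = 2 * expected_words;        /* ~2x rule of thumb from above */
    unsigned table_size = next_pow2(target);     /* 16000 -> 16384 */

    printf("expected = %u, target = %u, table size = %u\n",
           expected_words, target, table_size);
    return 0;
}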

Wolfgang Brehm

There are a number of existing hash table implementations for C, from the POSIX hcreate/hdestroy/hsearch functions to those in APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hash table or hash function; they've been optimized heavily for common use cases.

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
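
For reference, a minimal sketch of the POSIX hcreate/hsearch interface mentioned above (error handling omitted; the words and file offsets are made up):

#include <search.h>
#include <stdio.h>

int main(void)
{
    hcreate(20);   /* table sized well above the number of entries */

    char *words[] = { "apple", "banana", "cherry" };
    int offsets[] = { 0, 17, 42 };               /* hypothetical file positions */

    for (int i = 0; i < 3; i++) {
        ENTRY e = { .key = words[i], .data = &offsets[i] };
        hsearch(e, ENTER);                       /* insert key -> data */
    }

    ENTRY q = { .key = "banana" };
    ENTRY *found = hsearch(q, FIND);             /* look the key up again */
    if (found)
        printf("%s -> %d\n", found->key, *(int *)found->data);

    hdestroy();
    return 0;
}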

Nick Johnson
  • hsearch searches by comparing the strings or string ptr address? I think it is just checking the ptr address? I tried using different pointers but same string value. hsearch fails stating no elements found – Sandeep Jul 05 '16 at 13:45

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long, consisting of the same 64 chars in different orders, with 32-bit hash values. Codes from here.

Hash function    | collision rate | how many minutes to finish
==============================================================
MurmurHash3      |           6.?% |                      4m15s
Jenkins One..    |           6.1% |                      6m54s   
Bob, 1st in link |          6.16% |                      5m34s
SuperFastHash    |            10% |                      4m58s
bernstein        |            20% |       14s only finish 1/20
one_at_a_time    |          6.16% |                       7m5s
crc              |          6.16% |                      7m56s

One strange thing is that almost all of the hash functions have a 6% collision rate for my data.

Sean Bright
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – thewaywewere Jun 29 '17 at 09:22
  • Upvoted for a good table, but posting the source code for each of those hashes in your answer is essential too. Otherwise, the links may break and we are out of luck. – Gabriel Staples Aug 11 '17 at 17:57
  • The expected number of collisions should be 9.112499989700318E+7 or 0.103 * 960³ if the hashes were truly random, so I would not have been surprised if they were all around that value, but 0.0616 * 960³ seems a bit off, almost as if the hashes are more evenly distributed than what would be expected by chance, and at 64 bytes length this limit should definitely be approached. Can you share the set of strings that you hashed so I can try to reproduce it? – Wolfgang Brehm Jan 20 '20 at 22:47
  • Anyone know what `14s only finish 1/20` means? I don't understand that line. – Gabriel Staples Feb 22 '22 at 16:22

First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.

A hash function with a good reputation is MurmurHash3.

Finally, regarding the size of the hash table, it really depends on what kind of hash table you have in mind, especially whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
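
For intuition, here is a quick way to compute the expected number of collisions when n keys are hashed uniformly into m buckets (counting every key that lands in an already-occupied bucket); it matches the figure quoted in the comment below:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 130.0;   /* words, from the question */
    double m = 100.0;   /* buckets (modulo 100) */

    /* Expected number of occupied buckets: m * (1 - ((m-1)/m)^n).
       Every key beyond the first in a bucket counts as a collision. */
    double occupied   = m * (1.0 - pow((m - 1.0) / m, n));
    double collisions = n - occupied;

    printf("expected collisions: %.3f\n", collisions);   /* ~57.07 */
    return 0;
}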

Pascal Cuoq
  • The expected number of hash collisions is `n - m * (1 - ((m-1)/m)^n) = 57.075...`. 40 collisions is better than what could be expected by chance (46 to 70 at a p-score of 0.999). The hash function in question is more uniform than if it were random or we are witnessing a very rare event. – Wolfgang Brehm Jan 24 '20 at 10:54

I want to summarize it all for newbies to C like me. According to Andriy Makukha's careful measurements, MurmurHash3 performed best; here is the simple one-byte-at-a-time Murmur variant (MurmurOAAT) from that answer:

unsigned long hash(const char* str){
    unsigned int h = 0x12345678;
    for (; *str; ++str) {
        h ^= *str;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}
Peyman

One thing I've used with good results is the following (I don't know if it's mentioned already because I can't remember its name).

You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, it's computationally very cheap, and if you really run into a very bad instance with lots of collisions you can just repeat the whole thing using a fresh batch of random numbers.
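
A minimal sketch of the scheme described above (a 256-entry table of random words combined with XOR; the rand()-based seeding is only illustrative, any decent random source will do):

#include <stdint.h>
#include <stdlib.h>

static uint32_t T[256];                /* one random word per possible byte value */

void init_table(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < 256; i++)
        T[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t table_hash(const unsigned char *key)
{
    uint32_t h = 0;
    while (*key)
        h ^= T[*key++];                /* XOR the precomputed value for each byte */
    return h;
}

As the comment below points out, this plain XOR combination ignores character order, so permutations of the same characters hash to the same value.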

Michael Nett
  • If I'm not mistaken this suffers from the same problem as K&R 1st in Gabriel's answer; i.e. "ab" and "ba" will hash to the same value. – Johann Oskarsson Sep 15 '19 at 08:07