44

I can't use boost:hash because I have to stick with C and can't use C++.

But, I need to hash a large number (10K to 100k) of tokens strings (5 to 40 bytes length) so that search within those are fastest.

MD5, SHA1 or any long hash function seems too heavy for a simple task, I am not doing cryptography. Plus there is the storage and computing cost.

Therefore my question:

  1. What might be the simplest hash algorithm that will ensure collision prevention in most practical cases.

  2. How many bit to use for the hash value? I am developing for 32 bit systems. Does hash algorithm in Perl/Python use 32 bit hashes too? Or do I have to jump to 64?

  3. Regarding implementation of hash tables in common scripting languages: does the implementation check for collisions or can I avoid that part altogether?

CDR
  • 8,198
  • 11
  • 47
  • 46
  • 25
    The following page has several implementations of general purpose hash functions implemented in C (and many other languages): http://partow.net/programming/hashfunctions/index.html –  Oct 31 '10 at 23:06
  • Have you considered using GLib? https://developer.gnome.org/glib/2.46/glib-Hash-Tables.html – Bastien Léonard Apr 13 '09 at 14:08

6 Answers6

24

You can find a good (and fast) hash function, and an interesting read, at http://www.azillionmonkeys.com/qed/hash.html

The only time you should not check for collisions, is if you use a perfect hash -- a good old fashioned lookup table, like gperf.

gnud
  • 77,584
  • 5
  • 64
  • 78
  • 3
    I would suggest looking at one that Hsieh's analysis missed: MurmurHash2. http://en.wikipedia.org/wiki/MurmurHash – Steven Sudit Jul 10 '09 at 03:30
13
  1. Here is a nice overview of the most notable known hash functions.

  2. 32bits should work just fine.

  3. You always need to check for collisions, unless you want to write a funny hashtable :)

Sean Allred
  • 3,558
  • 3
  • 32
  • 71
arul
  • 13,998
  • 1
  • 57
  • 77
  • You don't need to check for collisions if you don't particularly care about which answer you get. Advantage is that you don't have to store the original key in the hash table so you can save a lot of space. – Zan Lynx Apr 13 '09 at 16:29
  • 2
    Well, such a non-deterministic behavior is what I meant by 'funny'. – arul Apr 13 '09 at 16:39
8

A general hash function for hash table lookup. It specifies Do NOT use for cryptographic purposes, but since you specified that you have no intent for that then you should be ok.

It Included is A Survey of Hash Functions to try out

TStamper
  • 30,098
  • 10
  • 66
  • 73
5

If you're on a posix alike system and sticking to plain C, I would simply use what the system already has to offer. man 3 hcreate offers you all details or you can find an online version here http://linux.die.net/man/3/hcreate

amo-ej1
  • 3,279
  • 26
  • 35
2

Try Adler32 for long strings or Murmur2 for short strings.

  • 3
    Adler32 is not a very good hash at all. In fact, it's even worse than CRC-32, as a hash. Murmur2, on the other hand, is a very fast hash with excellent distribution and worst-case behavior, so there's no reason to limit its use to short strings. I don't really understand the basis of your advice. – Steven Sudit Jul 10 '09 at 03:28
1

xxhash is quite fast and easy option. A simple code would use XXH32 function:

unsigned int XXH32 (const void* input, int len, unsigned int seed);

It is 32 bit hash. Since len is int, for larger data more than 2^31-1 bytes use these:

void*         XXH32_init   (unsigned int seed);
XXH_errorcode XXH32_update (void* state, const void* input, int len);
unsigned int  XXH32_digest (void* state);
Majid Azimi
  • 5,575
  • 13
  • 64
  • 113