Have a good hash function for a C++ hash table?

Question

I am in need of a performance-oriented hash function implementation in C++ for a hash table that I will be coding. I looked around already and only found questions asking what's a good hash function "in general". I've considered CRC32 (but where to find good implementation?) and a few cryptography algorithms. My table, though, has very specific requirements.

Here's what the table will be like:

100,000 items max
200,000 capacity (so the load is 0.5)
hashing a 6-character string which is a part of English sentence
     examples: "become"    "and he"    ", not "

The number one priority of my hash table is quick search (retrieval). Quick insertion is not important, but it will come along with quick search. Deletion is not important, and re-hashing is not something I'll be looking into. To handle collisions, I'll be probably using separate chaining as described here. I have already looked at this article, but would like an opinion of those who have handled such task before.

If you are desperate, why haven't you put a rep bounty on this? — jmucchiello, Mar 10 '09 at 03:56
rep bounty: i'd put it if nobody was willing offer useful suggestions, but i am pleasantly surprised :) — DV., Mar 10 '09 at 04:00
Anyways an issue with bounties is you can't place bounties until 2 days have passed — Robert Gould, Mar 10 '09 at 04:32
Have you considered using one or more of the following general purpose hash functions: http://www.partow.net/programming/hashfunctions/index.html they are extremely fast and efficient. — , Jan 23 '11 at 10:10
Are you _sure_ that the performance of the hash function is critical? Popular functions are to rotate an `unsigned int` by, say 3 bits and add in the next byte, then reduce modulo a prime, that one works OK for text. Is the input somehow under the (partial) control of potentially malicious parties (they could give you 100,000 strings hashing to a few buckets...)? Then you need to make it hard to cook up data for such an attack, perhaps by "salting" (start with a secret random value for each table) and some cryptographic hash thrown in (but for short strings that might not be very effective). — vonbrand, Jan 21 '13 at 13:12

Robert Gould · Accepted Answer · 2009-03-10T04:23:42.690

24

Now assumming you want a hash, and want something blazing fast that would work in your case, because your strings are just 6 chars long you could use this magic:

size_t precision = 2; //change the precision with this
size_t hash(const char* str)
{
   return (*(size_t*)str)>> precision;
}

CRC is for slowpokes ;)

Explanation: This works by casting the contents of the string pointer to "look like" a size_t (int32 or int64 based on the optimal match for your hardware). So the contents of the string are interpreted as a raw number, no worries about characters anymore, and you then bit-shift this the precision needed (you tweak this number to the best performance, I've found 2 works well for hashing strings in set of a few thousands).

Also the really neat part is any decent compiler on modern hardware will hash a string like this in 1 assembly instruction, hard to beat that ;)

edited Mar 10 '09 at 04:23

answered Mar 10 '09 at 03:37

Robert Gould

68,773
61
187
272

Wow.. could you elaborate what "(*(size_t*)str)>> precision" does? It seems to do some weird pointer cast magic that I can't comprehend. And, "precision" is the number of digits in resulting index? – DV. Mar 10 '09 at 04:05
Yes precision is the number of binary digits – Robert Gould Mar 10 '09 at 04:25
ZOMG ZOMG thanks!!! I'm implementing a hash table with this hash function and the binary tree that you've outlined in other answer. – DV. Mar 10 '09 at 04:30
I recall that shifting left is dividing, while shifting right is multiplying. So, in effect, it's like dividing by 2*precision... – DV. Mar 10 '09 at 04:32
Yup, exactly you are dividing by factors of 2, in my case I divided the number by 4 (2x2) – Robert Gould Mar 10 '09 at 04:33
14

Note that this won't work as written on 64-bit hardware, since the cast will end up using str[6] and str[7], which aren't part of the string. Also, on 32-bit hardware, you're only using the first four characters in the string, so you may get a lot of collisions. – David Norman Mar 10 '09 at 05:14
2

I don't see how this is a good algorithm. The hash output increases very linearly. There's no avalanche effect at all... – Unknown Apr 29 '09 at 03:09
but in practice it's good enough, and I've used it with several thousand strings and it outperforms more traditional hashes. Algorithms aren't everything in practice. A suboptimal but hardware friendly solution can out perform a theoretically better algorithm most of the time, that's why this is a C/C++ solution not a solution for some scripting language for example where we are abstracted away from the hardware. – Robert Gould Apr 29 '09 at 04:34
"Precision" is a complete misnomer... the value serves to discard less significant bits in some of the characters (exactly which ones depends on endianness) - simply reducing the number of distinct values the hash could take. Using `std::uint32_t` and `std::uint16_t` you could safely access all 6 bytes of data (in your implementation, if `size_t` is 32 bits the last two characters aren't hashed, if 64 then there's a memory buffer read overrun which could crash), then probably XOR them given speed's the priority. As the hash is weak, using a prime number of buckets would help here. – Tony Delroy Aug 01 '12 at 01:01
Almost all processor including embedded ones have hardware acceleration for CRC. – Ajinkya Kamat Jul 10 '23 at 22:57

George V. Reilly · Answer 2 · 2015-12-02T20:53:49.923

14

This simple polynomial works surprisingly well. I got it from Paul Larson of Microsoft Research who studied a wide variety of hash functions and hash multipliers.

unsigned hash(const char* s, unsigned salt)
{
    unsigned h = salt;
    while (*s)
        h = h * 101 + (unsigned) *s++;
    return h;
}

salt should be initialized to some randomly chosen value before the hashtable is created to defend against hash table attacks. If this isn't an issue for you, just use 0.

The size of the table is important too, to minimize collisions. Sounds like yours is fine.

edited Dec 02 '15 at 20:53

answered Mar 10 '09 at 06:36

George V. Reilly

15,885
7
43
38

2

And if you can guarentee that your strings are always 6 chars long without exception then you could try unrolling the loop. – Jackson Mar 10 '09 at 09:30
1

(unsigned char*) should be (unsigned char) I assume. – sgraham Jul 07 '13 at 04:21
sgraham: I changed the cast to `(unsigned)` in the loop. – George V. Reilly May 13 '14 at 20:28
The hash table attacks link is broken now. Has it moved ? – Vincent Fourmond Dec 02 '15 at 08:01
Thanks, Vincent. I've updated the link to my post. I've also updated the post itself which contained broken links. – George V. Reilly Dec 02 '15 at 20:54
If I am using rehashing also, what happens when the value of 'h' becomes greater than current capacity of the hash table? – Vishal Chugh Oct 08 '18 at 13:56
Typically the value of `h` is much greater than the size of the hash table. For example, with the shown hash function and a four-character string, h > 1E8 (100^4). Regardless of rehashing, the usual strategy is to calculate `h % bucket_count` to compute the bucket index. – George V. Reilly Oct 10 '18 at 00:23

Ferruccio · Answer 3 · 2009-03-10T03:36:14.660

6

Boost.Functional/Hash might be of use to you. I've not tried it, so I can't vouch for its performance.

Boost also has a CRC library.

I would look a Boost.Unordered first (i.e. boost::unordered_map<>). It uses hash maps instead of binary trees for containers.

I believe some STL implementations have a hash_map<> container in the stdext namespace.

edited Mar 10 '09 at 03:36

answered Mar 10 '09 at 03:10

Ferruccio

98,941
38
226
299

score 4 · Answer 4 · answered Mar 10 '09 at 03:17

The size of your table will dictate what size hash you should use. You would like to minimize collisions of course. I'm not sure what you are specifying by max items and capacity (they seem like the same thing to me) In any case either of those numbers suggest that a 32 bit hash would be sufficient. You might get away with CRC16 (~65,000 possibilities) but you would probably have a lot of collisions to deal with. On the other hand, a collision may be quicker to deal with than than a CRC32 hash.

I would say, go with CRC32. You'll find no shortage of documentation and sample code. Since you have your maximums figured out and speed is a priority, go with an array of pointers. Use the hash to generate an index. On collision, increment index until you hit an empty bucket.. quick and simple.

score 4 · Answer 5 · answered Mar 10 '09 at 03:53

Since you store english words, most of your characters will be letters and there won't be much variation in the most significant two bits of your data. Besides of that I would keep it very simple, just using XOR. After all you're not looking for cryptographic strength but just for a reasonably even distribution. Something along these lines:

size_t hash(const std::string &data) {
  size_t h(0);
  for (int i=0; i<data.length(); i++)
    h = (h << 6) ^ (h >> 26) ^ data[i];
  }
  return h;
}

Besides of that, have you looked at std::tr1::hash as a hashing function and/or std::tr1::unordered_map as an implementation of a hash table? Using these would probably be save much work opposed to implementing your own classes.

thanks for suggestions! could you elaborate what does "h = (h << 6) ^ (h >> 26) ^ data[i];" do? as far as using c++ libraries, i won't be able to since this is a class exercise... — DV., Mar 10 '09 at 04:03
The ^ is the C++ operator for XOR, << and >> are bit shifts left and right to "mix it up" a little... — sth, Mar 10 '09 at 04:17

Robert Gould · Answer 6 · 2009-03-10T03:33:01.550

2

If you need to search short strings and insertion is not an issue, maybe you could use a B-tree, or a 2-3 tree, you don't gain much by hashing in your case.

The way you would do this is by placing a letter in each node so you first check for the node "a", then you check "a"'s children for "p", and it's children for "p", and then "l" and then "e". In situations where you have "apple" and "apply" you need to seek to the last node, (since the only difference is in the last "e" and "y")

But but in most cases you'll be able to get the word after a just a few steps ("xylophone" => "x"->"ylophone"), so you can optimize like this. This can be faster than hashing

edited Mar 10 '09 at 03:33

answered Mar 10 '09 at 03:12

Robert Gould

68,773
61
187
272

Elaborate on how to make B-tree with 6-char string as a key? Thanks! – DV. Mar 10 '09 at 03:14
One more thing, how will it decide that after "x" the "ylophone" is the only child so it will retrieve it in two steps?? – DV. Mar 10 '09 at 03:43
E.g., my struct is { char* data; char link{'A', 'B', .., 'a', 'b', ' ', ..}; } and it will test root for whether (node->link['x'] != NULL) to get to the possible words starting with "x". – DV. Mar 10 '09 at 03:46
When you insert data you need to "sort" it in. Lookup about heaps and priority queues. – Robert Gould Mar 10 '09 at 03:59

score 2 · Answer 7 · answered Mar 10 '09 at 03:25

The number one priority of my hash table is quick search (retrieval).

Well then you are using the right data structure, as searching in a hash table is O(1)! :)

The CRC32 should do fine. The implementation isn't that complex, it's mainly based on XORs. Just make sure it uses a good polynomial.

score 2 · Answer 8 · answered Mar 10 '09 at 03:27

How about something simple:

// Initialize hash lookup so that it maps the characters
// in your string to integers between 0 and 31
int hashLookup[256];

// Hash function for six character strings.
int hash(const char *str)
{
    int ret = 0, mult = 1;
    for (const char *p = str; *p; *p++, mult *= 32) {
        assert(*p >= 0 && *p < 256);
        ret += mult * hashLookup[*p];
    }

    return ret;
}

This assumes 32 bit ints. It uses 5 bits per character, so the hash value only has 30 bits in it. You could fix this, perhaps, by generating six bits for the first one or two characters. If you character set is small enough, you might not need more than 30 bits.

Raedwald · Answer 9 · 2022-02-12T10:08:54.647

1

Since C++11, C++ has provided a std::hash< string >( string ). That is likely to be an efficient hashing function that provides a good distribution of hash-codes for most strings.

Furthermore, if you are thinking of implementing a hash-table, you should now be considering using a C++ std::unordered_map instead.

edited Feb 12 '22 at 10:08

answered Dec 05 '17 at 15:09

Raedwald

46,613
43
151
237

No its not. It only converts the string as unsigned long. Try for yourself what std::hash(42) returns. Dont ask me, why they provide such useless functionality. – Jay-Pi Dec 28 '21 at 00:59
@Jay-Pi `std::hash(int)` and `std::hash(string)` are different functions. – Raedwald Dec 28 '21 at 08:07
Compare std::hash with stoul(l) and you will see that they provide the same stuff. Its unclear to me, why this is not documented in the reference. Because the function not hash, it just creates "a hashable datatype". However I dont undertand how hiding this behind templates makes it simpler to use. – Jay-Pi Dec 28 '21 at 22:02

Have a good hash function for a C++ hash table?

9 Answers9

Linked