
I need to generate hash values for about N = 100 million keys. From my study it appears that murmur3 (MurmurHash3_x86_32, see murmur3 hash) would be the fastest hash function, with the best latency and a small enough collision rate. The problem I am facing is that the function returns key as void *. More specifically, the declaration is:

void MurmurHash3_x86_32 (const void *key, int len, uint32_t seed, void *out);

As my hash table size would be smaller than the largest hash it can generate, I need to map the hash into the table range [0, N-1]. The easiest solution seems to be to use the % operator. But as it is known to be a slow operator, I wonder if there is a quicker way to solve the problem.

One interesting suggestion I found was given in "Is there an alternative to using % (modulus) in C/C++?" on Stack Overflow itself. It suggests that if n is 'a power of two, the following works (assuming twos complement representation)':

return i & (n-1);

My issue with this is that on newer CPUs, performance sometimes (or is it most of the time?) degrades around sizes of 2^n, IIRC, due to N-way set-associative caches. (This link provides an illustration regarding insertions: Big Memory, Part 3.5: Google sparsehash!)

At the moment, the advantages of murmur3 seem to be nullified by hardware-related issues and the known inefficiency of the % operator. As performance is a constraint, I am looking for low-latency, faster solutions to my requirement, even if they do not use MurmurHash3_x86_32.

– Quiescent
  • Don't avoid the `%` operator just because it is slow. If you are trying to take a 128-bit number and fit it into an arbitrarily sized space, you are going to have a hard time doing better. Also, can you give a better example of performance degradation around 2^n? – Guvante Jun 18 '15 at 20:09
  • Have you actually checked (profiled) that the cost of `%` is high compared to the cost of the murmur3 hashing? – Mark B Jun 18 '15 at 20:10
  • I think you misinterpreted the performance degradations in "Big Memory, Part 3.5". They are there because of the logarithmic growth of the table. If you preallocate the whole table you won't see this. – Daniel Jun 18 '15 at 20:21
  • See http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/ for details of the cache-related observation. I have not done the benchmarking, taking the performance of % as a given. – Quiescent Jun 18 '15 at 20:28
  • The performance of one `%` operation is immaterial compared to the cost of computing one hash value on which to operate. – John Bollinger Jun 18 '15 at 20:32
  • Agreed, it sounds like a memory allocation issue rather than the price of logic operators. – Jens Munk Jun 18 '15 at 20:32

1 Answer


The problem I am facing is that the function returns key as void *.

It does not. It returns nothing (void). The hash result is written into the buffer you supply, via the pointer passed as the last argument. For MurmurHash3_x86_32(), it makes the most sense for that to be a pointer to a uint32_t.
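For illustration, here is a minimal sketch of a wrapper around that calling convention (my own, assuming the MurmurHash3.h header from the reference implementation is available):

#include <cstdint>
#include "MurmurHash3.h"  // reference implementation header (assumed available)

uint32_t hash32(const void *key, int len, uint32_t seed)
{
    uint32_t out;                              // 32-bit output buffer
    MurmurHash3_x86_32(key, len, seed, &out);  // result is written through the last argument
    return out;
}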

As my hash table size would be smaller than the largest hash it can generate, I need to map the hash into the table range [0, N-1]. The easiest solution seems to be to use the % operator. But as it is known to be a slow operator, I wonder if there is a quicker way to solve the problem.

The % is not only the easiest solution, but the most usual one. "Slow" is relative -- % is slower than +, but much, much faster than one call to MurmurHash3_x86_32().
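To put rough numbers behind that claim, here is a crude timing sketch of my own (C++14; not a rigorous benchmark, and it again assumes the reference MurmurHash3.h is available): one % per key versus one full hash per key.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include "MurmurHash3.h"  // reference implementation header (assumed available)

int main()
{
    volatile uint32_t table_size = 100000003u;  // volatile, so the division is not folded away
    volatile uint32_t sink = 0;                 // volatile, so the loops are not elided
    const uint32_t iters = 10000000u;

    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < iters; ++i)
        sink = sink + (i % table_size);         // the "slow" operator, once per key
    auto t1 = std::chrono::steady_clock::now();

    for (uint32_t i = 0; i < iters; ++i) {
        uint32_t h;
        MurmurHash3_x86_32(&i, sizeof i, 42u, &h);  // one full hash per key
        sink = sink + h;
    }
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("%% loop: %lld us, hash loop: %lld us\n",
                (long long)us(t0, t1), (long long)us(t1, t2));
    return 0;
}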

One interesting suggestion I found [...] suggests [using a power-of-two table size, and computing the modulus via the & operator]

Note that, contrary to the assertion in the SO answer, this in fact has no dependency at all on two's complement representation: hash values are unsigned, and for unsigned operands the identity h & (n - 1) == h % n (with n a power of two) holds on every conforming implementation.
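If in doubt, the equivalence for unsigned values is easy to check directly (a self-check of my own, not part of the linked answer):

#include <cassert>
#include <cstdint>

int main()
{
    const uint32_t n = 1u << 20;              // a power-of-two table size
    for (uint32_t h = 0; h < 5000000u; ++h)
        assert((h & (n - 1)) == (h % n));     // holds for any unsigned h, on any hardware
    return 0;
}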

My issue with this is that on newer CPUs, performance sometimes (or is it most of the time?) degrades around sizes of 2^n, IIRC, due to N-way set-associative caches. (This link provides an illustration regarding insertions: Big Memory, Part 3.5: Google sparsehash!)

The performance degradation described in the report you linked is attributed to re-hashing, which seems quite plausible. That has nothing to do with the operations you're asking about. It is conceivable that limited cache associativity could impact performance for large hash tables, but probably not more so than having large hash tables does generally. The memory access patterns inherent in the use of a hash table naturally produce poor cache locality. That's practically the point.

At the moment, the advantages of murmur3 seem to be nullified by hardware-related issues and the known inefficiency of the % operator. As performance is a constraint, I am looking for low-latency, faster solutions to my requirement, even if they do not use MurmurHash3_x86_32.

You are way overthinking this. Lack of effective use of your CPU cache is simply a price you pay for using a large hash table. It is not associated with the hash function (as long as the hash function does its job well). The cost of a single arithmetic operation, whether it be % or &, will not be noticeable compared to the cost of computing a hash to operate on, so it hardly matters which of those you choose. If you want a tiny advantage for that operation then use a power-of-two sized table and the & operator. On the other hand, that throws away some of the hash bits that you went to so much trouble to compute. Consider instead choosing a prime hash table size and the % operator -- then all the hash bits will contribute to bucket selection, which may improve your spread.
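To make that concrete, here is a closing sketch of my own (the next_prime helper is hypothetical, a stand-in for any prime-finding routine, and only runs once to size the table):

#include <cstdint>

// Hypothetical helper: trial-division search for the smallest prime >= n.
// Fine for a one-time table-size computation; not meant to be fast.
static bool is_prime(uint64_t n)
{
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

static uint64_t next_prime(uint64_t n)
{
    while (!is_prime(n)) ++n;
    return n;
}

// With a prime table size, every bit of the 32-bit hash influences the
// bucket index, rather than only the low-order bits.
static uint32_t bucket(uint32_t hash, uint32_t table_size)
{
    return hash % table_size;
}

// Usage sketch: for ~100 million keys, table_size = next_prime(100000000).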

– John Bollinger
  • (a) Yes, I made a mistake about the return type. Sorry! (b) I am now convinced that '%' may not make much difference. (c) The performance reduction is not due to rehashing, but due to the N-way associative cache. "Each memory chunk can be stored in any one of N particular slots in the cache. As an example, in a 16-way cache, each memory chunk can be stored in 16 different cache slots. Commonly, chunks with indices with the same lowest order bits will all share 16 slots." (http://igoro.com/archive/gallery-of-processor-cache-effects/) Also in other places, such as Scott Meyers' talk on CPU caches. – Quiescent Jun 20 '15 at 10:34
  • Yes, an n-way associative CPU cache is subject to a higher eviction rate under some access patterns than is a fully-associative one. This hasn't much to do with the size of your hash table, however, especially if the table is large, and it is probably not responsible for the performance dips in the `sparsehash` article you linked in your question (the article itself attributes it to rehashing). Use of a hash table typically produces essentially random memory access patterns, which is bad for cache locality, but for that reason also minimizes differences from different cache associativities. – John Bollinger Jun 20 '15 at 14:15