
I have a hashtable that stores quadtree entries.
The hash function looks like this:

Quadtree hash

#define node_hash(a,b,c,d) \
  (((int)(d))+3*(((int)(c))+3*(((int)(b))+3*((int)(a))+3)))

Note that the result of this hash is always reduced modulo a prime number, like so:

h = node_hash(p->nw, p->ne, p->sw, p->se) ;
h %= hashprime ;
...

Comparison with optimal hash
Some statistical analysis shows that this hash is optimal in terms of collision avoidance.
For a hashtable with b buckets and n entries, the collision risk using a perfect hash is:
(n - b * (1 - power((b-1)/b, n))) * 100 / n
When n = b this means a collision risk of 37%.
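As a quick sanity check, here is a small C helper (illustrative only, not part of the program) that evaluates this expression; compile with -lm:

#include <math.h>
#include <stdio.h>

/* Expected collision percentage for a perfect (uniform) hash:
   with b buckets and n entries the expected number of occupied buckets
   is b * (1 - ((b-1)/b)^n); every entry beyond those collided. */
static double expected_collision_pct(double n, double b)
{
    double occupied = b * (1.0 - pow((b - 1.0) / b, n));
    return (n - occupied) * 100.0 / n;
}

int main(void)
{
    printf("%.1f%%\n", expected_collision_pct(64871.0, 64871.0));  /* ~36.8% when n == b */
    return 0;
}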

Some testing shows that the above hash lines up very nicely with the norm (for all fill levels of the hashtable).

Running time
The runtime is heavily dependent on the value of hashprime.

Timings (best out of 1000 runs) are:

hashprime   CPU-cycles per run
--------------------------------
 4049               56
16217               68
64871              127    <-- whoooh

Is there a way to improve on this, whilst still retaining the optimal collision risk?

Either by optimizing the modulus operation (replacing it with a multiplication using 'magic' numbers computed outside the loop),
or by replacing the hash function with some other hash function.

Background
The following assembly is produced:

//--------h = node_hash(p->nw, p->ne, p->sw, p->se) ;
mov eax,[rcx+node.nw]       <<+
lea eax,[eax+eax*2+3]         |
add eax,[rcx+node.ne]         |
lea eax,[eax+eax*2]           +-  takes +/- 12 cycles
add eax,[rcx+node.sw]         |
lea eax,[eax+eax*2]           |
add eax,[rcx+node.se]       <<+
//--------h %= hashprime ;
mov esi,[hashprime]
xor edx,edx
div esi
mov rax,rdx                    <<--- takes all the rest

[EDIT]
I may be able to do something with the fact that:

C = A % B is equivalent to C = A - B * (A / B)
Using the fact that integer division by a fixed divisor can be done as a multiplication by its fixed-point reciprocal,
the formula becomes C = A - B * (A * rB), where rB is the precomputed reciprocal of B.
Note that for integer division the reciprocals are magic numbers, see: http://www.hackersdelight.org/magic.htm
C code is here: http://web.archive.org/web/20070713211039/http://hackersdelight.org/HDcode/magic.c
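As an illustration of the idea (this is not the Hacker's Delight magic-number code, just a sketch of one fixed-point-reciprocal variant; it needs a compiler with unsigned __int128, e.g. GCC or Clang on x86-64):

#include <stdint.h>

/* Precomputed once whenever hashprime changes, reused for every hash. */
typedef struct { uint64_t M; uint32_t d; } recip32;

static recip32 recip_init(uint32_t d)            /* requires d > 1 */
{
    recip32 r;
    r.M = UINT64_MAX / d + 1;                    /* ceil(2^64 / d) */
    r.d = d;
    return r;
}

/* x % d without a div instruction: the low 64 bits of x*M hold the
   fractional part of x/d; multiplying by d and taking the high 64 bits
   turns that fraction back into the remainder. */
static uint32_t recip_mod(uint32_t x, recip32 r)
{
    uint64_t frac = r.M * x;
    return (uint32_t)(((unsigned __int128)frac * r.d) >> 64);
}

Usage would then be h = recip_mod((uint32_t)node_hash(p->nw, p->ne, p->sw, p->se), hp), with hp = recip_init(hashprime) computed at table-resize time.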

[FNV hashes]

See: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-1a

hash = offset_basis
for each byte to be hashed
 hash = hash xor octet_of_data
 hash = hash * FNV_prime (for 32 bits = 16777619)
return hash

For 4 pointers truncated to 32 bits (16 bytes), the FNV hash takes 27 cycles (hand-crafted assembly).
Unfortunately this leads to hash collisions of 81% where it should be 37%.
Running the full 15 multiplications takes 179 cycles.
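For reference, a plain-C version of FNV-1a over those 16 bytes (a sketch; the hand-crafted assembly referred to above is not shown here):

#include <stdint.h>

/* FNV-1a over the 16 bytes formed by four 32-bit values
   (the four pointers truncated to their low 32 bits). */
static uint32_t fnv1a_4x32(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    const uint32_t FNV_prime = 16777619u;
    uint32_t hash = 2166136261u;            /* 32-bit offset basis */
    uint32_t words[4] = { a, b, c, d };

    for (int w = 0; w < 4; w++)
        for (int byte = 0; byte < 4; byte++) {
            hash ^= (words[w] >> (8 * byte)) & 0xff;
            hash *= FNV_prime;
        }
    return hash;
}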

Johan
  • Are you sure that 40 extra clock cycles is actually a problem? – Ken Y-N Sep 27 '13 at 04:09
  • Yes, because this hash function gets executed 13 million times on small workloads and many orders of magnitude more on big workloads. It is far and away the most called routine, accounting for 20% of total running time. – Johan Sep 27 '13 at 04:18
  • What about using `unsigned int` instead of `int` and letting the values rollback. Would be equivalent to a `hashprime` value of `UINT_MAX` wouldn't it? – anthonyvd Sep 27 '13 at 04:28
  • Perhaps try asking the question on http://math.stackexchange.com instead?. – Ken Y-N Sep 27 '13 at 04:32
  • @pwny, hashprime is the size of the hashtable; having a hashtable of size `UINT_MAX` would be **bad** on a 64-bit system :-) Besides, hashprime needs to be a prime number in order to work efficiently (see the paragraph on `comparison with optimal hash`). – Johan Sep 27 '13 at 04:35
  • @Johan It was more of an option than anything else. My point was that letting an unsigned data type overflow might be more efficient than performing a modulo operation. It might be `unsigned short` or `unsigned char` but the fact that it's the size of the hashtable might limit this possibility. – anthonyvd Sep 27 '13 at 04:46
  • @pwny, unless you do a modulo with a **prime** number, the number of hash collisions will increase, defeating the purpose. – Johan Sep 27 '13 at 04:52
  • The `lea` idiom has more dependencies than d+c*3+b*9+a*27, which may be faster. Then again, what is the range of 'a,b,c,d'? Knowing this may help in modulo reduction. – Aki Suihkonen Sep 27 '13 at 04:56
  • @AkiSuihkonen a,b,c,d are pointers, ergo they run the full 4gb of memory (I only use the lower 32 bits) `(d+3*(c+3*(b+3*a+3)))` does not equal `d+c*3+b*9+a*27` besides that's only 12 cycles. – Johan Sep 27 '13 at 05:05
  • This looks ripe for some ILP-focused optimizations. – Cory Nelson Sep 27 '13 at 05:40
  • I suppose the other angle is do you really need to call it 13 million times? Can you optimise a higher-level algorithm? – Ken Y-N Sep 27 '13 at 06:48
  • The final `+3` isn't useful, considering you then do `%= prime`. That effectively rotates your buckets 3 to the end, but doesn't affect collisions. When `x % prime == y % prime`, then also `x+3 % prime == y+3 % prime`. – MSalters Sep 27 '13 at 07:17
  • The max value of the revised formula fits 38 bits; since 2^16 % 64871 == 665, it follows that the final modulo reduction equals ((temp >> 16)*665 + (temp & 0xffff)) % 64871, which can be evaluated with 32-bit integers, since the max value of the intermediate variable is 0xa640fd66. This may prove as efficient as using the much smaller modulo (see the sketch after these comments). – Aki Suihkonen Sep 27 '13 at 08:08
  • @MSalters, the final +3 is free because it's encoded in a `lea eax,[eax+edx+3]` – Johan Sep 27 '13 at 09:32
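
A hypothetical C rendering of Aki Suihkonen's suggestion above (the constants are specific to hashprime == 64871, and the function name is made up for illustration):

#include <stdint.h>

/* Reduce a hash value t that fits in ~38 bits modulo 64871 using a
   32-bit fold: 2^16 % 64871 == 665, so the high and low 16-bit halves
   can be folded together before the final 32-bit modulo. */
static uint32_t mod_64871_fold(uint64_t t)
{
    uint32_t folded = (uint32_t)(t >> 16) * 665u + (uint32_t)(t & 0xffff);
    return folded % 64871u;     /* folded fits in 32 bits (max ~0xa640fd66) */
}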

3 Answers


Replacing modulus by reciprocal multiplication
The main cycle eater in this hash function is the modulus operator.

If you replace this division with a multiplication by the reciprocal, the calculation is much faster.
Note that calculating the reciprocal involves 3 divides, so this should only be done when the reciprocal can be reused enough times.

OK, here's the code used: http://www.agner.org/optimize/asmlib.zip

From: http://www.agner.org/optimize/

// ;*************************  divfixedi64.asm  *********************************
// ; Author:           Agner Fog

//extern "C" void setdivisoru32(uint Buffer[2], uint d)
asm
  mov     r8d, edx               // x
  mov     r9, rcx                // Buffer
  dec     r8d                    // r8d = r8d or esi
  mov     ecx, -1                // value for bsr if r8d = 0
  bsr     ecx, r8d               // floor(log2(d-1))
  inc     r8d
  inc     ecx                    // L = ceil(log2(d))
  mov     edx, 1
  shl     rdx, cl                // 2^L (64 bit shift because cl may be 32)
  sub     edx, r8d
  xor     eax, eax
  div     r8d
  inc     eax
  mov     [r9], eax              // multiplier
  sub     ecx, 1
  setae   dl
  movzx   edx, dl                // shift1
  seta    al
  neg     al
  and     al,cl
  movzx   eax, al                // shift 2
  shl     eax, 8
  or      eax, edx
  mov     [r9+4], eax            // shift 1 and shift 2
  ret
end;

and the code for the modulus operation:

//extern "C" uint modFixedU32(uint Buffer[2], uint d)
asm
  mov     eax,  edx
  mov     r10d, edx                // x
  mov     r11d, edx                 // save x
  mul     dword [rcx]              // Buffer (i.e.: m')
  sub     r10d, edx                // x-t
  mov     ecx, [rcx+4]             // shift 1 and shift 2
  shr     r10d, cl
  lea     eax, [r10+rdx]
  mov     cl,  ch
  shr     eax, cl
  // Result:= x - m * fastDiv32.dividefixedu32(Buffer, x);
  mul     r8d                    // m * ...
  sub     r11d, eax              // x - (m  * ...)
  mov     eax,r11d
  ret
end;
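
A sketch of how these routines might be wired into the hash path. The prototypes below are assumptions inferred from the register usage in the assembly (the routine reads the divisor from the third argument register, r8d), so treat them as illustrative:

typedef unsigned int uint;

// Precompute multiplier + shift counts for divisor d (done at table-resize time).
extern void setdivisoru32(uint Buffer[2], uint d);
// Compute x % d using the precomputed Buffer (no div instruction at run time).
extern uint modFixedU32(uint Buffer[2], uint x, uint d);

static uint divisor_buf[2];
static uint hashprime;

void set_hashprime(uint hp)
{
    hashprime = hp;
    setdivisoru32(divisor_buf, hp);    /* the divides happen once, here */
}

/* In the lookup path, instead of  h %= hashprime;  use: */
/* h = modFixedU32(divisor_buf, h, hashprime); */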

The difference in time is as follows:

hashprime   classic hash (mod)   new hash (mul)   new (no cache)   old (no cache)
              cycles/run           cycles/run       cycles/run       cycles/run
----------------------------------------------------------------------------------
 4049               56                 21               16.6             51
16217               68            not measured
64871              127                 89               16.5             50

Cache issues
The increase in cycle time is caused by the data overflowing the cache, causing main memory to be accessed.
This can be seen clearly when I remove cache effects by hashing the same value over and over.

Johan

Something like this might be useful:

static inline unsigned int hash4(unsigned int a, unsigned int b,
    unsigned int c, unsigned int d) {
  // Multiply each input by an odd constant, xor the 64-bit products
  // together, and keep the high 32 bits of the result.
  unsigned long long foo = 123456789*(long long)a ^ 243956871*(long long)b
                         ^ 918273645*(long long)c ^ 347562981*(long long)d;
  return (unsigned int)(foo >> 32);
}

Replace the four odd numbers I typed in with randomly generated 64-bit odd numbers; the ones above won't work that great. (64-bit so that the high 32 bits are somehow a random mix of the lower bits.) This is about as fast as the code you gave, but it lets you use power-of-two table sizes instead of prime table sizes without fear.
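For instance, reducing the result to a bucket index then becomes a mask instead of a division (an illustrative sketch, assuming the node struct from the question with pointer members nw/ne/sw/se):

#include <stdint.h>

/* table_size must be a power of two for the mask to be a valid reduction. */
static unsigned int bucket_index(const node *p, unsigned int table_size)
{
    unsigned int h = hash4((unsigned int)(uintptr_t)p->nw,
                           (unsigned int)(uintptr_t)p->ne,
                           (unsigned int)(uintptr_t)p->sw,
                           (unsigned int)(uintptr_t)p->se);
    return h & (table_size - 1);
}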

The thing everyone uses for similar workloads is the FNV hash. I'm not sure whether FNV actually has better properties than hashes of the type above, but it's similarly fast and it's in rather widespread use.

tmyklebu
  • I've got FNV down to 27 cycles (with a bit of cheating). – Johan Sep 27 '13 at 07:25
  • Nope, cheating breaks it, back to 179 cycles. – Johan Sep 27 '13 at 08:07
  • Try the thing I wrote, just use different (and bigger) numbers. Bear in mind that, once your table stops fitting in L1 cache, much of the time is going to be spent on L1 misses. (If you plot performance against table size, you'll see four or five distinct regions where performance is easily modelled, and sharp break points in between where things get dramatically worse. The sharp break points occur when you start incurring L1 misses, L2 misses, TLB misses, etc.) – tmyklebu Sep 27 '13 at 16:04

Assuming hashprime is a constant, you could implement the modulo operation as bitwise masks. I'm not sure about the details, but maybe this answer can push you in the right direction.
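
A related sketch (my illustration, not taken from the linked answer): if hashprime can be fixed at compile time, the compiler itself strength-reduces the % by a constant into a multiply-and-shift sequence, so no div instruction is emitted:

/* With a compile-time-constant modulus, GCC/Clang/MSVC replace the
   division with a reciprocal multiplication automatically. */
enum { HASHPRIME = 64871 };

static unsigned int reduce_const(unsigned int h)
{
    return h % HASHPRIME;
}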

larsmoa
  • Typically, `hashprime` will evolve as the table grows since it represents the number of buckets in the table. If you control the table, you control the list of `hashprime` and *can* change the operations effected as well. – Matthieu M. Sep 27 '13 at 07:42
  • @MatthieuM.: Yupp, that's true. I was thinking if the performance was very important and a ballpark estimate of the number of elements in the table the size could be fixed. – larsmoa Sep 27 '13 at 07:48