Hash function for 3d integer coordinates

Question

Given a 3D uniform grid, the empty cells (those that don't overlap with any object) don't need to be stored, which saves memory in large models. I am using a Dictionary in C# for this purpose. Performance has already decreased, but this is still better than getting an exception when the full 3D grid is created. My problem now is to find a fast hash function that maps a 3D integer coordinate of the grid to a unique number.

I have already tried `(x * 73856093 + y * 19349669 + z * 83492791) % n`, which doesn't always generate a unique number.

If the number can be any size, then you can just do x * MAXINT * MAXINT + y * MAXINT + z. If the number has to be the same size as an integer, then uniqueness is impossible due to the [pigeonhole principle](http://en.wikipedia.org/wiki/Pigeonhole_principle). there are 2^32 possible integer values, and 2^96 possible integer triplet values. You can't fit the latter category into the former without overlap. — Kevin, Sep 03 '14 at 16:58
What are the possible values of the x, y, and z coordinates? — President James K. Polk, Sep 03 '14 at 23:15
@Kevin, I think the max triplet values would be 2^48 as the dimensions are unsigned integer. — ali, Sep 04 '14 at 08:34
@GregS, possible values are (x=826,y=1013,z=275). I don't expect the dimensions to get larger than 5000 or even 2500 in each axis. — ali, Sep 04 '14 at 08:36
Hash values don't need to be unique at all. A collision occurs when 2 hash values are the same for different input values. You only need to keep the collisions to a minimum (to improve performance) which is usually easy, e.g. `hash = (x * 31) + (y * 37) + (z * 41)` is more than adequate. — Floris, Sep 04 '14 at 14:43
A better hash (without collisions in your case) would be `hash = (x * 18397) + (y * 20483) + (z * 29303)` assuming your hash value can be as high as 2^48. — Floris, Sep 04 '14 at 14:51
@Floris, I might be wrong which is highly possible. But does not "2 hash values are the same for different input values" mean that those two hash values are not unique(distinct) among the hash set values? I am using what you have suggested. Only instead these three 73856093, 19349669, 83492791 prime numbers. Does using a larger prime number only decrease the chance of collision and hasn't got anything to do with speed? in terms of performance, how can one measure the formula? is it only through testing? — ali, Sep 04 '14 at 15:46
@Floris: Avoiding collisions would only help if the hash map itself had the same size, i.e. 2^48. Otherwise you have a mod step somewhere in there, so you only deferred the collisions. `hash=x*16777216+y*4096+z` would avoid collisions as well, using only 36 bits (12 for each coordinate). It could be implemented using bit shifts. *But* it would perform *terrible* if reduced modulo some smaller power of two, which is likely to happen in a real hash map. — MvG, Sep 05 '14 at 09:53
True, as explained in my answer the hash given in my comment is not good. I would be surprised though if it performs poorly since my multiplication factors are prime, bottom line though: don't use it. Also your deliberately bad hash function proves a good point. — Floris, Sep 05 '14 at 10:47

score 5 · Accepted Answer · answered Sep 05 '14 at 09:48

On the one hand you state your aim as “save memory”, while on the other hand you ask for “a fast hash function that maps a 3d integer coordinate of the grid to a unique number”. These two are not very compatible.

Either you want to guarantee O(1) access. In that case you have to prevent hash collisions and must map input to unique numbers. But in that case you also need as many cells in your hash map as there are possible inputs. So you would gain no memory saving over a simple N×N×N array.

Or – and this is far more likely – you only want hash collisions to be rare. Then you can have a hash map which is about twice the number of actually stored objects. But in this case, you don't have to completely avoid hash collisions, you only have to make them sufficiently rare.
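To make the two options concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea applies to a C# Dictionary). The bound `N`, the helper names and the sample cell contents are assumptions for illustration, not part of the question.

```python
# Illustrative sketch only: N, dense_index, occupied and cell_contents
# are hypothetical names, not taken from the question.
N = 5000  # assumed upper bound per axis, based on the asker's estimate


def dense_index(x, y, z, n=N):
    """Collision-free (unique) index into a flat n*n*n array."""
    return (x * n + y) * n + z


# Option 1: every coordinate gets its own slot, so the array needs n**3 slots.
slots_needed = N ** 3

# Option 2: a hash map holds only the occupied cells; collisions are allowed
# internally and resolved by the map itself.
occupied = {}
occupied[(826, 1013, 275)] = ["triangle A"]
occupied[(826, 1013, 276)] = ["triangle A", "triangle B"]


def cell_contents(x, y, z):
    """Return the objects stored in a cell, or an empty list for an empty cell."""
    return occupied.get((x, y, z), [])
```

With option 1 the index is unique but the storage grows with the full grid volume; with option 2 the storage grows only with the number of occupied cells.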

Choosing a good hash function depends a lot on the likely patterns of your input data. If the input is fairly random and you know the size of your hash map, you should aim for a uniform distribution. If objects are more likely to be located in adjacent blocks, then you want to make sure that small changes in coordinates are unlikely to result in a collision. This is the point where it helps to make your factors primes, so that a small change in one direction is less likely to collide with a small change in another direction.

If in doubt, you can always test things: Given three prime numbers (e.g. for the hash 137x+149y+163z) and some real-world setups (i.e. used coordinates and resulting hash map size), you can simply apply the hash to all coordinates, mod down to the hash map size and count the number of unique values. Do the same for various triples and choose the one which maximizes that number. But I doubt that level of optimization is really worth the effort.
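The test described above could be sketched as follows. The function names, the sample coordinates and the table size are assumptions for illustration; real measurements should use the coordinates and map size of an actual model.

```python
def count_distinct_buckets(coords, factors, table_size):
    """Hash every coordinate, reduce modulo the table size, count distinct results."""
    a, b, c = factors
    return len({(a * x + b * y + c * z) % table_size for x, y, z in coords})


def best_factors(coords, candidates, table_size):
    """Pick the candidate triple that yields the most distinct buckets."""
    return max(candidates,
               key=lambda f: count_distinct_buckets(coords, f, table_size))


# Hypothetical sample: a 10 x 10 x 10 block of adjacent occupied cells.
coords = [(x, y, z) for x in range(10) for y in range(10) for z in range(10)]
table_size = 2048  # assumed: roughly twice the number of stored cells

candidates = [
    (137, 149, 163),                  # triple from the answer
    (31, 37, 41),                     # triple from the comments
    (73856093, 19349669, 83492791),   # triple from the question
]

for factors in candidates:
    print(factors, count_distinct_buckets(coords, factors, table_size))
print("best:", best_factors(coords, candidates, table_size))
```

The closer the count is to the number of stored coordinates, the fewer collisions that triple produces for this particular data set and table size.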

"you want to make sure that small changes in coordinates are unlikely to result in a collision". I assume this is only true when the x,y,z coordinates have precision. — ali, Sep 05 '14 at 13:27

score 3 · Answer 2 · answered Sep 05 '14 at 08:08

Rather than trying to write a new article on an already well-covered topic, see the Wikipedia article on hash functions. In particular, the first image clearly shows how multiple inputs are hashed to the same value.

Basically, your triplet is hashed to some hash value in the range [0,2^64 - 1] (duplicates allowed!). Then the range is reduced to something slightly larger than your number of input values (say n=5) via the equation hash = hash % n. The resulting relation for input values of say [(1,1,1), (1,2,3), (2321, 322, 232), (3,3,3)] will then look something like this:

    (1,1,1)          -> 2
    (1,2,3)          -> 0
    (2321, 322, 232) -> 0
    (3,3,3)          -> 3

As you can see no input value relates (i.e. hashes) to 1 or 4 and there are two input values hashing to 0.
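The same effect can be reproduced with a small sketch. The hash below uses the multipliers 31, 37 and 41 from the comments purely as an assumption, so the bucket numbers differ from the illustrative table above; the helper names are hypothetical.

```python
from collections import defaultdict

MASK64 = (1 << 64) - 1  # keep the raw hash in the range [0, 2^64 - 1]


def raw_hash(x, y, z):
    """Assumed example hash; any reasonable hash would do here."""
    return (x * 31 + y * 37 + z * 41) & MASK64


def bucket_of(triplet, n):
    """Reduce the raw hash to a bucket index in [0, n - 1]."""
    return raw_hash(*triplet) % n


inputs = [(1, 1, 1), (1, 2, 3), (2321, 322, 232), (3, 3, 3)]
n = 5

# Group the inputs by the bucket they land in.
buckets = defaultdict(list)
for triplet in inputs:
    buckets[bucket_of(triplet, n)].append(triplet)

for index in range(n):
    print(index, buckets.get(index, []))
```

With this particular hash, (2321, 322, 232) and (3, 3, 3) both land in bucket 2, while buckets 0 and 1 stay empty.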

The power of the hash (and the reason the average case is O(1)) is made clear by noting that in order to retrieve an input value from the hash table (e.g. (1,1,1)) the following steps occur:

1. The input value's hash is calculated and hash = hash % n is applied, so (1,1,1) -> 2.
2. A direct O(1) lookup is performed, i.e. hash_table[2] = (1,1,1) + additional data stored with this particular input value.
3. Done!

In the case where more than one input value maps to the same hash value (0 in our example), the internal algorithm needs to do a search on those input values which is often done using a red-black tree (worst case O(log n)). The worst case for any lookup is thus also O(log n).
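The lookup steps and the collision case can be sketched with a tiny hash table. This is an assumption-laden illustration, not how any particular library works: the class name is hypothetical, the hash uses the multipliers from the comments, and each bucket is a plain list that is scanned linearly rather than a tree.

```python
class ChainedGrid:
    """Minimal hash table keyed by (x, y, z); collisions go into per-bucket lists."""

    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # Step 1: compute the hash and reduce it modulo the number of buckets.
        x, y, z = key
        return (x * 31 + y * 37 + z * 41) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))

    def get(self, key, default=None):
        # Step 2: jump straight to the bucket, then search only its chain.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


grid = ChainedGrid(5)
for cell in [(1, 1, 1), (1, 2, 3), (2321, 322, 232), (3, 3, 3)]:
    grid.put(cell, "data for %s" % (cell,))
print(grid.get((3, 3, 3)))
print(grid.get((9, 9, 9)))  # a cell that was never stored
```

As long as the chains stay short, the search inside a bucket costs next to nothing, which is why rare collisions are acceptable.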

A perfect hash occurs when the relation becomes a one-to-one onto function (a bijection). This gives best performance but is rare. As I stated earlier, luckily it is easy to produce an almost perfect hash where duplicates are scarce. In essence make your hash function as random as possible.

The examples I gave in the comments might be adequate (though they are the wrong way to do it :) ), but a more standard calculation would be: hash = (((((prime1 + value1) * prime2) + value2) * prime3) + value3) * prime4

which also answers the question. Note that the primes can be any prime numbers, but small values like 31, 37, etc. are usually used in practice.
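The nested multiply-and-add scheme described above could look like the sketch below. The specific primes are assumptions picked for illustration. Python integers do not overflow, so the mask imitates the wraparound of a fixed-width 64-bit integer.

```python
MASK64 = (1 << 64) - 1  # imitate 64-bit wraparound arithmetic


def combine_hash(x, y, z, p1=17, p2=31, p3=37, p4=41):
    """Fold the three coordinates into one hash; the primes are assumed values."""
    h = p1 + x
    h = (h * p2 + y) & MASK64
    h = (h * p3 + z) & MASK64
    return (h * p4) & MASK64


# The hash map would then reduce the result to a bucket index, for example:
n = 2048  # hypothetical table size
print(combine_hash(826, 1013, 275) % n)
```

Because each coordinate is folded in at a different stage, swapping two coordinates generally changes the result, unlike a plain sum of the coordinates.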

In practice testing can be used to check the performance but is usually not necessary.

In any case, re-reading your question, I am wondering why you don't drop the entire hash idea and just store your points in a simple array?

Can you name an implementation which does collision handling using Red-Black trees? At least [Mono does not](https://github.com/mono/mono/blob/cfe97bf3a54163fcb639066f0bf56fa401bf6ab0/mcs/class/corlib/System.Collections/Hashtable.cs#L615). Furthermore, a perfect hash has to be injective, but not surjective: you don't require a possible input for every element of the hash result range. — MvG, Sep 05 '14 at 10:12
You are right - I spoke out of memory when mentioning Red-Black trees. Also, I took the range to be the image of the hash function (which is usual in mathematics) but does not make sense in this context. — Floris, Sep 05 '14 at 11:13
With big models in large extents it ran out of memory to create certain number of cells for the grid, for small cell dimension. Whereas a Dictionary like collection gives the opportunity to only store cells that overlap with triangles. — ali, Sep 05 '14 at 11:28
