
I am trying to set up a hash table (in C++, using the `unordered_map` container) that will contain 1875 integer keys randomly distributed within the interval 0 to 4891. Now my problem is that the distribution within this interval is not uniform, but rather looks like this: [scatter plot of the 1875 integers]

where each of the 1875 random integers is plotted as a point with x corresponding to the integer value and y = 1 (so as to visualize the distribution).

Clearly the distribution is such that there are wide gaps where no random integers lie. If I use the identity function as my hash function (i.e. use the random integers themselves as hash values), I get 714 empty buckets, 814 buckets with a single element, 499 buckets with 2 elements and 21 buckets with 3 or more elements.

I'm using the Intel C++ compiler, and its hash table implementation uses powers of 2 for the number of buckets. In my case right now the hash table has 2^11 = 2048 buckets.

What would be a good hash function for this case? My understanding is that a good hash function here would break up these clusters and spread the values more uniformly over the buckets, but how could one achieve that?
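For reference, here is a minimal sketch of the kind of setup I have in mind (the identity-hash functor and the bucket-counting loop below are only illustrative, not my actual code):

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

// Illustrative identity hash: the key itself is used as the hash value.
struct IdentityHash {
    std::size_t operator()(int key) const noexcept {
        return static_cast<std::size_t>(key);
    }
};

int main() {
    std::vector<int> keys = {/* the 1875 node indices in [0, 4891] go here */};

    std::unordered_map<int, int, IdentityHash> table;
    for (int k : keys) table[k] = 0;  // dummy mapped values

    // Histogram of bucket occupancy: empty, 1, 2, and 3+ elements per bucket.
    std::size_t counts[4] = {0, 0, 0, 0};
    for (std::size_t b = 0; b < table.bucket_count(); ++b) {
        const std::size_t s = table.bucket_size(b);
        ++counts[s < 3 ? s : 3];
    }
    std::cout << "buckets: "  << table.bucket_count()
              << "  empty: "  << counts[0]
              << "  size 1: " << counts[1]
              << "  size 2: " << counts[2]
              << "  size 3+: " << counts[3] << '\n';
}
```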

user3208430
  • So you only have 1800 integers? How about a sorted `std::vector` of 1800 Key value pairs and then binary searching through it? This will at least be worth measuring. – Baum mit Augen Nov 12 '15 at 21:29
  • Boost's `flat_map` [(overview)](http://www.boost.org/doc/libs/1_59_0/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx) [(API docs)](http://www.boost.org/doc/libs/1_59_0/doc/html/boost/container/flat_map.html) is a nice, full implementation of what @BaummitAugen suggests--its interface is like an unordered_map, but it's implemented like an ordered vector. – Mark Waterman Nov 12 '15 at 21:39
  • Cool suggestions, I actually didn't know about sorted vectors. But my understanding is that for both sorted vectors and binary trees, the lookup operation scales as O(log n). Let's say that I need this data structure to scale to much larger numbers and still have O(1) lookups. – user3208430 Nov 12 '15 at 21:46
  • @user3208430 This O(.) stuff only matters if you cannot give an upper bound; if you can (you almost always can, and be it the size of the memory of your target machines), you can measure for this upper bound (and measure for the most likely order of magnitude first, of course). Also, the logarithm is bounded for all uses in the foreseeable future. Just think: even if you have n=2^64 elements in your collection (that is, your collection of `char` fills the *entire* 18.45 EB RAM of a 64 bit machine), log_2(n) still is only 64, which is about the cost of a *single* cache miss (L1 vs RAM). – Baum mit Augen Nov 12 '15 at 21:55
  • Assume you need to perform n lookups on your data structure, then even n = 2^10 becomes significantly more expensive. – user3208430 Nov 12 '15 at 22:02
  • @user3208430, I'm coming around--[here's a benchmark](http://stackoverflow.com/questions/21166675/boostflat-map-and-its-performance-compared-to-map-and-unordered-map) of the various associative containers, and unordered_map did very well compared to a flat_map for random searches (though a flat_map rocks for iteration). – Mark Waterman Nov 12 '15 at 22:05
  • You could just run it through murmurhash3 [(source)](https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp), which is highly regarded. Your input is very small (just one block for its algorithm to process), but it might shuffle things around enough to suit your needs, and it'll be fast! – Mark Waterman Nov 12 '15 at 22:09
  • @user3208430 As I said above, one should of course always measure which is faster if performance matters. I'm just saying that the common assumption *"Oh, I have a couple of billion elements, that's many, thus O(1) better than O(n)"* often enough does not hold. Again, always measure when talking about performance. – Baum mit Augen Nov 12 '15 at 22:10
  • The distribution of your bucket loads is actually *better* than that of any *standard* hash function. An option could be an unsigned multiply by a (large) odd number (see the sketch after these comments). – wildplasser Nov 18 '15 at 12:14
  • @wildplasser, tried that but it seems to bring me closer to the case of a uniform distribution with repetition (see my "answer" below) and tends to create more collisions – user3208430 Nov 18 '15 at 17:10
  • How about having 4891 buckets and using the identity function as the hash function? It could take less space, since you need no pointers and overflow chains and all that stuff (it is called indexing). And a 30% load factor is not so weird if it guarantees you perfect *hashing*... – wildplasser Nov 18 '15 at 17:20
  • I appreciate your suggestion wildplasser. That solution is indeed quite possible, except that in my specific use case (I know I don't mention this in the question), the solution has to scale up to a much higher number of elements. My integers are in fact node indices in the context of a meshed volume, and the integers to be placed in the hash table are the boundary node indices, chosen among all the volumetric node indices. While in the case of 4891 volumetric nodes I have 1875 boundary nodes, for a much higher number of volumetric nodes the number of boundary nodes becomes a much smaller fraction. – user3208430 Nov 18 '15 at 17:39
  • I don't follow completely, but either you have two key domains to be hashed (could it be nested?) or you can have list members with the 1:N related adjacent items or their keys. BTW: is the structure *static*, or are there deletions/insertions taking place? [I suggest you open a new question with the actual problem] – wildplasser Nov 18 '15 at 19:28
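For reference, the "unsigned multiply by a (large) odd number" idea from the comments would look something like the functor below (the constant is the 64-bit golden-ratio multiplier, picked here purely as an example of a large odd number):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Multiplicative hash: multiply the key by a large odd constant so that the
// key's bits get spread across the whole 64-bit word.
struct MultiplicativeHash {
    std::size_t operator()(int key) const noexcept {
        const std::uint64_t h =
            static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ull;
        // Fold the high bits back down, since a power-of-two bucket count
        // effectively keeps only the low-order bits of the hash value.
        return static_cast<std::size_t>(h ^ (h >> 32));
    }
};

// Usage: std::unordered_map<int, int, MultiplicativeHash> table;
```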

3 Answers


I've found that Pearson's Hash Function is an excellent way to get randomness:

https://en.wikipedia.org/wiki/Pearson_hashing

Basically, the idea is that it scrambles the input through an array of 256 randomly permuted values by default, but you might be able to modify it to 1800 for your scenario. The important thing is that the array is small enough to fit into memory.
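A minimal C++ sketch of the idea (the table setup and the functor shape below are my own illustration rather than a reference implementation; note that a single pass yields only an 8-bit value):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>

// 256-entry lookup table holding a random permutation of 0..255; the
// scrambling quality of Pearson hashing comes entirely from this table.
static const std::array<std::uint8_t, 256> kPearsonTable =
    []() -> std::array<std::uint8_t, 256> {
        std::array<std::uint8_t, 256> t{};
        std::iota(t.begin(), t.end(), 0);
        std::mt19937 rng(12345);  // fixed seed so hashes stay reproducible
        std::shuffle(t.begin(), t.end(), rng);
        return t;
    }();

// Pearson hash of the key's bytes (assuming a 32-bit int). One pass gives an
// 8-bit value; for more buckets you would run several passes with different
// starting bytes and concatenate the results.
struct PearsonHash {
    std::size_t operator()(int key) const noexcept {
        std::uint8_t h = 0;
        for (int i = 0; i < 4; ++i) {
            const std::uint8_t byte = static_cast<std::uint8_t>(key >> (8 * i));
            h = kPearsonTable[h ^ byte];
        }
        return h;
    }
};
```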

rts1

If you need to reduce the number of collisions, it may help to look at a specialized hashing scheme like cuckoo hashing. Essentially you spread the keys across multiple hash functions (relocating entries on insertion) so that lookups stay O(1).

If the collisions are inexpensive though (e.g. the colliding entries fit on a cache line, or the access pattern is predictable), you'll probably still see better performance with collisions, regardless of the asymptotic cost.

Flat structures tend to be used for this reason, since they have good cache characteristics. It's also one of the reasons they tend to be preferred when performance is important.
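To illustrate the flat-structure point, a sorted `std::vector` searched with `std::lower_bound` is roughly what Boost's `flat_map` does under the hood. A minimal sketch (not the Boost API; the names are mine):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// A minimal "flat map": key/value pairs kept sorted by key in contiguous
// storage and looked up with binary search. Lookups are O(log n), but the
// memory layout is very cache-friendly.
using FlatMap = std::vector<std::pair<int, int>>;

inline void sort_by_key(FlatMap& m) {
    std::sort(m.begin(), m.end(),
              [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  return a.first < b.first;
              });
}

// Returns a pointer to the mapped value, or nullptr if the key is absent.
inline const int* find_value(const FlatMap& m, int key) {
    auto it = std::lower_bound(m.begin(), m.end(), key,
                               [](const std::pair<int, int>& p, int k) {
                                   return p.first < k;
                               });
    return (it != m.end() && it->first == key) ? &it->second : nullptr;
}
```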

Jason

So I spent some time trying different things out. Here are my conclusions so far.

First, one has to realize that fitting 1875 elements into a hash table with 2048 buckets is likely to produce quite a few collisions. Indeed, if one assumes that each element has an equal probability of being assigned to any one of the 2048 buckets, then the expected number of collisions is 646 (by an argument similar to the so-called birthday problem, see https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions?rq=1; the formula is: expected number of collisions = n - N * (1 - (1 - 1/N)^n), where n is the number of elements and N is the number of buckets). This would be the case if, for example, the 1875 elements were chosen randomly within the [0, 2047] interval with repetitions allowed, or if they were chosen randomly in an interval very large compared to the 2048 buckets, with or without repetitions.
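As a quick sanity check of that formula with the numbers above (a throwaway snippet, names are mine):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double n = 1875.0;  // number of elements
    const double N = 2048.0;  // number of buckets
    // Expected number of collisions = n - N * (1 - (1 - 1/N)^n)
    const double expected = n - N * (1.0 - std::pow(1.0 - 1.0 / N, n));
    std::printf("expected collisions: %.1f\n", expected);  // prints ~646.6
}
```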

Bearing this in mind, the 541 collisions obtained with the identity function as the hash function (see the original question) do not seem too bad. The reason why the number of collisions is smaller than in the uniform-distribution case, despite the large gaps in the distribution, is that by the nature of the problem the 1875 elements have distinct values, and therefore only elements greater than or equal to 2048 can cause collisions, as they are wrapped around by the modulo operation.

Now, we know that a hash function that maps our input interval [0, 4891] randomly and uniformly to a much larger interval (the full range of a 32-bit integer, for example) is not desirable, since it will cause more collisions than the identity hash function. However, one could wonder whether a random and uniform mapping from the input interval [0, 4891] to some not overly large interval (it could be the same [0, 4891] interval or any other interval like [0, 2048], [0, 5000], etc.) would reduce collisions. I have tried Pearson-like mappings as suggested by rts1, but found that they don't improve the number of collisions.

So far what I've settled on using is simply the identity function as the hash function combined with making sure that my number of elements is not too close to my number of buckets (1.5 times the number of elements seems pretty reasonable for the number of buckets).
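In code, that sizing rule amounts to capping the load factor (a small sketch; the function name is mine, and with a power-of-two bucket policy the actual bucket count gets rounded up):

```cpp
#include <cstddef>

// Keep bucket_count() at roughly 1.5x the number of elements, i.e. cap the
// load factor at about 0.67. Works for any unordered_map instantiation,
// including one with a custom (identity) hash.
template <class Map>
void presize_table(Map& table, std::size_t expected_elements) {
    table.max_load_factor(1.0f / 1.5f);
    table.reserve(expected_elements);  // >= expected_elements / max_load_factor buckets
}
```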

user3208430