I am trying to figure out a fast datatype to store pairs of ints, where the API is just add, remove and isMember
. Considering isMember
must be fast, the obvious idea is to use a hash-map. Hash functions are mostly made for unbounded strings, so, my question is: considering what I am trying to hash is just a pair of ints, what is a fast hash-function with good collision properties?
Asked
Active
Viewed 2,658 times
4

MaiaVictor
- 51,090
- 44
- 144
- 286
-
It will really depend on the distribution of your data, although `num1*prime + num2` (so say `num1*31 + num2`) is usually decent. – Bernhard Barker May 29 '14 at 12:26
-
Do you know anything about the distribution of the integers? If both integers are independently uniformly distributed, just adding them is good enough. If there's any correlation between the two ints, we need more information. – Sven Marnach May 29 '14 at 12:26
-
I don't think there is a relation at all. Not any obvious that I can think of, the data is completely arbitrary afaik. It can be any prime, Dukeling? – MaiaVictor May 29 '14 at 12:27
-
Hmm, may be a dirty hack `long int hash = (i1 << 32) & i2;` – πάντα ῥεῖ May 29 '14 at 12:27
-
http://stackoverflow.com/questions/114085/fast-string-hashing-algorithm-with-low-collision-rates-with-32-bit-integer check out this link for your answer – Venkatesan May 29 '14 at 12:28
-
1@Venkatesan: This is about string hashing, which the OP explicitly is not interested in. – Sven Marnach May 29 '14 at 12:30
-
your ints are of 16 or 32 bits? – 4pie0 May 29 '14 at 12:33
-
If the answer is different, I'd like to hear both (as well as 64 bits, etc). The size of each int depends on the maximum size of a specific program. ~~ I've been thinking in those lines. XOR too. I just wondered if any of those is particularly best suited. – MaiaVictor May 29 '14 at 12:33
-
I don't really know what makes one prime better than another - the standard Java API, for example, uses 31 pretty consistently. – Bernhard Barker May 29 '14 at 12:38
-
1@Viclib: If your the two ints are independently uniformly distributed, even just taking one of the two integers and completely ignoring the other one is as good a hash function as it gets (and certainly a fast one). So without any further restrictions, it's not possible to give you a better answer. – Sven Marnach May 29 '14 at 12:39
-
That is actually a good answer. I might ask it again, except for the specific problem I am trying to solve? – MaiaVictor May 29 '14 at 12:39
-
@SvenMarnach how can it be a good hash function since it mapps all pairs (x,y) to x (ignore y) ? – 4pie0 May 29 '14 at 12:43
-
1@bits_international: Under the ssumption that both of the intergers are independently uniformly ditributed, this will give you a perfectly uniformly distributed hash, so you can't get any better than that, nor any faster. In the real world, your data usually isn't like this, but that's exactly the point I meant to make. – Sven Marnach May 29 '14 at 12:52
-
Okay, [I just posted the question again](http://stackoverflow.com/questions/23933808/what-is-a-fast-data-structure-to-implement-a-pair-storage-as-specified-here), now being more specific about the actual need for that structure so you guys can understand better how the data is supposed to be used. I'm going to delete this question in a few minutes. Thanks! – MaiaVictor May 29 '14 at 12:54
-
@SvenMarnach this is not correct, if you map (x,y)->x, you will have only N buckets, each bucket will contain N pairs, probability of a collision would be 1/N – 4pie0 May 29 '14 at 22:47
-
@bits_international: The whole point of a hash function is that it maps the space of all values to a smaller space, typically an int. A good hash function aims for a uniform random distribution across the hash space. Under the assumption that both of your input integers are already randomly distributed across the whole space, simply taking one of them perfectly achieves this goal, as does taking the sum, XORing them or computing `a + p*b`. In any case, the number of buckets is limited by the number of possible int values. – Sven Marnach May 30 '14 at 10:20
-
@SvenMarnach no, I don't agree – 4pie0 May 30 '14 at 10:58
-
@bits_international: Now you are obviously trolling. :) Let's stop right here. – Sven Marnach May 30 '14 at 11:35
-
@SvenMarnach I am far from it. I stay with what I said, and I don't claim to convince you, it is hard to do here, using comments. There are important differences between univariate variable distribution and joint multivariate distribution, and (x,y)->x wouldn't be correct, it will introduce high probability of collision – 4pie0 May 30 '14 at 11:45
-
@SvenMarnach A small nitpick: if they are signed, overflowing would be a problem; so unsigned addition is needed. – Acorn Jul 13 '19 at 11:33
2 Answers
2
For a pair of int
s you can go for Szudzik's Function. It 'elegantly' pairs two natural numbers to a unique single number.
Since you mentioned int
s, it can also be negative. In that case, use different hashmaps for positive-positive, positive-negative, negative-positive, negative-negative pairs.

double-beep
- 5,031
- 17
- 33
- 41

Arkajyoti Banerjee
- 466
- 5
- 12
0
The best you can get is to use a hash function for long long
(e.g. in C++
it is built in) and use (p.first * (INT_MAX + 1) + p.second)
. This will work quite well in c++11
and also most of the common implementations of hash_map
have a hash function for long long
if this is not available you can use (((long long)p.first * prime1) + (long long)p.second) % prime2
where prime1
and prime2
are prime numbers that fit in integer.

Ivaylo Strandjev
- 69,226
- 18
- 123
- 176
-
-
This approach is very similar to the way hash is computed for strings. In this case the string consists of two characters but from a way bigger alphabet – Ivaylo Strandjev May 30 '14 at 04:26