18
#include <functional>  // std::hash
#include <iostream>

int main() {
    std::hash<int> hash_f;
    std::cout << hash_f(0) << std::endl;
    std::cout << hash_f(1) << std::endl;
    std::cout << hash_f(2) << std::endl;
    std::cout << hash_f(3) << std::endl;
}

I compile with "g++ main.cpp -std=c++11", and the result is:

0
1
2
3

Why is this happening? I'm not using any external library, and I haven't specialized the hash function myself.

Addendum: I wanted to define a hash for an unordered_set of unordered_set of int, with the hash of a set being the sum of its components' hashes. But if std::hash<int> is just the identity, that's no good, because the hash of {2,4} is then the same as the hash of {1,5}. Maybe the simplest way to avoid that is to use std::hash<double>.
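To make the collision concrete, here is a minimal sketch (the helper name naive_set_hash is mine, just for illustration):

#include <functional>
#include <iostream>
#include <unordered_set>

// Naive set hash: the sum of the element hashes. With an identity
// std::hash<int>, both {2,4} and {1,5} sum to 6 and therefore collide.
std::size_t naive_set_hash(const std::unordered_set<int>& s) {
    std::size_t sum = 0;
    std::hash<int> h;
    for (int x : s) sum += h(x);
    return sum;
}

int main() {
    std::cout << naive_set_hash({2, 4}) << std::endl;  // 6 here
    std::cout << naive_set_hash({1, 5}) << std::endl;  // 6 again: a collision
}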

François

3 Answers

12

It seems it is the identity; that's allowed, since it maps distinct inputs to distinct hashes. From cppreference:

The actual hash functions are implementation-dependent and are not required to fulfill any other quality criteria except those specified above. Notably, some implementations use trivial (identity) hash functions which map an integer to itself. In other words, these hash functions are designed to work with unordered associative containers, but not as cryptographic hashes, for example. [...]
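To see that this is a per-type, per-implementation choice rather than a general rule, you can compare it against std::hash<double>, which is typically not trivial. A quick sketch (the exact outputs will vary with your standard library):

#include <functional>
#include <iostream>

int main() {
    std::hash<int> hi;
    std::hash<double> hd;
    for (int i = 0; i < 4; ++i) {
        // On libstdc++, hi(i) is simply i, while hd typically hashes the
        // bit pattern of the double (0.0 is often special-cased to 0).
        std::cout << hi(i) << '\t' << hd(static_cast<double>(i)) << std::endl;
    }
}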

nayana
8

It seems completely reasonable for the hash function int -> int to be the identity, and it's not clear why you are surprised by that. Performing any further computation would be pointless. This is, in fact, a perfect hash in every sense of the term.

Remember, std::hash is supposed to (almost uniquely) identify values, not encrypt them.

It's only when you want to hash types larger than that of the hash itself (say, uint9999999_t) that you need to do some work to "compress" the value into the size of the hash.
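As a rough illustration (hypothetical, not any particular library's actual implementation): on a platform whose size_t is 32 bits, a 64-bit key might be "compressed" by folding the high half onto the low half:

#include <cstdint>

// Hypothetical compression of a 64-bit key into a 32-bit hash:
// XOR-fold the high 32 bits onto the low 32 bits so that every
// input bit still influences the result.
std::uint32_t fold_hash(std::uint64_t key) {
    return static_cast<std::uint32_t>(key ^ (key >> 32));
}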

Lightness Races in Orbit
  • For an input which is bounded by values greater than int min and smaller than int max, e.g. (-1000, +1000), the distribution will be ... not optimal. – dyp Jul 11 '16 at 11:03
  • @dyp: If you have such an input then why are you storing it in an `int`? You're always free to provide your own range-aware hash function customised for your application, but I see no reason for the default implementation to make _any_ assumptions in that regard. – Lightness Races in Orbit Jul 11 '16 at 11:04
  • Add a single entry of `INT_MAX` and you have the same issue, except that you have to use `int`. To me, it is debatable whether or not the default implementation should try to compensate common skews in the input data; therefore to me the question is, *what are the arguments why not to try to do that?* – dyp Jul 11 '16 at 11:18
  • @dyp: I don't think you can come up with any "common skews". You're chasing optimisations without any data! If the default implementation optimises for case _X_ then it's suddenly suboptimal for case _Y_. Let the default do the sensible, obvious thing then replace it with your own specific hash function if you need one. – Lightness Races in Orbit Jul 11 '16 at 11:26
  • Actually, thinking about it again, the modulo applied after the hash will distribute things too, so the impact is not as high as I originally thought. Distributions that collide with a trivial hash function look much more weird, e.g. (N, N+10) where N is the applied modulo. -- Yes, I don't have any data for "common skews", nor do I know if there are few enough of them to optimize for. But it's not me who decides on a default; a default should be there if it works well enough for many cases and then you need those cases to test. Otherwise, no default should be provided IMHO. – dyp Jul 11 '16 at 11:53
  • according to cppreference.com, the hash function object seems to be `int -> size_t`, not `int -> int`. –  Jul 11 '16 at 20:05
  • *"Performing any further computation would be pointless. This is, in fact, a perfect hash in every sense of the term."* - this completely misses the trade-offs here, namely between identity hashes *post modulo table size* (which is what matters) being extremely collision prone in the worst case (even with a prime number of buckets), vs saving CPU time on stronger hashing, folding near-incrementing values across buckets more uniformly than stronger hashing does, and (very minor benefit) having better cache locality if lookups happen to be done in order of increasing key. – Tony Delroy May 30 '19 at 01:51
6

The other answers cover the rationale behind the identity function very well. To address your addendum:

I wanted to define the hash of an unordered_set as the sum of its components' hashes, but if it's just the identity, that's no good, because the hash of {2,4} is the same as the hash of {1,5}. Maybe the simplest way to avoid that is to use std::hash<double>.

As you can see, using the + operator to combine hashes is not the best idea. To be more robust, you could use the XOR (^) operator, or take inspiration from the approach taken, e.g., by boost::hash_combine (details in this SO post):

seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

As an example, for your two integer pairs ({1,5} and {2,4}) and a seed of 0, this works out to:

uint32_t seed = 0;  // combining the elements of {1, 5}
seed ^= 1 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
seed ^= 5 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
// seed is now 3449077526

uint32_t seed = 0;  // combining the elements of {2, 4}
seed ^= 2 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
seed ^= 4 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
// seed is now 3449077584
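One caveat if you apply this to an unordered_set directly: its iteration order is unspecified, and hash_combine is order-sensitive, so two equal sets could hash differently. A sketch that sidesteps this by combining over sorted elements (the names hash_combine and SetHash are mine):

#include <algorithm>
#include <functional>
#include <unordered_set>
#include <vector>

// Boost-style combine step; note that it is order-sensitive.
void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct SetHash {
    std::size_t operator()(const std::unordered_set<int>& s) const {
        // Copy and sort the elements so that equal sets, whatever their
        // internal bucket order, are combined in the same sequence.
        std::vector<int> v(s.begin(), s.end());
        std::sort(v.begin(), v.end());
        std::size_t seed = 0;
        std::hash<int> h;
        for (int x : v) hash_combine(seed, h(x));
        return seed;
    }
};

// Usage: the container the addendum asks for.
using SetOfSets = std::unordered_set<std::unordered_set<int>, SetHash>;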
mindriot
  • Why can't you just use `seed ^= x`? Why do you need the `+ 0x9e3779b9 + (seed << 6) + (seed >> 2)`? – François Jul 11 '16 at 12:04
  • You can just go for a simple XOR (as I wrote); but have a look at the [SO article I linked above](http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine) for some rationale behind the Boost approach. – mindriot Jul 11 '16 at 12:08
  • A simple practical example: combining the hashes of two 0s would in this case be `0 ^ 0`, which is again 0. But you could argue that two 0s should produce a different hash from a single 0. Another rationale is that two consecutive hash values should be far apart from each other. – mindriot Jul 11 '16 at 12:09
  • Thanks a lot, and sorry that I can't accept two answers; I should have asked a separate question for the addendum and accepted your answer there. – François Jul 11 '16 at 12:16
  • No worries, it's fine as it is. – mindriot Jul 11 '16 at 12:30