17

Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated. The actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as if it was an integer?

Like:

std::size_t hash_value(float f)
{
  return hash_value(*(reinterpret_cast<int*>(&f)));
}

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?

mskfisher
  • 3,291
  • 4
  • 35
  • 48
Channel72
  • 24,139
  • 32
  • 108
  • 180
  • 1
    As written you violate the strict-aliasing rules (and worse if `int` isn't the same size as `float`), but I'd be curious for an answer involving converting it to a `char*` of given length. – Mark B Sep 13 '11 at 14:09
  • possible duplicate of [Hash function for floats](http://stackoverflow.com/questions/4238122/hash-function-for-floats) – legends2k Mar 27 '14 at 10:42

4 Answers4

6

Take a look at https://svn.boost.org/trac/boost/ticket/4038

In essence it boils down to two things:

  • Portability: when you take the binary representation of a float, then on some platform it could be possible that a float with a same value has multiple representations in binary. I don't know if there is actually a platform where such an issue exists, but with the complication of denormelized numbers, I'm not sure if this might actually happen.

  • the second issue is what you proposed, it might be that sizeof(float) does not equal sizeof(int).

I did not find anyone mentioning that the boost hash indeed avoids fewer collisions. Although I assume that separating the mantissa from the exponent might help, but the above link does not suggest that this was the driving design decision.

H. Brandsmeier
  • 957
  • 5
  • 19
5

One reason not to just use the bit pattern is that some different bit patterns must be considered equals and thus have the same hashcode, namely

  • positive and negative zero
  • possibly denormalized numbers (I don't think this can occur with IEEE 754, but C allows other float representations).
  • possibly NANs (there are many, at least in IEEE 754. It actually requires NAN patterns to be considered unequal to themselves, which arguably means the cannot be meaningfully used in a hashtable)
Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
  • 2
    Why must NANs be considered equal? `NAN != NAN`. – Steve Jessop Sep 13 '11 at 14:52
  • Actually, I take that back. The standard says "all evaluations of the expression `h(k)` with the same value for `k` yield the same result." not (as I expected), "all evaulations `h(k)` and `h(l)` where `k == l` yield the same result". So `==` is less relevant than "being the same value", according to the hash requirements. Whether two different NANs are "the same value" is a standards-combing exercise I can't be bothered with :-) – Steve Jessop Sep 13 '11 at 14:55
  • 4
    even though `NaN != NaN`, hash functions are to be treated as black-boxes; that is, the consumer of hash values (such as associative containers) will not make provision for non-comparability. Thus, either `hash(NaN) == hash(NaN)` has to be true, or the data type shouldn't be hashable. – rwong May 29 '13 at 17:18
4

Why are you wanting to hash floating point values? For the same reason that comparing floating point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take into account denormalized numbers different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it wouldn't have come up otherwise in your algorithm but you still need to take care about signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.

Mark B
  • 95,107
  • 10
  • 109
  • 188
  • 1
    Can you elaborate on why you think hashing +/- infinity and NaN is meaningless? I fail to see why this could be an issue. – Luc Touraille Sep 13 '11 at 14:34
  • 3
    At least with IEEE 754, I don't think there can be different patterns with the same value due to denormalization – Michael Borgwardt Sep 13 '11 at 14:36
  • @Michael Borgwardt Certainly +/- 0 and various values of NaN could occur in IEEE 754. I'm not sure about other values though. – Mark B Sep 13 '11 at 14:39
  • Unless the standard says otherwise, I'd expect +/- inf to be treated like any other values of the type - ideally they should have different hashes. If there are two representations of +inf in your floating point type, then they must have the same hash. – Steve Jessop Sep 13 '11 at 14:51
  • 1
    Nice answer except the pointless "Why would you want to do this?". There are use-cases. Just because you can't think of them... – Timmmm Nov 19 '12 at 15:10
  • I have a use case : I have to store a bunch of 3D vectors, but many of them are equal. If I could find a list of unique values, and then only use indices in this list, I would reduce memory consumption by ~75%. A hash table is the fastest way to group them. But I need a hash of a 3D vector then... For information, these vectors are smoothed normals in a 3D triangle mesh. – galinette Dec 04 '15 at 09:39
0

I imagine it's so that two machines with incompatible floating-point formats hash to the same value.

TonyK
  • 16,761
  • 4
  • 37
  • 72
  • And Imagine I just want to use floats in a std::unordered_set, there is no reason the hash value will be shared between machines. – galinette Dec 04 '15 at 09:45