25

I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...

I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...

Is there a good way to hash floats?

You don't have to give me the function directly, but I'd like to see/understand different concepts...

Notes:

  1. I don't need it to be really fast, just evenly distributed if possible.

  2. I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)

Bill the Lizard
  • 398,270
  • 210
  • 566
  • 880
Pacane
  • 20,273
  • 18
  • 60
  • 97

7 Answers7

18

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

President James K. Polk
  • 40,516
  • 21
  • 95
  • 125
  • Could you please explain a little more? "The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer." – Pacane Nov 21 '10 at 17:30
  • +1 -- The answer is not to do it in the first place. Don't use floats as keys in maps or hash-tables; you'll run into problems sooner or later. – Leo Davidson Nov 21 '10 at 18:11
  • 2
    @Leo Davidson I know I'll run in troubles, the goal of this exercise is to find when exactly ;-) – Pacane Nov 21 '10 at 20:10
  • 23
    Downvote because it doesn't answer the question. I am here because I need non-exact hashes. Advice on dangers is all nice and well, but answering the question is better. – xcut Nov 09 '12 at 16:00
  • 5
    It is often a disservice to the questioner to simply answer questions as asked. I speak from experience on this forum. Someone asking to hash floats is probably pursuing a wrong course, especially considering the question as asked. If you want to ask a question about fuzzy lookups, about equivalence classes of floating point numbers, and so on that is a different question. – President James K. Polk Nov 09 '12 at 21:46
  • 4
    Just because the asked question is usually the wrong approach doesn't mean it *always* is. Sometimes programmers need to do things not because it is a good idea, but nonetheless, there are external factors compelling it (e.g. a stubborn boss, shortsighted requirements, etc.). Or there might even be unusual circumstances where it does make sense. Answering the question as-is would be a disservice, but part of the purpose of Stack Overflow is to provide a knowledge base for future Google searchers. The actual question shouldn't be entirely ignored. – Dominick Pastore Apr 07 '20 at 15:02
  • 2
    @PresidentJamesK.Polk Just because you’re unaware of valid use cases doesn’t mean there’re no such use cases. For instance, STL files https://en.wikipedia.org/wiki/STL_(file_format) contain a soup of triangles, yet modern 3D GPUs are performing best on indexed meshes: vertices shared across triangles save gigaflops of computations in vertex shaders. – Soonts Jun 03 '22 at 14:48
  • 1
    @Soonts: What is the evidence that the OP's case was one of those use cases? Where do I claim that there are *no* such use cases? – President James K. Polk Jun 03 '22 at 16:23
12

If your hash function did the following you'd get some degree of fuzziness on the hash lookup

unsigned int Hash( float f )
{
    unsigned int ui;
    memcpy( &ui, &f, sizeof( float ) );
    return ui & 0xfffff000;
}

This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.

Goz
  • 61,365
  • 24
  • 124
  • 204
  • 2
    No, `0xfffff000` masks off 3 nibbles, which is 12 bits. Probably a little too much. If you want to mask off 3 bits, use `0xfffffff8`. – fredoverflow Nov 21 '10 at 14:06
  • 1
    @FredOverflow: No .. you are right .. I didn't mean 3 ... mind failure there. changed – Goz Nov 21 '10 at 14:10
  • @Goz: this depends on the internal representation of `float` on the target machine though, since you assume here than the mantissa is located in the least significant bits, and is stored in little-endian fashion. Though the idea of fuzziness is definitely the way to go. – Matthieu M. Nov 21 '10 at 17:22
  • 1
    You're still going to end up with pairs of numbers with a very small relative difference that end up in different bins. – Ben Voigt Nov 21 '10 at 18:01
  • @Ben: Correct, but that only happens in 1 of 4096 cases (if "very small" means direct neighbors in the set of float values). Plus, this solution has the beautiful property of equality being transitive, that is, `Hash(x) == Hash(y)` and `Hash(y) == Hash(z)` implies `Hash(x) == Hash(z)`, which does *not* hold if you use the good old delta trick of comparing floats directly. I really like this solution! – fredoverflow Nov 21 '10 at 18:09
  • @Matthieu: You are, of course, quite right. However for it to become a problem a big endian float would have to get stored as a little endian (or vice-versa). Otherwise the mask is just in the correct endian format too and everything is good. Seeing as I haven't yet come across a platform that has different endian rules for float and integers I'm going to say your point is worth bearing in mind but largerly irrelevant. – Goz Nov 21 '10 at 18:41
  • 3
    @Ben: Of course you will. If you are bucketing, which a hash algorithm necessarily does, you will always have this issue. imagine buckets at every 0.1 that go to 0.05 either side. that means that 1.4999999 goes in 1 bucket and 1.5 goes in another. You just have to live with that or ditch any form of bucketing ... – Goz Nov 21 '10 at 18:44
10

You can use the std hash, it's not bad:

 std::size_t myHash = std::cout << std::hash<float>{}(myFloat);
O_Z
  • 1,515
  • 9
  • 11
6
unsigned hash(float x)
{
    union
    {
        float f;
        unsigned u;
    };
    f = x;
    return u;
}

Technically undefined behavior, but most compilers support this. Alternative solution:

unsigned hash(float x)
{
    return (unsigned&)x;
}

Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.

fredoverflow
  • 256,549
  • 94
  • 388
  • 662
  • 2
    Aren't there some standard functions that can be used to grab the mantissa and exponent? I'm not a float kind of guy, or much of C++ guy either, so I was just wondering ... – President James K. Polk Nov 21 '10 at 13:54
  • @GregS: Not as far as I know. Why would you want to grab the mantissa and exponent, anyway? A float is 32 bit, why not simply interpret that as an unsigned? As long as you avoid NaNs, you *should* be fine... – fredoverflow Nov 21 '10 at 13:56
  • 2
    @FredOverflow: I was just guessing that grabbing the mantissa and exponent separately would produce less machine- and compiler-dependent results. I would still depend on the sizes of the mantissa and the exponent which might turn out to be just as compiler and machine dependent. – President James K. Polk Nov 21 '10 at 14:04
  • @FredOverflow making a use of an uninitialized variable wouldn't return 2 different results for the same value passed in parameter? And also, what does the second way do exactly..? – Pacane Nov 21 '10 at 17:33
  • 1
    @Pacane: Are you referring to the variable `u`? The union hack is based upon the assumption that `f` and `u` share the same memory, hence `u` is "initialized" by writing to `f`. Yes, this is highly implementation-specific, but it usually works. The second way is a reference-cast. It introduces another way to access the object `x` (of type float) as if it were an object of type unsigned. – fredoverflow Nov 21 '10 at 17:39
  • @FredOverflow About the reference-cast, is there some lost data in the process, or it just pads the number until it's unsigned? – Pacane Nov 21 '10 at 17:49
  • 2
    @Pacane: What do you mean by "pad"? There is no value conversion going on whatsoever, if that's what you think. For example, `hash(3.14f)` does not yield 3, but 1078523331, because both values are represented by the machine word `0x4048f5c3`. Of course this assumes that int and float both are 32 bit types, which is highly implementation-specific etc. (You can think of the reference cast as basically shorthand for `*(unsigned*)&x`.) – fredoverflow Nov 21 '10 at 17:54
5

You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...

Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.

An obvious case: -0.0 wont match 0.0 for example. *

Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).

Suggested steps for implementation:

  • filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
  • convert to an int of the same size
    (that is - use a union for example to represent the float as an int, not simply cast to an int).
  • re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.

*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).

Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).

ideasman42
  • 42,413
  • 44
  • 197
  • 320
  • The obvious case of “-0.0 wont match 0.0 for example” is the **only** example of a pair of floating-point values that are equal for `==` and have differing representations, so I am not sure why you make a generality out of it. Infinities certainly do not need to be filtered out. Some have (seriously) recommended to return a random integer for `hash(NaN)`, but it seems more sound to simply treat the use of `NaN` as key in a hashtable as an error: http://research.swtch.com/randhash – Pascal Cuoq Feb 16 '15 at 23:46
  • PS: the blog post I linked to was posted on April 1st. I didn't realize this because I read it from the archives. It may not be serious, but at the same time, a random result for hash(NaN) means the binding(s) with NaN as key are present in the hashtable and can be iterated on, so it is actually a good solution for some usecases. – Pascal Cuoq Feb 16 '15 at 23:49
  • @Pascal Cuoq - Exactly how you deal with `!finite` values is up to your own implementation, I'm simply stating you should be aware of them when hashing floats, and simply converting a float to an int as is suggested in other answers is overlooking rather a lot. re: `-0 vs 0` - there is `-nan` / `nan` but how to class these may depend on your own preference (you may want to ignore the sign of a `nan` as Python does). Updated the answer. – ideasman42 Feb 17 '15 at 01:41
  • Note, would **strongly** suggest NOT to return a random integer from `nan`. that was suggested as a joke for a reason. keep the hash function deterministic, or use some kind of error is more a implementation detail. – ideasman42 Feb 17 '15 at 05:26
  • It does not seem like a joke to me. “Deterministic” means that `x == y` implies `hash(x) == hash(y)`, which remains true for `NaN` and `NaN` even if `hash(NaN)` is defined as random. The reasons why it is desirable **not** to return the same value very time are much stronger than some misconception about “deterministic” and are explained in the blog post. – Pascal Cuoq Feb 17 '15 at 06:42
  • You could draw a distinction between *"numeric equality"* and *"value identity"* here, If you impliment a `set` which makes use of a hash, having `nan` show up multiple times isn't useful. Of course you could make some argument for it, but even then its possible `nan` will randomly give the same `hash` multiple times.... **dont do it**. At that point its a discussion between you and whoever uses the data structure. But as I said before, there is a reason this isnt common practice. – ideasman42 Feb 17 '15 at 19:33
1

If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher ( which is the main bias-test for non-crypto hash functions ). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.

I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.

The main state update function is very simple, and looks like:

function q( state, val, numerator, denominator ) {
  // Continued Fraction mixed with Egyptian fraction "Continued Egyptian Fraction"
  // with denominator = val + pos / state[1]
  state[0] += numerator / denominator;
  state[0] = 1.0 / state[0];

  // Standard Continued Fraction with a_i = val, b_i = (a_i-1) + i + 1
  state[1] += val;
  state[1] = numerator / state[1];
}

Anyway, you can get it on npm Or you can check out the github

Using is simple:

const tifu = require('tifuhash');

const message = 'The medium is the message.';
const number = 333333333;
const float = Math.PI;

console.log( tifu.hash( message ), 
  tifu.hash( number ),
  tifu.hash( float ),
tifu.hash( ) );

There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9

Side note: I think that in future using floating point,possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions in future. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise maybe they could be use to fingerprint the platforms they were calculated on.

authorized
  • 51
  • 1
  • 5
-1

Because of the IEEE byte ordering the Java Float.hashCode() and Double.hashCode() do not give good results. This problem is wellknown and can be adressed by this scrambler:

class HashScrambler {

    /**
     * https://sites.google.com/site/murmurhash/
     */
    static int murmur(int x) {
        x ^= x >> 13;
        x *= 0x5bd1e995;
        return x ^ (x >> 15);
    }

}

You then get a good hash function, which also allows you to use Float and Double in hash tables. But you need to write your own hash table that allows a custom hash function.

Since in a hash table you need also test for equality, you need an exact equality to make it work. Maybe the later is what President James K. Polk intends to adress?