24

I want to use a HashMap<f64, f64>, for saving the distances of a point with known x and key y to another point. f64 as value shouldn't matter here, the focus should be on key.

let mut map = HashMap<f64, f64>::new();
map.insert(0.4, f64::hypot(4.2, 50.0));
map.insert(1.8, f64::hypot(2.6, 50.0));
...
let a = map.get(&0.4).unwrap();

As f64 is neither Eq nor Hash, but only PartialEq, f64 is not sufficient as a key. I need to save the distances first, but also access the distances later by y. The type of y needs to be floating point precision, but if doesn't work with f64, I'll use an i64 with an known exponent.

I tried some hacks by using my own struct Dimension(f64) and then implementing Hash by converting the float into a String and then hashing it.

#[derive(PartialEq, Eq)]
struct DimensionKey(f64);

impl Hash for DimensionKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        format!("{}", self.0).hash(state);
    }
}

It seems very bad and both solutions, my own struct or float as integers with base and exponent seem to be pretty complicated for just a key.

Update: I can guarantee that my key never will be NaN, or an infinite value. Also, I won't calculate my keys, only iterating over them and using them. So there should no error with the known error with 0.1 + 0.2 ≠ 0.3. How to do a binary search on a Vec of floats? and this question have in common to implement total ordering and equality for a floating number, the difference lies only in the hashing or iterating.

Community
  • 1
  • 1
pixunil
  • 641
  • 1
  • 5
  • 19
  • 9
    Do you really need to fetch an object by an exact distance? Using a floating point number as a key is as much of a bad idea as testing two for equality (rounding errors do happen). – E_net4 Sep 22 '16 at 12:13
  • 1
    Duplicate of http://stackoverflow.com/q/28247990/155423 – Shepmaster Sep 22 '16 at 12:33
  • 3
    @Shepmaster: There might be the issue of `f64` not implementing `Eq` here, but I think the problem runs deeper => even if you rule out `NaN`, comparing two floats for equality is just asking for trouble. – Matthieu M. Sep 22 '16 at 14:02
  • Do you expect your keys will have any repeat values? Is it necessary that they get deduplicated by the hash map? – Veedrac Sep 22 '16 at 18:51

4 Answers4

15

Presented with no comment beyond read all the other comments and answers to understand why you probably don't want to do this:

use std::{collections::HashMap, hash};

#[derive(Debug, Copy, Clone)]
struct DontUseThisUnlessYouUnderstandTheDangers(f64);

impl DontUseThisUnlessYouUnderstandTheDangers {
    fn key(&self) -> u64 {
        self.0.to_bits()
    }
}

impl hash::Hash for DontUseThisUnlessYouUnderstandTheDangers {
    fn hash<H>(&self, state: &mut H)
    where
        H: hash::Hasher,
    {
        self.key().hash(state)
    }
}

impl PartialEq for DontUseThisUnlessYouUnderstandTheDangers {
    fn eq(&self, other: &DontUseThisUnlessYouUnderstandTheDangers) -> bool {
        self.key() == other.key()
    }
}

impl Eq for DontUseThisUnlessYouUnderstandTheDangers {}

fn main() {
    let a = DontUseThisUnlessYouUnderstandTheDangers(0.1);
    let b = DontUseThisUnlessYouUnderstandTheDangers(0.2);
    let c = DontUseThisUnlessYouUnderstandTheDangers(0.3);

    let mut map = HashMap::new();
    map.insert(a, 1);
    map.insert(b, 2);

    println!("{:?}", map.get(&a));
    println!("{:?}", map.get(&b));
    println!("{:?}", map.get(&c));
}

Basically, if you want to treat a f64 as a set of bits that have no meaning, well, we can treat them as an equivalently sized bag of bits that know how to be hashed and bitwise-compared.

Don't be surprised when one of the 16 million NaN values doesn't equal another one.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • 4
    Note that if the odd results with `NaN` are a problem, you can always filter them out in the constructor. – Veedrac Sep 22 '16 at 22:03
14

You could split the f64 into the integral and fractional part and store them in a struct in the following manner:

#[derive(Hash, Eq, PartialEq)]
struct Distance {
    integral: u64,
    fractional: u64
}

The rest is straightforward:

use std::collections::HashMap;

#[derive(Hash, Eq, PartialEq)]
struct Distance {
    integral: u64,
    fractional: u64
}

impl Distance {
    fn new(i: u64, f: u64) -> Distance {
        Distance {
            integral: i,
            fractional: f
        }
    }
}

fn main() {
    let mut map: HashMap<Distance, f64> = HashMap::new();

    map.insert(Distance::new(0, 4), f64::hypot(4.2, 50.0));
    map.insert(Distance::new(1, 8), f64::hypot(2.6, 50.0));

    assert_eq!(map.get(&Distance::new(0, 4)), Some(&f64::hypot(4.2, 50.0)));
}

Edit: As Veedrac said, a more general and efficient option would be to deconstruct the f64 into a mantissa-exponent-sign triplet. The function that can do this, integer_decode(), is deprecated in std, but it can be easily found in Rust GitHub.

The integer_decode() function can be defined as follows:

use std::mem;

fn integer_decode(val: f64) -> (u64, i16, i8) {
    let bits: u64 = unsafe { mem::transmute(val) };
    let sign: i8 = if bits >> 63 == 0 { 1 } else { -1 };
    let mut exponent: i16 = ((bits >> 52) & 0x7ff) as i16;
    let mantissa = if exponent == 0 {
        (bits & 0xfffffffffffff) << 1
    } else {
        (bits & 0xfffffffffffff) | 0x10000000000000
    };

    exponent -= 1023 + 52;
    (mantissa, exponent, sign)
}

The definition of Distance could then be:

#[derive(Hash, Eq, PartialEq)]
struct Distance((u64, i16, i8));

impl Distance {
    fn new(val: f64) -> Distance {
        Distance(integer_decode(val))
    }
}

This variant is also easier to use:

fn main() {
    let mut map: HashMap<Distance, f64> = HashMap::new();

    map.insert(Distance::new(0.4), f64::hypot(4.2, 50.0));
    map.insert(Distance::new(1.8), f64::hypot(2.6, 50.0));

    assert_eq!(map.get(&Distance::new(0.4)), Some(&f64::hypot(4.2, 50.0)));
}
ljedrz
  • 20,316
  • 4
  • 69
  • 97
  • This seems worse than a lossless (and more space-efficient) sign-exponent-mantissa triplet. – Veedrac Sep 22 '16 at 18:55
  • You're right, I totally forgot about that. I'll add this info. – ljedrz Sep 22 '16 at 19:07
  • 3
    Why would anyone use the "sign-exponent-mantissa" split version over simply `impl`-ing `Eq` and `Hash` for `Distance(f64)` when it suffers from the same issues that `f64` has (`0.3` != `0.1 + 0.2` whether it is in the triplet form or in `f64`)? – John Sep 22 '16 at 19:31
  • I'm not sure; if you tried that, you would get `error: no method named assert_receiver_is_total_eq found for type f64 in the current scope in this expansion of #[derive(Eq)]`. – ljedrz Sep 22 '16 at 19:40
  • I though by simple you meant a `derive`. Anyway it's a matter of preference (or profiling in case one option has better performance than the other) at this point, since the OP doesn't need any arithmetic. – ljedrz Sep 23 '16 at 06:31
  • 1
    @John MattieuM's answer involves rounding and imprecision, and involves arithmetic on every comparison. This one, in contrast, is lossless. – Veedrac Sep 23 '16 at 14:47
  • @Veedrac I don't understand what is so good about using a "lossless" alternative over simply using `f64` which is exactly identical! The whole point of _not_ using `f64` for hashing (or even simple equality check) is that `f64` is not "infinitely" precise to be useful when doing it... – John Sep 24 '16 at 10:17
  • 2
    @John The only reason an `f64` does not have a hash implementation is that `NaN` is not equal to itself, so cannot have a hash value. Using Shepmaster's solution instead of this is fine (though that one breaks the contract with `Hash`, and is harder to make safe), but I don't understand why people think rounding solves anything. Rounding without analysis of the domain just makes the problem worse. – Veedrac Sep 24 '16 at 16:52
  • @Veedrac Exactly! Shepmaster's solution with custom handling of `NaN` is easier to handle and has less overheads than what is suggested here. Blindly doing something without proper domain analysis will almost always make things worse (incl. blindly choosing not to round). – John Sep 25 '16 at 17:00
6

Unfortunately, floating types equality is hard and counter-intuitive:

fn main() {
    println!("{} {} {}", 0.1 + 0.2, 0.3, 0.1 + 0.2 == 0.3);
}

// Prints: 0.30000000000000004 0.3 false

And therefore hashing is hard too, since hashes of equal values should be equal.


If, in your case, you have a small enough range to fit your number in a i64 and you can accept the loss of precision, then a simple solution is to canonicalize first and then define equal/hash in terms of the canonical value:

use std::cmp::Eq;

#[derive(Debug)]
struct Distance(f64);

impl Distance {
    fn canonicalize(&self) -> i64 {
        (self.0 * 1024.0 * 1024.0).round() as i64
    }
}

impl PartialEq for Distance {
    fn eq(&self, other: &Distance) -> bool {
        self.canonicalize() == other.canonicalize()
    }
}

impl Eq for Distance {}

fn main() {
    let d = Distance(0.1 + 0.2);
    let e = Distance(0.3);

    println!("{:?} {:?} {:?}", d, e, d == e);
}

// Prints: Distance(0.30000000000000004) Distance(0.3) true

Hash just follows, and from then on you can use Distance as a key in the hash map:

impl Hash for Distance {
    fn hash<H>(&self, state: &mut H) where H: Hasher {
        self.canonicalize().hash(state);
    }
}

fn main() {
    let d = Distance(0.1 + 0.2);
    let e = Distance(0.3);

    let mut m = HashMap::new();
    m.insert(d, "Hello");

    println!("{:?}", m.get(&e));
}

// Prints: Some("Hello")

Warning: To reiterate, this strategy only works if (a) the dynamic range of values is small enough to be captured in a i64 (19 digits) and if (b) the dynamic range is known in advance as the factor is static. Fortunately, this holds for many common problems, but it is something to document and test...

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • 1
    It would be better to `canonicalize` to `f32` instead of multiplying by a constant and casting to an integer, since `1e-12` and `2e-15` will both map to `0` in the current scheme, but different values in `f32`. It takes care of the precision problem too since it the cast is only made during comparisons. – John Sep 22 '16 at 16:56
  • @John: maybe, maybe not. It all depends on what you want to consider equal. For a distance measure in meters, `1e-12` is 1 pico-meter: I would be very surprised if a difference of 1 pico-meter was relevant for any kind of geo-tracking (for example). It really is a domain modelling decision. If you wish to preserve more precision, then a hash-map look-up is flawed and you'll need something like bounding volumes, KD-Trees, etc... – Matthieu M. Sep 22 '16 at 18:51
  • 1
    I don't like this solution; it adds unnecessary imprecision and doesn't seem to map to any domain well. If rounding like this was sufficient, you should never have been using floats in the first place. – Veedrac Sep 22 '16 at 18:52
  • @Veedrac Well this is done only for the lookup. So I guess it depends on what exactly you want, as MatthieuM. said... (Taking `1024*1024` as just a placeholder indicating the range of interest.) – John Sep 22 '16 at 19:42
  • 3
    I do not see the relevance of the reasoning that if ``f32`` additions do not yield exact, expected results, hashing should be affected in any way. After all, there is no "law", which entails ``x1 + x2 == x3`` -> ``x1.hash() + x2.hash() == x3.hash()``. If rounding errors are a problem for the application, then don't use floating points, but do not state, that float values cannot be hashed because of rounding. Use bignums or whatever rust has to offer in that department. The only acceptable reasoning here is the ``NaN`` issue. – BitTickler May 10 '19 at 06:09
  • 3
    @BitTickler: Hashing floats is easy, you can just reinterpret them as integers and hash the integer. It satisfies the requirement that should two floats be equal their hash is equal (since `NaN` is not equal to anything). The point made here, though, is that this scheme relies on exact equality (bitwise equality) whereas by nature floating points have rounding errors and are thus equal *within a tolerance threshold*. And that's where hashing fails: it cannot handle this *tolerance threshold*. – Matthieu M. May 10 '19 at 07:25
6

You can use the ordered_float crate which does this for you.

optevo
  • 2,016
  • 20
  • 17