How to determine the rounding direction when casting a double to a float?

Question

I'm looking for an algorithm to determine the rounding direction when casting an arbitrary 64-bit double to a 32-bit float. My specific use-case for this check is to cast a 64-bit double to 32-bit float with rounding toward infinity.

At first, I applied the following criteria to check for bits in the truncated portion of the mantissa. If the truncated portion of the mantissa is nonzero, then the cast must have rounded down!

const F64_MASK: u64 = (1 << 29) - 1;
fn is_rounded_up_when_casted(v: f64) -> bool {
    v.to_bits() & F64_MASK > 0
}

This criteria doesn't, however, identify all cases-- the last three bits of the exponent are also truncated. I tried modifying the mask to check these exponent bits as well:

const F64_MASK: u64 = (1u64 << 55) - (1 << 52) + (1 << 29) - 1;

Unfortunately this check doesn't work. For example the number 1.401298464324817e−45 has an exponent in which the three truncated bits are 010 and yet is still exactly represented in float/f32.

EDIT: I don't think it can be said that a nonzero mantissa means positive rounding. I think I need a different approach. I think the exponent just increases the range of numbers, so that can be handled with some separate checks. The rounding direction may just be a function of the leading bit of the truncated portion of the mantissa?

This looks like something that you might find useful: https://stackoverflow.com/a/42541830/6304086 — Hadus, Jul 08 '21 at 00:48
Could you explain how `v as f32 as f64 == v` is not sufficient? You say "the cast to a float rounds", but surely that's exactly what casting back and comparing handles? — Kevin Reid, Jul 08 '21 at 02:40
@KevinReid: I re-read my question more closely. I made a mistake when isolating the question for SO. I'm sorry. My motivation is to cast with rounding towards infinity. My initial approach was `if v as f32 as f64 == v { v as f32 } else {f32::from_bits((v as f32).to_bits() + 1)}`. For a value like `1.9999999999999998f64`, the bit-increment-branch is taken. In the context of my original question, the check works. In the intended context, the check is a "false-positive" in that it chooses the bit-increment-path. "How to tell if a double was rounded up while casting to float?" is more accurate. — Shoeboxam, Jul 08 '21 at 03:53
Can't you just call [`fesetround`](https://linux.die.net/man/3/fesetround) to set the rounding mode you want? (or `fegetround` to get the current mode and act accordingly) — Jmb, Jul 08 '21 at 07:06
@Jmb: At least from rust, fesetround is pretty unreliable. LLVM has an open issue here where adjusting the rounding mode can cause potentially UB: https://bugs.llvm.org//show_bug.cgi?id=8100. — Shoeboxam, Jul 08 '21 at 17:23
LLVM 13 will have the [`llvm.flt.rounds` and `llvm.set.rounding` intrinsics](https://llvm.org/docs/LangRef.html#floating-point-environment-manipulation-intrinsics). I doubt that this functionality will be exposed in safe Rust though. — HHK, Jul 09 '21 at 18:57

Aplet123 · Answer 1 · 2021-07-08T12:30:31.783

The edge case that you found has to do with the fact that subnormal f32 values can actually represent exponents less than their typical minimum. I've written a function that I believe covers all the edge cases:

const F64_MANTISSA_SIZE: u64 = 52;
const F64_MANTISSA_MASK: u64 = (1 << F64_MANTISSA_SIZE) - 1;
const F64_EXPONENT_SIZE: u64 = 64 - F64_MANTISSA_SIZE - 1;
const F64_EXPONENT_MASK: u64 = (1 << F64_EXPONENT_SIZE) - 1; // shift away the mantissa first
const F32_MANTISSA_SIZE: u64 = 23;
const F64_TRUNCATED_MANTISSA_SIZE: u64 = F64_MANTISSA_SIZE - F32_MANTISSA_SIZE;
const F64_TRUNCATED_MANTISSA_MASK: u64 = (1 << F64_TRUNCATED_MANTISSA_SIZE) - 1;

fn is_exactly_representable_as_f32(v: f64) -> bool {
    let bits = v.to_bits();
    let mantissa = bits & F64_MANTISSA_MASK;
    let exponent = (bits >> F64_MANTISSA_SIZE) & F64_EXPONENT_MASK;
    let _sign = bits >> (F64_MANTISSA_SIZE + F64_EXPONENT_SIZE) != 0;
    if exponent == 0 {
        // if mantissa == 0, the float is 0 or -0, which is representable
        // if mantissa != 0, it's a subnormal, which is never representable
        return mantissa == 0;
    }
    if exponent == F64_EXPONENT_MASK {
        // either infinity or nan, all of which are representable
        return true;
    }
    // remember to subtract the bias
    let computed_exponent = exponent as i64 - 1023;
    // -126 and 127 are the min and max value for a standard f32 exponent
    if (-126..=127).contains(&computed_exponent) {
        // at this point, it's only exactly representable if the truncated mantissa is all zero
        return mantissa & F64_TRUNCATED_MANTISSA_MASK == 0;
    }
    // exponents less than 2**(-126) may be representable by f32 subnormals
    if computed_exponent < -126 {
        // this is the number of leading zeroes that need to be in the f32 mantissa
        let diff = -127 - computed_exponent;
        // this is the number of bits in the mantissa that must be preserved (essentially mantissa with trailing zeroes trimmed off)
        let mantissa_bits = F64_MANTISSA_SIZE - (mantissa.trailing_zeros() as u64).min(F64_MANTISSA_SIZE) + 1;
        // the leading zeroes + essential mantissa bits must be able to fit in the smaller mantissa size
        return diff as u64 + mantissa_bits <= F32_MANTISSA_SIZE;
    }
    // the exponent is >127 so f32s can't go that high
    return false;
}

(it seems my browser still had the pre-edit version cached so my answer is only applicable to the original version of the question, I'll see if I can figure something out for the new version) — Aplet123, Jul 08 '21 at 12:18
I updated my partial answer with a simpler approach. I think I just needed a fresh brain. Thanks for this answer, I learned a lot from it. — Shoeboxam, Jul 10 '21 at 20:26

Shoeboxam · Accepted Answer · 2021-07-10T20:32:08.317

No need for mangling bits:

#[derive(PartialEq, std::fmt::Debug)]
enum Direction { Equal, Up, Down }
fn get_rounding_direction(v: f64) -> Direction {
    match v.partial_cmp(&(v as f32 as f64)) {
        Some(Ordering::Greater) => Direction::Down,
        Some(Ordering::Less) => Direction::Up,
        _ => Direction::Equal
    }
}

And some tests to check correctness.

#[cfg(test)]
#[test]
fn test_get_rounding_direction() {
    // check that the f64 one step below 2 casts to exactly 2
    assert_eq!(get_rounding_direction(1.9999999999999998), Direction::Up);

    // check edge cases
    assert_eq!(get_rounding_direction(f64::NAN), Direction::Equal);
    assert_eq!(get_rounding_direction(f64::NEG_INFINITY), Direction::Equal);
    assert_eq!(get_rounding_direction(f64::MIN), Direction::Down);
    assert_eq!(get_rounding_direction(-f64::MIN_POSITIVE), Direction::Up);
    assert_eq!(get_rounding_direction(-0.), Direction::Equal);
    assert_eq!(get_rounding_direction(0.), Direction::Equal);
    assert_eq!(get_rounding_direction(f64::MIN_POSITIVE), Direction::Down);
    assert_eq!(get_rounding_direction(f64::MAX), Direction::Up);
    assert_eq!(get_rounding_direction(f64::INFINITY), Direction::Equal);

    // for all other f32
    for u32_bits in 1..f32::INFINITY.to_bits() - 1 {
        let f64_value = f32::from_bits(u32_bits) as f64;
        let u64_bits = f64_value.to_bits();

        if u32_bits % 100_000_000 == 0 {
            println!("checkpoint every 600 million tests: {}", f64_value);
        }
        // check that the f64 equivalent to the current f32 casts to a value that is equivalent
        assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
        // check that the f64 one step below the f64 equivalent to the current f32 casts to a value that is one step greater
        assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
        // check that the f64 one step above the f64 equivalent to the current f32 casts to a value that is one step less
        assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);

        // same checks for negative numbers
        let u64_bits = (-f64_value).to_bits();
        assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
        assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);
        assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
    }
}

To specifically cast with rounding towards infinity:

fn cast_toward_inf(vf64: f64) -> f32 {
    let vf32 = vf64 as f32;
    if vf64 > vf32 as f64 { f32::from_bits(vf32.to_bits() + 1) } else { vf32 }
}

It's possible to determine the rounding primarily from the 28th bit (first bit of the truncated portion of the mantissa), but handling edge cases introduces significant complexity.

How to determine the rounding direction when casting a double to a float?

2 Answers2