Compare a 32 bit float and a 32 bit integer without casting to double, when either value could be too large to fit the other type exactly

Question

I have a 32 bit floating point number f (known to be positive) that I need to convert to a 32 bit unsigned integer. Its magnitude might be too large to fit. Furthermore, there is downstream computation that requires some headroom. I can compute the maximum acceptable value m as a 32 bit integer. How do I efficiently determine, in C++11 on a constrained 32 bit machine (ARM M4F), whether f <= m holds mathematically? Note that the types of the two values don't match. The following three approaches each have their issues:

- `static_cast<uint32_t>(f) <= m`: I think this triggers undefined behaviour if f doesn't fit the 32 bit integer
- `f <= static_cast<float>(m)`: if m is too large to be converted exactly, the converted value could be larger than m such that the subsequent comparison will produce the wrong result in certain edge cases
- `static_cast<double>(f) <= static_cast<double>(m)`: is mathematically correct, but requires casting to, and working with double, which I'd like to avoid for efficiency reasons
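To make the edge case in the second approach concrete, here is a minimal sketch. It assumes IEEE 754 binary32 floats and the default round-to-nearest mode; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Second approach from the list above: convert m to float, then compare.
// The conversion may round m up, which can make the comparison succeed
// even though f is mathematically larger than m.
inline bool naive_float_compare(float f, std::uint32_t m) {
    return f <= static_cast<float>(m);
}

// Third approach from the list above: both a 32 bit float and a 32 bit
// integer convert to double without rounding, so this is used here as
// the reference result.
inline bool reference_double_compare(float f, std::uint32_t m) {
    return static_cast<double>(f) <= static_cast<double>(m);
}
```

With m = 4294967295 (2³² - 1) and f = 4294967296.0f (2³²), `naive_float_compare` reports true because m rounds up to 2³² during conversion, while `reference_double_compare` reports false. The same happens for m = 16777219 and f = 16777220.0f.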

Surely there must be a way to convert an integer to a float directly with specified rounding direction, i.e. guaranteeing the result not to exceed the input in magnitude. I'd prefer a C++11 standard solution, but in the worst case platform intrinsics could qualify as well.

you should only need to look at the exponent to see if the mantissa would slide off the top of a 32 bit integer. so shift, mask and do a greater than or less than compare. — old_timer, May 09 '17 at 12:59
@old_timer: In fact, unsigned integer comparisons provide the correct ordering when applied to the binary representation of positive IEEE 754 floats, even when taking into account the mantissa. So neither the shift nor the mask operation would be necessary. The tricky part is to find the correct exponent that corresponds to a runtime unsigned integer value `m`, which is the gist of int-to-float conversion. — burnpanck, May 10 '17 at 14:09
if you get a clean binary conversion from float to int without it doing a numerical conversion, sure you could do two or a few comparisons to isolate the exponent — old_timer, May 10 '17 at 17:13
possible duplicate of [Comparing uint64_t and float for numeric equivalence](https://stackoverflow.com/q/32810583/995714), [How to properly compare an integer and a floating-point value?](https://stackoverflow.com/q/58734034/995714) — phuclv, Nov 19 '19 at 13:07
Does this answer your question? [How to properly compare an integer and a floating-point value?](https://stackoverflow.com/questions/58734034/how-to-properly-compare-an-integer-and-a-floating-point-value) — phuclv, Nov 19 '19 at 13:07
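The bit-pattern ordering described in the comments above can be illustrated with a minimal sketch. It assumes IEEE 754 binary32 floats that are non-negative and not NaN; `float_bits` and `less_equal_by_bits` are hypothetical helper names, not part of any library.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the raw bits of a float into a uint32_t.
// memcpy avoids the aliasing problems of a pointer cast.
inline std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32 bit float");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// For non-negative, non-NaN IEEE 754 floats, the exponent sits above the
// mantissa in the bit pattern, so comparing the patterns as unsigned
// integers orders the values the same way as comparing the floats.
inline bool less_equal_by_bits(float a, float b) {
    return float_bits(a) <= float_bits(b);
}
```

This only orders two floats against each other. Comparing against an integer `m` would still need the bit pattern of a float derived from `m`, which is the conversion step the comment calls the tricky part.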

Martin Bonner supports Monica · Accepted Answer · 2017-05-15T08:33:41.350

4

I think your best bet is to be a bit platform specific. 2³² can be represented precisely in floating point. Check if f is too large to fit at all, and then convert to unsigned and check against m.

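const float unsigned_limit = 4294967296.0f;  // 2^32, exactly representable as a float
bool ok = false;
if (f < unsigned_limit)  // f fits, so the conversion below is well defined
{
    const auto uf = static_cast<unsigned int>(f);  // truncates toward zero
    if (uf <= m)
    {
        ok = true;
    }
}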

I'm not fond of needing two comparisons, but it's clear.
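Because the conversion truncates toward zero, the check above tests whether the converted value is at most m. That is what matters for the subsequent conversion, but it differs from a strict f <= m when f has a fractional part (for example f = 5.5 and m = 5). The following is a sketch of both variants as self-contained functions, assuming f is non-negative; the function names are hypothetical.

```cpp
#include <cstdint>

// Same logic as the snippet above: true if trunc(f) <= m.
// Assumes f >= 0. NaN and +infinity fail the range check and yield false.
inline bool truncated_le(float f, std::uint32_t m) {
    const float unsigned_limit = 4294967296.0f;  // 2^32, exact in float
    if (!(f < unsigned_limit)) return false;
    return static_cast<std::uint32_t>(f) <= m;   // defined: 0 <= f < 2^32
}

// Strict variant: true only if f <= m mathematically.
inline bool strict_le(float f, std::uint32_t m) {
    const float unsigned_limit = 4294967296.0f;
    if (!(f < unsigned_limit)) return false;
    const std::uint32_t uf = static_cast<std::uint32_t>(f);
    if (uf != m) return uf < m;
    // Here trunc(f) == m, so f lies in [m, m + 1) and f <= m holds only
    // if f has no fractional part. Converting uf back is exact, because
    // uf is the truncation of a float below 2^32.
    return static_cast<float>(uf) == f;
}
```

For f = 5.5f and m = 5, `truncated_le` returns true while `strict_le` returns false; for integer-valued f the two agree.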

If f is usually significantly less than m (or usually significantly greater), one can test against float(m)*0.99f (respectively float(m)*1.01f), and then do the exact comparison in the unusual case. That is probably only worth doing if profiling shows that the performance gain is worth the extra complexity.
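A sketch of this two-stage idea follows, under the assumption that f is non-negative. The function name is hypothetical, and the exact fallback shown here checks f <= m strictly (including any fractional part of f) so that both stages agree. The 1% margins are far larger than the rounding error of the int-to-float conversion, so the fast paths cannot give a wrong answer.

```cpp
#include <cstdint>

// Hypothetical helper: cheap float-only tests first, exact test only
// when f is within about 1% of m.
inline bool le_with_prefilter(float f, std::uint32_t m) {
    const float mf = static_cast<float>(m);  // may be rounded
    if (f <= mf * 0.99f) return true;        // clearly below m
    if (f > mf * 1.01f) return false;        // clearly above m

    // Unusual case: f is close to m, so compare exactly.
    const float unsigned_limit = 4294967296.0f;  // 2^32, exact in float
    if (!(f < unsigned_limit)) return false;     // also rejects NaN
    const std::uint32_t uf = static_cast<std::uint32_t>(f);
    if (uf != m) return uf < m;
    return static_cast<float>(uf) == f;  // equal integer parts: f must be whole
}
```

Whether the extra branches pay off depends on the distribution of f relative to m, which only profiling on the target can show.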

edited May 15 '17 at 08:33

answered May 09 '17 at 06:48


Indeed, the double comparison is a bit disappointing, but probably as efficient as possible. I'll wait a bit longer before accepting, in case there's another ingenious solution hiding somewhere... – burnpanck May 09 '17 at 07:28
A 32-bit floating point number usually has only 24 bits of precision. 2³² can be represented exactly, but 2³²-1 cannot. – CAF May 09 '17 at 09:53
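The 24-bit precision noted in the comment above also suggests a way to convert m to float with rounding toward zero, as the question asks. This is a sketch of a different technique from the accepted answer, not something taken from it: clear the low bits of m until at most 24 significant bits remain, so that the conversion is exact. Since f is itself a float, f <= m holds exactly when f is at most the largest float not exceeding m. The function names are hypothetical, and IEEE 754 binary32 is assumed.

```cpp
#include <cstdint>

// Hypothetical helper: largest float that does not exceed m.
inline float to_float_round_down(std::uint32_t m) {
    int shift = 0;
    // Count how many low bits must go so that 24 significant bits remain.
    // The loop runs at most 8 times for a 32 bit value.
    while ((m >> shift) >> 24) ++shift;
    const std::uint32_t t = (m >> shift) << shift;  // low bits cleared
    return static_cast<float>(t);  // exact: t fits in a 24 bit significand
}

// f <= m mathematically, using only float comparisons.
inline bool le_round_down(float f, std::uint32_t m) {
    return f <= to_float_round_down(m);
}
```

On GCC or Clang the loop could be replaced by a count-leading-zeros builtin such as `__builtin_clz`, which would be a platform-specific choice rather than standard C++11.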
