How to check if float can be exactly represented as an integer

Question

I'm looking to for a reasonably efficient way of determining if a floating point value (double) can be exactly represented by an integer data type (long, 64 bit).

My initial thought was to check the exponent to see if it was 0 (or more precisely 127). But that won't work because 2.0 would be e=1 m=1...

So basically, I am stuck. I have a feeling that I can do this with bit masks, but I'm just not getting my head around how to do that at this point.

So how can I check to see if a double is exactly representable as a long?

Thanks

You can extract the mantissa and the exponent parts, and see if (after removing _exponent + 1_ number of digits from the left of the mantissa, all the other bits are 0 (which means there is no decimal part). You'll have to handle negative exponents differently (any non-zero `double` with a negative exponent would be a fraction) — Seth Carnegie, Jan 18 '12 at 04:41
@dasblinkenlight no need, I think Mysticial's answer is better — Seth Carnegie, Jan 18 '12 at 04:47
The C/C++ standards specify a preprocessor macro / known value for this, `DBL_MANT_DIG` as the number of digits in the mantissa of a `double`. The unit of a "digit" as per the standard is `FLT_RADIX`; for "ordinary" IEEE754 `double` the radix is 2, and the mantissa has 53 such "digits" (aka ... bits). The maximally representative integer at full precision should therefore be `(FLT_RADIX << (DBL_MANT_DIG-1)) - 1`. — FrankH., Jul 30 '13 at 23:05
possible duplicate of [biggest integer that can be stored in a double](http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double) — FrankH., Jul 30 '13 at 23:12

Ambroz Bizjak · Answer 1 · 2013-07-23T22:51:42.047

I think I have found a way to clamp a double into an integer in a standard-conforming fashion (this is not really what the question is about, but it helps a lot). First, we need to see why the obvious code is not correct.

// INCORRECT CODE
uint64_t double_to_uint64 (double x)
{
    if (x < 0.0) {
        return 0;
    }
    if (x > UINT64_MAX) {
        return UINT64_MAX;
    }
    return x;
}

The problem here is that in the second comparison, UINT64_MAX is being implicitly converted to double. The C standard does not specify exactly how this conversion works, only that it is to be rounded up or down to a representable value. This means that the second comparison may be false, even if should mathematically be true (which can happen when UINT64_MAX is rounded up, and 'x' is mathematically between UINT64_MAX and (double)UINT64_MAX). As such, the conversion of double to uint64_t can result in undefined behavior in that edge case.

Surprisingly, the solution is very simple. Consider that while UINT64_MAX may not be exactly representable in a double, UINT64_MAX+1, being a power of two (and not too large), certainly is. So, if we first round the input to an integer, the comparison x > UINT64_MAX is equivalent to x >= UINT64_MAX+1, except for possible overflow in the addition. We can fix the overflow by using ldexp instead of adding one to UINT64_MAX. That being said, the following code should be correct.

/* Input: a double 'x', which must not be NaN.
 * Output: If 'x' is lesser than zero, then zero;
 *         otherwise, if 'x' is greater than UINT64_MAX, then UINT64_MAX;
 *         otherwise, 'x', rounded down to an integer.
 */
uint64_t double_to_uint64 (double x)
{
    assert(!isnan(x));
    double y = floor(x);
    if (y < 0.0) {
        return 0;
    }
    if (y >= ldexp(1.0, 64)) {
        return UINT64_MAX;
    }
    return y;
}

Now, to back to your question: is x is exactly representable in an uint64_t? Only if it was neither rounded nor clamped.

/* Input: a double 'x', which must not be NaN.
 * Output: If 'x' is exactly representable in an uint64_t,
 *         then 1, otherwise 0.
 */
int double_representable_in_uint64 (double x)
{
    assert(!isnan(x));
    return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}

The same algorithm can be used for integers of different size, and also for signed integers with a minor modification. The code that follows does some very basic tests of the uint32_t and uint64_t versions (only false positives can possibly be caught), but is also suitable for manual examination of the edge cases.

#include <inttypes.h>
#include <math.h>
#include <limits.h>
#include <assert.h>
#include <stdio.h>

uint32_t double_to_uint32 (double x)
{
    assert(!isnan(x));
    double y = floor(x);
    if (y < 0.0) {
        return 0;
    }
    if (y >= ldexp(1.0, 32)) {
        return UINT32_MAX;
    }
    return y;
}

uint64_t double_to_uint64 (double x)
{
    assert(!isnan(x));
    double y = floor(x);
    if (y < 0.0) {
        return 0;
    }
    if (y >= ldexp(1.0, 64)) {
        return UINT64_MAX;
    }
    return y;
}

int double_representable_in_uint32 (double x)
{
    assert(!isnan(x));
    return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 32));
}

int double_representable_in_uint64 (double x)
{
    assert(!isnan(x));
    return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}

int main ()
{
    {
        printf("Testing 32-bit\n");
        for (double x = 4294967295.999990; x < 4294967296.000017; x = nextafter(x, INFINITY)) {
            uint32_t y = double_to_uint32(x);
            int representable = double_representable_in_uint32(x);
            printf("%f -> %" PRIu32 " representable=%d\n", x, y, representable);
            assert(!representable || (double)(uint32_t)x == x);
        }
    }
    {
        printf("Testing 64-bit\n");
        double x = ldexp(1.0, 64) - 40000.0;
        for (double x = 18446744073709510656.0; x < 18446744073709629440.0; x = nextafter(x, INFINITY)) {
            uint64_t y = double_to_uint64(x);
            int representable = double_representable_in_uint64(x);
            printf("%f -> %" PRIu64 " representable=%d\n", x, y, representable);
            assert(!representable || (double)(uint64_t)x == x);
        }
    }
}

score 10 · Accepted Answer · edited May 23 '17 at 12:15

10

Here's one method that could work in most cases. I'm not sure if/how it will break if you give it NaN, INF, very large (overflow) numbers...
(Though I think they will all return false - not exactly representable.)

You could:

Cast it to an integer.
Cast it back to a floating-point.
Compare with original value.

Something like this:

double val = ... ;  //  Value

if ((double)(long long)val == val){
    //  Exactly representable
}

floor() and ceil() are also fair game (though they may fail if the value overflows an integer):

floor(val) == val
ceil(val) == val

And here's a messy bit-mask solution:
This uses union type-punning and assumes IEEE double-precision. Union type-punning is only valid in C99 TR2 and later.

int representable(double x){
    //  Handle corner cases:
    if (x == 0)
      return 1;

    //  -2^63 is representable as a signed 64-bit integer, but +2^63 is not.
    if (x == -9223372036854775808.)
      return 1;

    //  Warning: Union type-punning is only valid in C99 TR2 or later.
    union{
        double f;
        uint64_t i;
    } val;

    val.f = x;

    uint64_t exp = val.i & 0x7ff0000000000000ull;
    uint64_t man = val.i & 0x000fffffffffffffull;
    man |= 0x0010000000000000ull;  //  Implicit leading 1-bit.

    int shift = (exp >> 52) - 1075;
    //  Out of range
    if (shift < -52 || shift > 10)
        return 0;

    //  Test mantissa
    if (shift < 0){
        shift = -shift;
        return ((man >> shift) << shift) == man;
    }else{
        return ((man << shift) >> shift) == man;
    }
}

edited May 23 '17 at 12:15

Community

1
1

answered Jan 18 '12 at 04:41

Mysticial

464,885
45
335
332

1

Is there any chance the casts would be optimized out? Or can that not happen? – Dan Fego Jan 18 '12 at 04:47
@DanFego That would change the program. So no the compiler isn't allowed to optimize them out. – Mysticial Jan 18 '12 at 04:49
Perfect. I figured I was missing something completely. Thanks! – ircmaxell Jan 18 '12 at 04:59
1

With respect to the edit: why not use the std `ieee754_float` (that's included in `ieee754.h`) union to get the mantissa and exponents instead of the non-standard union cast that you did? – ircmaxell Jan 18 '12 at 05:23
1

@ircmaxell Because I didn't know that existed. :) Amazing how you learn things even when answering questions... EDIT: MSVC doesn't have that header... – Mysticial Jan 18 '12 at 05:24
1

What standard specifies `ieee754.h`? – R.. GitHub STOP HELPING ICE Jan 18 '12 at 05:48
@ircmaxwell: `ieee754.h` is not specified by any standard. It's a glibc-ism. – Stephen Canon Jan 29 '12 at 20:43
1

This is incorrect. Conversion of floating point numbers to integers is undefined behavior if the integral part cannot be represented in the integer type. See C99 standard 6.3.1.4p1. – Ambroz Bizjak May 24 '13 at 17:14
* That would be C11 standard not C99. – Ambroz Bizjak May 24 '13 at 17:20
@AmbrozBizjak That should only apply to the first solution with `(double)(long long)val == val`. I can add that disclaimer if needed. But AFAICT, the bitmask solution should be safe assuming IEEE FP since union type-punning is explicitly allowed by C11. – Mysticial May 24 '13 at 17:25
@Mysticial why bother with the negative shift? Would that not always be a number below 1 in magnitude so you could return right away? – Johannes Schaub - litb Sep 27 '15 at 21:24
@JohannesSchaub-litb Can you clarify? Are you talking about the first part of the if-statement? The `shift = -shift;`? – Mysticial Sep 28 '15 at 04:55
@Mysticial I assume the 1075 is bias that you get rid off (haven't looked it up), so shift==0 would mean multiplying with 2^0. If that's the case, and since a float's mantissa (in an integral range) is normalized, a float for a value >= 1.0 always has a shift>=0. Which is why does your `if` not look like `if(shift < 0 || shift > 63)`? – Johannes Schaub - litb Sep 28 '15 at 08:24
@JohannesSchaub-litb The shift is actually offset by 52. So `1.0` has a shift of `-52`. So anything less -52 would be less than 1. In that case, I can always return false except for zero. On that note, it looks like I overlooked zero when I wrote this snippet so I've fixed that. – Mysticial Sep 28 '15 at 14:41
Ah, I see what you did now. The test against `>63` is a bit confusing though, it suffices if it tests against `>11`, because of the bias and because the leading 1 would be shifted out of the uint64 anyway and compare unequal to `man`. In order to understand your approach, I compared it to two alternatives, see http://coliru.stacked-crooked.com/a/d913bc886e9e6d41 . I belive your code checks for compatibility with `unsigned long`, 64bit. But the question asked for compatibility with `long`, 64bit, so you should lower your exponent check. See the linked paste's output. – Johannes Schaub - litb Sep 28 '15 at 21:36
Can you also please explain why you did choose those particular numbers? What advantage does it have over subtracting the actual bias? – Johannes Schaub - litb Sep 28 '15 at 22:02
I'll fix this up when I get home tonight. Indeed this only checks if it fits in an unsigned 64-bit integer. IIRC, this was a quick-and-dirty implementation that wasn't particularly well-optimized. And I missed some corner cases. The idea was that I start with the original mantissa in the 64-bit integer. If the exponent is positive, I shift up and down. If the number doesn't fit in a 64-bit integer, the shift overflows and the compare fails. If the exponent is negative, it shifts down and up. If any lower-order bits are lost, the compare fails. – Mysticial Sep 28 '15 at 22:21
For the signed/unsigned thing, there's one corner case that needs to be dealt with. -2^63 fits, but +2^63 does not. – Mysticial Sep 28 '15 at 22:27
Using union like this invokes unspecified behaviour (C11 6.2.6.1.7), making it not portable. One portable way of doing that kind of conversion is through `memcpy()` to another variable. But then again, most compilers specify the behaviour that is assumed here; for example, [here's where GCC specifies it](https://gcc.gnu.org/onlinedocs/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html#Structures-unions-enumerations-and-bit-fields-implementation). – Pedro Gimeno Apr 09 '19 at 15:10

score 4 · Answer 3 · answered Jan 15 '13 at 08:36

You can use the modf function to split a float into the integer and fraction parts. modf is in the standard C library.

#include <math.h>
#include <limits.h>   

double val = ...
double i;
long l;

/* check if fractional part is 0 */
if (modf(val, &i) == 0.0) {
    /* val is an integer. check if it can be stored in a long */
    if (val >= LONG_MIN && val <= LONG_MAX) {
        /* can be exactly represented by a long */
        l = val;
    }
}

score 1 · Answer 4 · edited Jun 20 '20 at 09:12

How to check if float can be exactly represented as an integer ?

I'm looking to for a reasonably efficient way of determining if a floating point value double can be exactly represented by an integer data type long, 64 bit.

Range (LONG_MIN, LONG_MAX) and fraction (frexp()) tests needed. Also need to watch out for not-a-numbers.

The usual idea is to test like (double)(long)x == x, but to avoid its direct usage. (long)x, when x is out of range, is undefined behavior (UB).

The valid range of conversion for (long)x is LONG_MIN - 1 < x < LONG_MAX + 1 as code discards any fractional part of x during the conversion. So code needs to test, using FP math, if x is in range.

#include <limits.h>
#include <stdbool.h>
#define DBL_LONG_MAXP1 (2.0*(LONG_MAX/2+1)) 
#define DBL_LONG_MINM1 (2.0*(LONG_MIN/2-1)) 

bool double_to_long_exact_possible(double x) {
  if (x < DBL_LONG_MAXP1) {
    double whole_number_part;
    if (frexp(x, &whole_number_part) != 0.0) {
      return false;  // Fractional part exist.
    }
    #if -LONG_MAX == LONG_MIN
    // rare non-2's complement machine 
    return x > DBL_LONG_MINM1;
    #else
    return x - LONG_MIN > -1.0;
    #endif 
  }
  return false;  // Too large or NaN
}

score 0 · Answer 5 · answered Sep 05 '13 at 02:52

Any IEEE floating-point double or float value with a magnitude at or above 2^52 or 2^23 will be whole number. Adding 2^52 or 2^23 to a positive number whose magnitude is less than that will cause it to be rounded to a whole number. Subtracting the value that was added will yield a whole number which will equal the original iff the original was a whole number. Note that this algorithm will fail with some numbers larger than 2^52, but it isn't needed for numbers that big.

score -1 · Answer 6 · answered Jan 18 '12 at 04:49

-1

Could you use the modulus operator to check if the double is divisible by one... or am I completely misunderstanding the question?

double val = ... ;  //  Value

if(val % 1 == 0) {
    // Val is evenly divisible by 1 and is therefore a whole number
}

answered Jan 18 '12 at 04:49

pantryfight

338
1
3
13

2

**R** % 1 == 0 is always true. – Jan 18 '12 at 04:54
2

I don't think `double` has a % operator does it? `double` doesn't have an `operator%` – Seth Carnegie Jan 18 '12 at 04:55

How to check if float can be exactly represented as an integer

6 Answers6

Linked