How to test for lossless double / integer conversion?

Question

I have one double, and one int64_t. I want to know if they hold exactly the same value, and if converting one type into the other does not lose any information.

My current implementation is the following:

int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < INT64_MAX)
        && (round(d) == d)
        && (i == (int64_t)d);
}

My question is: is this implementation correct? And if not, what would be a correct answer? To be correct, it must leave no false positive, and no false negative.

Some sample inputs:

int64EqualsDouble(0, 0.0) should return 1
int64EqualsDouble(1, 1.0) should return 1
int64EqualsDouble(0x3FFFFFFFFFFFFFFF, (double)0x3FFFFFFFFFFFFFFF) should return 0, because 2^62 - 1 can be exactly represented with int64_t, but not with double.
int64EqualsDouble(0x4000000000000000, (double)0x4000000000000000) should return 1, because 2^62 can be exactly represented in both int64_t and double.
int64EqualsDouble(INT64_MAX, (double)INT64_MAX) should return 0, because INT64_MAX can not be exactly represented as a double
int64EqualsDouble(..., 1.0e100) should return 0, because 1.0e100 can not be exactly represented as an int64_t.

Your code doesn't seem unreasonable to me. Assuming a sane floating-point environment anyway, in the absence of `__STDC_IEC_559__` guarantees all bets are off and you'll probably need to resort to platform-specific hacks. — doynax, Nov 15 '15 at 13:23
You could probably use `frexp, frexpf, frexpl` functions that deconstruct floating-point number to fractional and integral components then compare the integral component and zero exponen. — mbaitoff, Nov 17 '15 at 10:40
tangential question: the specification of `round` includes "The round functions round their argument to the nearest integer value". What happens for numbers which aren't integers but the nearest integer value is not exactly representible? (Are there any such values?) — M.M, Nov 18 '15 at 03:17
@M.M There are none, as long a `x` is finite. "integers" for `round()` refers to the mathematical sense, not the members of some integer type like `intmax_t`. — chux - Reinstate Monica, Nov 18 '15 at 05:29
@chux I mean, maybe there is a situation such as the exactly representible numbers going 100001.34, 100002.56, 100003.78 , and if so, what does `round(100002.56)` give — M.M, Nov 18 '15 at 09:51
@M.M Finite floating point numbers _always_ have magnitudes values like `a(i)*pow(FLT_RADIX,i) + a(i-1)*pow(FLT_RADIX,i-1) + a(i-2)*pow(FLT_RADIX,i-2) ...` where coefficient `a(j)` is in the integer range [0...FLT_RADIX-1]. To "floor" such a number to an integer, drop all terms where `i-j` is negative. To "ceiling" any of these numbers with a negative `i-j` term can have 1 added to to the "floor" value. Numbers without a negative `i-j` are all ready integers. IOWs "exactly representable FP numbers going 100001.34, 100002.56, 100003.78" cannot exist as they do not fit floating point form. — chux - Reinstate Monica, Nov 18 '15 at 14:31
I would suggest to use this instead: `return (double)i == d && i == (int64_t)d;` in precisely this order. If the first condition succeeds, it is certain that `d` can actually be represented as an `uint64_t`. However, it may have been compared to a rounded `i`, so that first condition is not sufficient. That's what the second condition is there to check. Since `d` has been proven to be exactly representable, the cast is lossless, and the result of the comparison true to the exact values of both arguments. — cmaster - reinstate monica, Nov 23 '16 at 12:02
int32 to double conversion is always [lossless](https://stackoverflow.com/q/63244349/5447906). But, of course, it is not true for int64. — anton_rh, Aug 10 '20 at 04:53

Pascal Cuoq · Accepted Answer · 2016-11-23T11:51:09.050

16

Yes, your solution works correctly because it was designed to do so, because int64_t is represented in two's complement by definition (C99 7.18.1.1:1), on platforms that use something resembling binary IEEE 754 double-precision for the double type. It is basically the same as this one.

Under these conditions:

d < INT64_MAX is correct because it is equivalent to d < (double) INT64_MAX and in the conversion to double, the number INT64_MAX, equal to 0x7fffffffffffffff, rounds up. Thus you want d to be strictly less than the resulting double to avoid triggering UB when executing (int64_t)d.
On the other hand, INT64_MIN, being -0x8000000000000000, is exactly representable, meaning that a double that is equal to (double)INT64_MIN can be equal to some int64_t and should not be excluded (and such a double can be converted to int64_t without triggering undefined behavior)

It goes without saying that since we have specifically used the assumptions about 2's complement for integers and binary floating-point, the correctness of the code is not guaranteed by this reasoning on platforms that differ. Take a platform with binary 64-bit floating-point and a 64-bit 1's complement integer type T. On that platform T_MIN is -0x7fffffffffffffff. The conversion to double of that number rounds down, resulting in -0x1.0p63. On that platform, using your program as it is written, using -0x1.0p63 for d makes the first three conditions true, resulting in undefined behavior in (T)d, because overflow in the conversion from integer to floating-point is undefined behavior.

If you have access to full IEEE 754 features, there is a shorter solution:

#include <fenv.h>
…
#pragma STDC FENV_ACCESS ON
feclearexcept(FE_INEXACT), f == i && !fetestexcept(FE_INEXACT)

This solution takes advantage of the conversion from integer to floating-point setting the INEXACT flag iff the conversion is inexact (that is, if i is not representable exactly as a double).

The INEXACT flag remains unset and f is equal to (double)i if and only if f and i represent the same mathematical value in their respective types.

This approach requires the compiler to have been warned that the code accesses the FPU's state, normally with #pragma STDC FENV_ACCESS on but that's typically not supported and you have to use a compilation flag instead.

edited Nov 23 '16 at 11:51

answered Nov 17 '15 at 10:15

Pascal Cuoq

79,187
7
161
281

You felt the need to add the "on platforms that use 2's complement for integers and something resembling binary IEEE 754 for floating-point" precision on your initial answer. So... It looks like you're not quite as sure as you sound, or that my question is badly written. – Gwendal Roué Nov 17 '15 at 10:36
The context is SQlite, BTW. This database may store numerical values as doubles, or int64. My own users may give me doubles or int64. The question I want to answer is the following: "Given the value that I know is already stored in the database, and this new value you give me in, is an UPDATE statement actually necessary in order to persist the new value?". Int/int comparison is trivial, so is double/double. Remains the Double/Int comparison :-) – Gwendal Roué Nov 17 '15 at 10:41
Can you explain why would integer implementation mattered here. Because of negative zero? Isn't -0 == 0.0 true? – this Nov 17 '15 at 12:39
1

1/ I am sure. It is my job to approach the properties that programs have or do not have as theorems, and theorems have hypotheses. If the platform is not two's complement, then your first two conditions may be subtly wrong, and if it's not ieee 754-style binary floating-point, all bets are off. Stating hypotheses doesn't make one unsure, and if you think the correlation goes this way I have a bridge to sell you. – Pascal Cuoq Nov 17 '15 at 14:20
@this `d < INT64_MAX` is correct because it is equivalent to `d < (double) INT64_MAX` and that in the conversion to `double`, **assuming 2's complement**, the number `INT64_MAX` rounds **up** in that conversion. Thus you want `d` to be strictly less than the rsulting double to avoid triggering UB on `(int64_t)d`. – Pascal Cuoq Nov 17 '15 at 14:27
@this OTOH, still assuming 2's complement, `INT64_MIN` is exactly representable, meaning that a double that is equal to `(double)INT64_MIN` can be equal to an `int64_t` and should not be excluded (and can be converted to `int64_t` without triggering UB). – Pascal Cuoq Nov 17 '15 at 14:29
@this All this reasoning assumes binary IEEE 754-like floating-point in addition to 2's complement for integers because otherwise `INT64_MAX` may not round up when converted and `INT64_MIN` may not be exactly representable. – Pascal Cuoq Nov 17 '15 at 14:31
@PascalCuoq I see. Thank you. – this Nov 17 '15 at 14:36
@GwendalRoué One thing that's not clear about your question is whether you want to know if it's correct on any platform (and in this case the answer is no: take a platform with binary 64-bit floating-point and 1's complement, and your program triggers UB in `(int64_t)d` each time `d` has been assigned with `d = (double) INT64_MIN;` before invoking your code) or whether it's correct on the usual 2's complement IEEE 754 floating-point platform (yes). – Pascal Cuoq Nov 17 '15 at 14:46
@PascalCuoq: Yes, I wanted to avoid overflow and this is why I used < INT64_MAX and >= INT64_MIN. Thanks for the explanations on those conditions, and their implicit reliance on 2's complement, which I was not getting well. – Gwendal Roué Nov 17 '15 at 15:03
@GwendalRoué What is “typically not supported” is `#pragma STDC FENV_ACCESS`, but compilers usually provide other means to write code that accesses the FPU state, so that the qualification “typically not supported” does not apply to the solution. – Pascal Cuoq Nov 17 '15 at 15:06
@PascalCuoq I forgot that intN_t types are defined to be two's complement anyway. – this Nov 17 '15 at 15:36
The answer makes the reasonable assumption the the precision of `double` is less than about 64-bit. Yet is `double` had 64-ish bits of precision, code should be `&& (d <= INT64_MAX)` `<=` vs `<`. Hmmm see something like this mentioned in [comment](http://stackoverflow.com/questions/33719132/how-to-test-for-lossless-double-integer-conversion#comment55286426_33754329) – chux - Reinstate Monica Nov 17 '15 at 17:53
@chux I have made the reference to IEEE 754 double-precision more explicit, to avoid saying something false. I hope that a reader that would have a `double` type that could represent exactly `INT64_MIN` and `INT64_MAX` would infer from the reasoning that then `<=` is appropriate for both tests. – Pascal Cuoq Nov 17 '15 at 18:46
@Olaf : I have evidence for mandatory two's complement encoding of int64_t at https://en.wikibooks.org/wiki/C_Programming/C_Reference/stdint.h – Gwendal Roué Nov 17 '15 at 19:27

chux - Reinstate Monica · Answer 2 · 2015-11-18T14:13:21.043

OP's code has a dependency that can be avoided.

For a successful compare, d must be a whole number and round(d) == d takes care of that. Even d, as a NaN would fail that.

d must be mathematically in the range of [INT64_MIN ... INT64_MAX] and if the if conditions properly insure that, then the final i == (int64_t)d completes the test.

So the question comes down to comparing INT64 limits with the double d.

Let us assume FLT_RADIX == 2, but not necessarily IEEE 754 binary64.

d >= INT64_MIN is not a problem as -INT64_MIN is a power of 2 and exactly converts to a double of the same value, so the >= is exact.

Code would like to do the mathematical d <= INT64_MAX, but that may not work and so a problem. INT64_MAX is a "power of 2 - 1" and may not convert exactly - it depends on if the precision of the double exceeds 63 bits - rendering the compare unclear. A solution is to halve the comparison. d/2 suffers no precision loss and INT64_MAX/2 + 1 converts exactly to a double power-of-2

d/2 < (INT64_MAX/2 + 1)

[Edit]

// or simply
d < ((double)(INT64_MAX/2 + 1))*2

Thus if code does not want to rely on the double having less precision than uint64_t. (Something that likely applies with long double) a more portable solution would be

int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < ((double)(INT64_MAX/2 + 1))*2)  // (d/2 < (INT64_MAX/2 + 1))
        && (round(d) == d)
        && (i == (int64_t)d);
}

Note: No rounding mode issues.

[Edit] Deeper limit explanation

Insuring mathematically, INT64_MIN <= d <= INT64_MAX, can be re-stated as INT64_MIN <= d < (INT64_MAX + 1) as we are dealing with whole numbers. Since the raw application of (double) (INT64_MAX + 1) in code is certainly 0, an alternative, is ((double)(INT64_MAX/2 + 1))*2. This can be extended for rare machines with double of higher powers-of-2 to ((double)(INT64_MAX/FLT_RADIX + 1))*FLT_RADIX. The comparison limits being exact powers-of-2, conversion to double suffers no precision loss and (lo_limit >= d) && (d < hi_limit) is exact, regardless of the precision of the floating point. Note: that a rare floating point with FLT_RADIX == 10 is still a problem.

score 3 · Answer 3 · answered Nov 23 '15 at 08:21

In addition to Pascal Cuoq's elaborate answer, and given the extra context you give in comments, I would add a test for negative zeros. You should preserve negative zeros unless you have good reasons not to. You need a specific test to avoid converting them to (int64_t)0. With your current proposal, negative zeros will pass your test, get stored as int64_t and read back as positive zeros.

I am not sure what is the most efficient way to test them, maybe this:

int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < INT64_MAX)
        && (round(d) == d)
        && (i == (int64_t)d
        && (!signbit(d) || d != 0.0);
}

Hello @chqrlie. Thanks for pointing negative zero, which has been overlooked until your contribution. "According to the IEEE 754 standard, negative zero and positive zero should compare as equal with the usual (numerical) comparison operators, like the == operators of C." (https://en.wikipedia.org/wiki/Signed_zero): it's hard to tell whether the int64EqualsDouble function should ignore negative zero or not. The "converting one type into the other does not lose any information" part of the question requires it, though. +1 — Gwendal Roué, Nov 23 '15 at 08:39

How to test for lossless double / integer conversion?

3 Answers3

Linked