When is integer to floating point conversion lossless?

Question

In particular, I'm interested in whether int32_t is always losslessly converted to double.

Does the following code always return true?

int is_lossless(int32_t i)
{
    double   d = i;
    int32_t i2 = d;
    return (i2 == i);
}

What about int64_t?

Obviously not the case for `int64_t`, as the mantissa is only 52 bits in size. But for `int32_t` that should hold unconditionally. — Ext3h, Aug 04 '20 at 09:59

chux - Reinstate Monica · Answer 1 · 2020-08-05T16:58:00.700

When is integer to floating point conversion lossless?

When the floating point type has enough precision and range to encode all possible values of the integer type.

Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.

As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, so let us look at precision.

With the common double, with its base 2, sign bit, 53-bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. This double has DBL_MANT_DIG == 53, which in this case is the number of base-2 digits in the floating-point significand.

The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
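A minimal sketch of that boundary, assuming `double` is IEEE 754 binary64 (DBL_MANT_DIG == 53). The names `round_trips_small`, `DEMO_INT54_MAX` and `DEMO_INT54_MIN` are made up for this illustration, since no real `int54_t` type exists:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the limits of the imaginary int54_t type. */
#define DEMO_INT54_MAX ((INT64_C(1) << 53) - 1)   /*  2^53 - 1 */
#define DEMO_INT54_MIN (-(INT64_C(1) << 53))      /* -2^53     */

/* Returns true when v survives int64_t -> double -> int64_t.
   Only meant for |v| far below 2^63, so the conversion back
   to int64_t is always in range (no UB). */
static bool round_trips_small(int64_t v)
{
    volatile double d = (double)v;  /* volatile: keep the value as a real double */
    return (int64_t)d == v;
}
```

On such a platform, `round_trips_small` should hold for `DEMO_INT54_MIN`, `DEMO_INT54_MAX` and `DEMO_INT54_MAX + 1` (which is 2^53), and fail for `DEMO_INT54_MAX + 2` (which is 2^53 + 1).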

With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.

With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.

Code is always true with int32_t, regardless of double encoding.
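The claim can be spot-checked with a small sketch. It is not a proof: it only tests the two limits plus a strided sample of the range. `is_lossless32` mirrors the question's function, and `check_int32_sample` is a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same logic as the question's is_lossless(), with explicit casts. */
static bool is_lossless32(int32_t i)
{
    double  d  = i;
    int32_t i2 = (int32_t)d;
    return i2 == i;
}

/* Spot-check: both limits plus a strided sweep over the whole range.
   The loop variable is 64-bit so it cannot overflow past INT32_MAX. */
static bool check_int32_sample(void)
{
    if (!is_lossless32(INT32_MIN) || !is_lossless32(INT32_MAX))
        return false;
    for (int64_t v = INT32_MIN; v <= INT32_MAX; v += 65521) {
        if (!is_lossless32((int32_t)v))
            return false;
    }
    return true;
}
```

An exhaustive loop over all 2^32 values is also feasible on current hardware if a full check is wanted.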

What about int64_t?

UB potential with int64_t.

The conversion in int64_t i ... double d = i;, when inexact, yields an implementation-defined choice between the 2 nearest candidates. This is often round-to-nearest. Then i values near INT64_MAX can convert to a double one more than INT64_MAX.

With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).

A simple prior test to detect this:

#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false;  // not lossless
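Put together with the question's function, the guard might look like the sketch below. It assumes a base-2 `double`, so that -2^63 and 2^63 are exactly representable and the only out-of-range result of the first conversion is 2^63. The name `is_lossless64` is made up here:

```c
#include <stdbool.h>
#include <stdint.h>

/* 2^63 as a double, built without ever converting INT64_MAX itself. */
#define INT64_MAX_P1 ((INT64_MAX / 2 + 1) * 2.0)

static bool is_lossless64(int64_t i)
{
    double d = (double)i;
    if (d == INT64_MAX_P1)
        return false;          /* d is out of int64_t range: converting back would be UB */
    int64_t i2 = (int64_t)d;   /* safe: d lies in [-2^63, 2^63) */
    return i2 == i;
}
```

With a binary64 `double`, this returns false for `INT64_MAX` and for 2^62 + 1, and true for `INT64_MIN` and 2^62.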

bruno · Answer 2 · 2020-08-04T15:51:59.247

Note: my answer supposes the double follows IEEE 754, and both int32_t and int64_t are 2's complement.

Does the following code always return true?

The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error because there is no possible precision error (and there is no possible overflow/underflow, as the exponent covers more than the needed range of values).

What about int64_t?

But the 53 bits of mantissa/significand (including 1 implicit bit) of a double are not enough to hold the 64 bits of an int64_t => an int64_t whose highest and lowest set bits are far enough apart cannot be stored in a double without precision error (there is still no possible overflow/underflow, as the exponent still covers more than the needed range of values).
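The "bits far enough apart" criterion can be sketched with integer arithmetic only. This is an illustration under the assumption that the exponent range is sufficient (true for binary64 and any int64_t). The function name `fits_significand` is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the span from the highest set bit to the lowest set bit
   of |v| fits in `digits` binary digits (53 for binary64). */
static bool fits_significand(int64_t v, int digits)
{
    /* Magnitude as unsigned, so INT64_MIN is handled without overflow. */
    uint64_t m = v < 0 ? 0 - (uint64_t)v : (uint64_t)v;
    if (m == 0)
        return true;
    while ((m & 1) == 0)       /* drop trailing zero bits: the exponent absorbs them */
        m >>= 1;
    int width = 0;
    while (m != 0) {           /* count the bits that must live in the significand */
        width++;
        m >>= 1;
    }
    return width <= digits;
}
```

For example, 2^62 needs a single significand bit and fits, while 2^62 + 1 spans 63 bits and does not fit in 53.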

edited Aug 04 '20 at 15:51

answered Aug 04 '20 at 09:59

bruno

Integers are stored with both non-zero exponent and mantissa, your logic is incorrect. Play around and see how integers convert to floating-point binary representation https://www.h-schmidt.net/FloatConverter/IEEE754.html For example, 31 is stored as 1.9375 mantissa and 131 exponent. – Maxim Egorushkin Aug 04 '20 at 10:10
@MaximEgorushkin I never said `Integers are stored with both non-zero exponent ...`, but without an overflow/underflow problem for an int32, the only possible problem is precision, which is directly linked to the size of the mantissa, and that mantissa is large enough for any int32 => no problem with int32. For an int64 everything is different. My logic is right, your way of reading my answer is wrong ;-) – bruno Aug 04 '20 at 10:27
@bruno You implied that a 32-bit integer is stored entirely in mantissa: _the mantissa of a double is longer than 32b so int32_t => double is always done without error_. – Maxim Egorushkin Aug 04 '20 at 10:30
@DanielLangr See https://www.exploringbinary.com/number-of-digits-required-for-round-trip-conversions/ – Maxim Egorushkin Aug 04 '20 at 10:30
@MaximEgorushkin again no, I know how floating point numbers are represented, do you understand what precision/underflow/overflow are? – bruno Aug 04 '20 at 10:32
@MaximEgorushkin Could you provide a single example of a 32-bit integer that cannot be stored in `double`? (Anyway sorry, 31 is actually stored as 1.1111b shifted by 4, there is an implicit 1 in front of the decimal point.) – Daniel Langr Aug 04 '20 at 10:32
@MaximEgorushkin again, an int cannot be represented without error in a floating point number if there is an underflow/overflow/precision problem, and these problems cannot occur when going from an int32 to a double (of course this is not the same for a float), but they occur for some int64 – bruno Aug 04 '20 at 10:39
@bruno A `double` can store any 32-bit integer, but your logic and explanation are totally wrong. For example, a floating point number with 32-bit mantissa and 1-bit exponent and 1-bit sign can only exactly represent integers in range `[-1,1]`. – Maxim Egorushkin Aug 04 '20 at 10:41
@MaximEgorushkin I don't understand what is wrong with bruno's explanation. An (unsigned) integer is stored in FP such that all bits to the right of the highest set bit are stored in the mantissa (plus the exponent is calculated and coded). Since there cannot be more than 32 such bits for a 32-bit integer, once the mantissa has more than 32 bits, there cannot be any loss (provided the exponent can be represented, which is always true for `int32_t` and `double`). – Daniel Langr Aug 04 '20 at 10:47
@MaximEgorushkin `a floating point number with 32-bit mantissa and 1-bit exponent can only exactly represent integers in range [0,2)` what are you talking about? C and C++ follow the IEEE floating point norm and there is no possible overflow/underflow for an int32. You know I can also make my own 32-bit int representation that can store the special number 5656536552635656356523656536567235625365263562563567536, which cannot be represented by a double, does that make sense? No! – bruno Aug 04 '20 at 10:48
I wouldn't just agree with _"big `int64_t` cannot be stored in a `double` without precision error"_. It's not about big numbers, it's about numbers that have set bits highly distant from each other. For instance, there shouldn't be any problem with 2^62 being stored in `double`: [live demo](https://www.binaryconvert.com/result_double.html?decimal=052054049049054056054048049056052050055051056055057048052), but 2^62+1 can't be represented since the lowest set bit is "lost". – Daniel Langr Aug 04 '20 at 10:51
@DanielLangr I agree 'big' is wrongly said (I wanted to say 'big' is number of bits) , I edited my answer, thank you – bruno Aug 04 '20 at 10:57
@MaximEgorushkin: The phrasing in this answer is correct. The fact that the significand of a `double` exceeds 32 bits implies the `double` has sufficient precision to exactly represent any 32-bit integer. This answer does not state the integer is stored “entirely” in the significand. It is true the exponent must have sufficient range, but that is a minor quibble and has been addressed. – Eric Postpischil Aug 04 '20 at 13:05
Note that the preferred term for the fraction portion of a floating-point representation is “significand,” not “mantissa,” as used in the IEEE-754 standard. A “mantissa” is the fraction portion of a logarithm. Significands are linear; multiplying a significand by, say, 1.1 multiplies the represented value by 1.1. Mantissas are logarithmic; adding to a mantissa multiplies the represented value. – Eric Postpischil Aug 04 '20 at 13:07
@EricPostpischil ok I replaced *mantissa* by *mantissa/significand*, but is it really a 'd' at the end rather than a 't' ? – bruno Aug 04 '20 at 13:13
@EricPostpischil The answer implies that an integer is stored solely in mantissa. However, for example, `1` and `2` are stored as the same mantissa but different exponent. This is a wrong explanation because it ignores the exponent. – Maxim Egorushkin Aug 04 '20 at 13:28
@MaximEgorushkin: No, it does not. The answer states the significand supplies sufficient precision for the task. It does. The fact that 1 and 2 are stored with the same significand is irrelevant; in each case, the significand supplies sufficient precision for the task, and that is what is stated in this answer. Your understanding of the language is simply incorrect. – Eric Postpischil Aug 04 '20 at 13:30
@bruno: The point Maxim is making is that the significand being large enough to store an integer is necessary for preserving the exact value of an int in a floating-point conversion, but not *sufficient* to preserve it. You also need sufficient exponent bits to store the integer value as its proper value in the float. You're correct that a double can do it, but your explanation of *why* a double can do it is missing key information (the role of exponent bits). – Nicol Bolas Aug 04 '20 at 14:15
@NicolBolas It seems you missed `(and there is no possible overflow/underflow of course)` in my answer. Regarding your remark, please note the number of bits of the exponent alone is not enough, the bias is also relevant. Anyway, given the IEEE-754 encoding, going into those details is useless – bruno Aug 04 '20 at 14:53
@bruno: As far as I understand, the bias in IEEE-754 is *defined by* the number of exponent bits (ie: `bias = (2^(N-1)) - 1`, where N is the number of exponent bits). So the number of bits in the exponent directly correlates to the bias. Also, your statement is a parenthetical and therefore is not given equivalent weight to the number of mantissa bits. – Nicol Bolas Aug 04 '20 at 14:57
@NicolBolas this is because the bias is 'well chosen' for IEEE-754 (without surprise ^^), but in the absolute the same encoding can be used with another bias allowing a different range of values. My answer is for IEEE-754, but also implicitly for complement 2 integer, else even a *double* may not be enough if one chooses an (unusable) 32-bit int encoding where all values are out of a double's range because all bits at 0 means 1e400 – bruno Aug 04 '20 at 15:07
@bruno: "*but also implicitly for complement 2 integer*" No, it's *explicitly* for a 2's complement integer, because all of the `int**_t` types are required by the standard to be 2's complement. Other integer types (pre-C++20) may or may not be 2's complement, but those specific types if provided by an implementation *must* be 2's complement. – Nicol Bolas Aug 04 '20 at 15:07
@NicolBolas Apart from that, a double cannot exactly represent all int64 values; is that because of the exponent & bias? No! It is because of the mantissa/significand, so pity ... – bruno Aug 04 '20 at 15:16
@bruno Note that C++ standard doesn't require IEEE 754. See `std::numeric_limits<>::is_iec559`. – Maxim Egorushkin Aug 04 '20 at 15:39
@MaximEgorushkin I effectively supposed that, thank you for the remark. Finally I edited my answer to put encodings explicit rather than (possibly wrongly) implicit. – bruno Aug 04 '20 at 15:47
@NicolBolas finally the remark of Maxim Egorushkin added to yours made me edit my answer. I was far from imagining so many remarks for quite a simple problem ... – bruno Aug 04 '20 at 15:58

kvantour · Accepted Answer · 2020-08-05T15:02:19.933

Question: Does the following code always return true?

Always is a big statement and therefore the answer is no.

The C++ Standard does not specify whether the floating-point types known to C++ (float, double and long double) are IEEE-754 types. The standard explicitly states:

There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.

(source: C++ standard: basic fundamentals)

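Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64, which stores 1 sign bit, 11 exponent bits and 52 fraction bits (the figure depicting this layout is missing here), and whose normal values are decoded as:

(-1)^sign × (1.b51 b50 ... b0)_2 × 2^(e - 1023)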
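As a rough illustration of that decoding, the sketch below splits a `double` into its three fields. It assumes `double` really is binary64 and shares its byte order with 64-bit integers, which holds on mainstream platforms but is not guaranteed by the language. The struct and function names are made up for this example:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes a 64-bit double");

/* Hypothetical helper type: the three fields of a binary64 value. */
struct binary64_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, biased by 1023 */
    uint64_t fraction;   /* 52 explicitly stored significand bits */
};

static struct binary64_fields decode_binary64(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* copy out the object representation */

    struct binary64_fields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

Under those assumptions, `decode_binary64(31.0)` gives sign 0, exponent 1027 (that is, 1023 + 4) and fraction 0xF000000000000, i.e. 1.1111b × 2^4.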

However, there is a plethora of other floating-point formats out there that are decoded differently and do not necessarily have the same properties as the well-known IEEE-754. Nonetheless, they are all in all similar:

They are n bits long
One bit represents the sign
m bits represent the significand, with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)

To know whether a double can represent all 32-bit signed integers, you must answer the following questions (assuming our floating-point number is in base 2):

1. Does my floating-point representation have a hidden first bit in the significand? If so, assume m=m+1.
2. A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
3. Is the exponent large enough that it can represent a number of the form 1.xxxxx × 2^31?

If you can answer yes to the last two questions, then yes, an int32 can always be represented by the double that is implemented on this particular system.
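In C, those questions can be approximated with the `<float.h>` macros. The sketch below is a sufficient check, not a necessary one, since it simply gives up on non-binary radices. `DBL_MANT_DIG` already counts a hidden bit if there is one, which covers the first question. The function name is hypothetical:

```c
#include <float.h>
#include <stdbool.h>

/* Sufficient (not necessary) condition for double to hold every int32_t. */
static bool double_holds_all_int32(void)
{
    return FLT_RADIX == 2        /* the questions above assume base 2 */
        && DBL_MANT_DIG >= 31    /* significand holds the 31 value bits */
        && DBL_MAX_EXP >= 32;    /* 2^(DBL_MAX_EXP - 1) is finite, so 2^31 is in range */
}
```

For IEEE 754 binary64 the macros are 2, 53 and 1024, so the check passes.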

Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.

C/C++ imposes certain FP requirements concerning precision and range. Even without IEEE-754 compliance, `int32_t` round-trips exactly through `double`, given the minimal language requirements of `DBL_DIG >= 10` and `DBL_MAX >= 1e37`. Re: "a double can represent all 32-bit signed integer or not," requirement 1 is an implementation detail. Re #2: yes for `double` per the language requirements. #3 is also yes per the language requirements. — chux - Reinstate Monica, Aug 05 '20 at 17:00
@chux could you point me to a reference where this is stated? I've been browsing through the standard and could not find any minimum requirements for `float`, `double` or `long double`. I did find some for the integer types. — kvantour, Aug 05 '20 at 17:47
So usually floating point can hold losslessly only _m+2_ bits integer (_m_ is without hidden first bit). — anton_rh, Aug 05 '20 at 18:25
Also we should consider the corner case value. `int16_t` maximum is _32 767_, but minimum is _-32 768_. This _-32 768_ couldn't be stored if _m == 14_. But it actually can, because _-32 768 == 1.000 * 2^15_ (where _1.000_ is fraction and _15_ is exponent). — anton_rh, Aug 05 '20 at 18:26
@chux-ReinstateMonica, this is for the C standard. C++ on the other hand, uses the same variable names and definitions, but does not make any restriction to it (I believe). This is stated in the standard (marked section in this answer) and also see [here](https://stackoverflow.com/questions/34294938/does-the-c-standard-specify-anything-on-the-representation-of-floating-point-n) — kvantour, Aug 06 '20 at 09:24
C++ spec, in _numerics_limits_, does footnote these as equivalent to the `DBL_...` C counterparts which do have mins/maxs. Perhaps a SO question is needed? — chux - Reinstate Monica, Aug 06 '20 at 15:07
@chux-ReinstateMonica See [this question](https://stackoverflow.com/q/63489656/8344060) — kvantour, Aug 19 '20 at 14:56

Bathsheba · Answer 4 · 2020-08-04T13:24:32.680

If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.

(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)

To test for IEEE754, use

static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");

edited Aug 04 '20 at 13:24

answered Aug 04 '20 at 10:52

Bathsheba

The fact that IEEE-754 is used is insufficient. That IEEE-754 binary64 is used for `double` would be sufficient. – Eric Postpischil Aug 04 '20 at 13:11
A C++ implementation may choose not to assert `is_iec559` because it does not fully conform in its arithmetic and other behavior, even though it uses the IEEE-754 binary64 format for double. Various values provided in the `std::numeric_limits` template can be used to test whether there is enough precision in the type to represent all values of an `int32_t`. – Eric Postpischil Aug 04 '20 at 13:13
