Casting from int64_t to double to int64_t again changes its value

Question

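#include <cstdint>
#include <iostream>

int main()
{
  std::int64_t iaVal = 9007199258935295;
  double daVal = static_cast<double>(iaVal); // rounds to the nearest representable double
  std::cout << "Original          " << iaVal << "\nAfter conversion  " << static_cast<std::int64_t>(daVal) << std::endl;
}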

Output :

Original          9007199258935295  
After conversion  9007199258935296

How can I get the correct value from double?

Short answer: You can't! A 64-bit integer can hold more significant decimal digits than a standard IEEE (64-bit) double precision floating-point representation. — Adrian Mole, Nov 18 '20 at 10:13
Related: [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) — Yksisarvinen, Nov 18 '20 at 10:17
Does this answer your question? [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) — underscore_d, Nov 18 '20 at 10:25
Have you tried `long double`? https://stackoverflow.com/questions/52762881/is-long-double-in-c-an-implementation-of-ieees-binary128 — Den-Jason, Nov 18 '20 at 10:56

score 5 · Answer 1 · answered Nov 18 '20 at 10:21

From Double-precision floating-point format: IEEE 754 double-precision binary floating-point format: binary64 [emphasis mine]:

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

Sign bit: 1 bit

Exponent: 11 bits

Significand precision: 53 bits (52 explicitly stored)

A double-precision IEEE 754-compliant binary floating-point number has a significand precision of 53 bits, whereas a signed 64-bit integer (int64_t) has 63 value bits plus a sign bit, meaning the former cannot represent all values of the latter. Moreover, floating-point types in C++ are not even guaranteed to be IEEE 754 compliant (this is implementation-defined). For implementations where they are, which can be checked with

#include <limits>
static_assert(std::numeric_limits<double>::is_iec559, "");

the significand argument above means that a double can represent every value of a 32-bit integer.
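The 53-bit limit can be checked directly. The following is a minimal sketch, assuming an IEEE 754 binary64 `double`. The helper name `round_trips` is hypothetical and not part of any library.

```cpp
#include <cstdint>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559, "sketch assumes IEEE 754 double");
static_assert(std::numeric_limits<double>::digits == 53, "binary64 has a 53-bit significand");

// Hypothetical helper: true if v survives int64_t -> double -> int64_t unchanged.
// The first comparison guards the cast back: a double equal to 2^63 (which is
// what INT64_MAX rounds to) is out of range for int64_t, and converting it
// would be undefined behaviour.
constexpr bool round_trips(std::int64_t v) {
    return static_cast<double>(v) < 9223372036854775808.0
        && static_cast<std::int64_t>(static_cast<double>(v)) == v;
}

// Every 32-bit integer fits comfortably in 53 bits.
static_assert(round_trips(2147483647), "INT32_MAX is exact");
// 2^53 is still exact, but 2^53 + 1 needs 54 bits and gets rounded.
static_assert(round_trips(std::int64_t{1} << 53), "2^53 is exact");
static_assert(!round_trips((std::int64_t{1} << 53) + 1), "2^53 + 1 is not");
```

Between 2^53 and 2^54 only even integers are representable, which is why the odd value in the question comes back changed by one.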

eerorika · Accepted Answer · 2020-11-18T10:58:36.237

How can I get the correct value from double?

You cannot. The value was lost when you converted it into a type that cannot represent it precisely. Consider an analogous situation: I have converted the int value 42 to the bool value true. When I convert it back to an integer, its value changes to 1. How can I convert back to the correct value? (I cannot.)
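The analogy can be written out as a small sketch. The helper name `through_bool` is made up for illustration.

```cpp
// Hypothetical helper mirroring the analogy: int -> bool -> int.
// Any non-zero int becomes true, and true converts back to 1.
constexpr int through_bool(int v) {
    return static_cast<int>(static_cast<bool>(v));
}

static_assert(through_bool(42) == 1, "42 is not recovered");
static_assert(through_bool(0) == 0, "0 survives the round trip");
static_assert(through_bool(1) == 1, "so does 1");
```

Just as only 0 and 1 survive the trip through bool, only integers that double can represent exactly survive the trip through double.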

You have these options:

Only use values that are representable as double. 9007199258935295 is not representable as a 64-bit binary floating-point number (IEEE-754)¹. All 32-bit integers are representable.
Use long double instead, where it is wider than double. Both the x86 80-bit extended format and the 128-bit IEEE-754 format can represent all 64-bit integers, but some implementations (MSVC, for example) give long double the same precision as double.
Instead of finite precision, use arbitrary-precision arithmetic, in which case you don't need to be concerned about a lack of precision. The C++ standard library does not provide an implementation of arbitrary-precision arithmetic.
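The first two options can be guarded in code. The following is a rough sketch, not a definitive implementation: all names (`kExactLimit`, `surely_exact_in_double`, `kLongDoubleHoldsInt64`, `via_long_double`) are hypothetical, and the range check is deliberately conservative.

```cpp
#include <cstdint>
#include <limits>

// Significand width of double on this implementation (53 for IEEE 754 binary64).
constexpr int kDoubleBits = std::numeric_limits<double>::digits;
static_assert(kDoubleBits < 63, "sketch assumes double is narrower than int64_t");

// Every integer with magnitude up to 2^digits is exactly representable.
constexpr std::int64_t kExactLimit = std::int64_t{1} << kDoubleBits;

// Option 1: conservative check. It rejects some larger values that happen to
// be exact (large powers of two, for instance) but never accepts an inexact one.
constexpr bool surely_exact_in_double(std::int64_t v) {
    return v >= -kExactLimit && v <= kExactLimit;
}

// Option 2 only helps if long double really is wide enough. 63 significand
// bits suffice: the sign is stored separately and INT64_MIN is a power of two.
constexpr bool kLongDoubleHoldsInt64 =
    std::numeric_limits<long double>::digits >= 63;

// Round trip through long double; only lossless when kLongDoubleHoldsInt64 is true.
constexpr std::int64_t via_long_double(std::int64_t v) {
    return static_cast<std::int64_t>(static_cast<long double>(v));
}

static_assert(!surely_exact_in_double(9007199258935295),
              "the value from the question is outside the safe range");
```

On a platform where `kLongDoubleHoldsInt64` is false, `via_long_double` loses precision in the same way as the conversion through `double`, so the flag should be checked (or turned into a `static_assert`) before relying on it.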

¹ Although the IEEE-754 standard is ubiquitous, technically the precision of no floating-point type is defined by the C++ language. It is defined by the language implementation.
