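#include <cstdint>
#include <iostream>

int main()
{
  // 9007199258935295 is odd and larger than 2^53, so it needs 54 significant bits.
  std::int64_t iaVal = 9007199258935295;
  double daVal = static_cast<double>(iaVal);
  std::cout << "Original          " << iaVal << "\nAfter conversion  " << static_cast<std::int64_t>(daVal) << std::endl;
}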

Output:

Original          9007199258935295  
After conversion  9007199258935296

How can I get the correct value from double?

Adrian Mole
  • Short answer: You can't! A 64-bit integer can hold more significant decimal digits than a standard IEEE (64-bit) double-precision floating-point representation. – Adrian Mole Nov 18 '20 at 10:13
  • Related: [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) – Yksisarvinen Nov 18 '20 at 10:17
  • Does this answer your question? [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) – underscore_d Nov 18 '20 at 10:25
  • Have you tried `long double`? https://stackoverflow.com/questions/52762881/is-long-double-in-c-an-implementation-of-ieees-binary128 – Den-Jason Nov 18 '20 at 10:56

2 Answers


From Double-precision floating-point format: IEEE 754 double-precision binary floating-point format: binary64 [emphasis mine]:

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significand precision: 53 bits (52 explicitly stored)

A double-precision IEEE 754-compliant binary floating-point number has a significand precision of 53 bits, whereas a 64-bit signed integer (int64_t) naturally has a precision of 64 bits (63 value bits plus a sign bit), meaning the former cannot represent all values of the latter. Moreover, floating-point types in C++ are not even guaranteed to be IEEE 754 compliant (this is implementation-defined), but for implementations for which they are,

#include <limits>
static_assert(std::numeric_limits<double>::is_iec559, "");

as per the significand argument above, a double is able to represent every value of a 32-bit integer.
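The significand argument can also be checked at compile time. The following is a minimal sketch, assuming an implementation where double is IEEE 754 binary64; the names `exact_limit` and `surely_exact_in_double` are hypothetical and exist only for illustration.

```cpp
#include <cstdint>
#include <limits>

// Assumption: double is IEEE 754 binary64 with a 53-bit significand.
static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");
static_assert(std::numeric_limits<double>::digits == 53, "unexpected significand width");

// A 32-bit signed integer has 31 value bits, comfortably below 53.
static_assert(std::numeric_limits<std::int32_t>::digits <
                  std::numeric_limits<double>::digits,
              "int32_t would not fit in the significand");

// 2^53: every integer with a magnitude up to this bound is exact in a double.
constexpr std::int64_t exact_limit =
    std::int64_t{1} << std::numeric_limits<double>::digits;

// Hypothetical helper: a conservative test. It returns true only when v lies
// in the range where all integers are exactly representable.
constexpr bool surely_exact_in_double(std::int64_t v)
{
    return v >= -exact_limit && v <= exact_limit;
}
```

The check is deliberately conservative: some larger integers, such as even numbers between 2^53 and 2^54, are also exact, but above 2^53 consecutive integers are no longer all representable.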

dfrib

How can I get the correct value from double?

You cannot. The value was lost when you converted it into a type that cannot represent it precisely. Consider an analogous situation: I have converted the int value 42 to the bool value true. When I convert it back to an integer, its value changes to 1. How can I convert it back to the correct value? (I cannot.)
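Although the lost value cannot be recovered, the loss can be detected before relying on the double. A minimal sketch, assuming double is IEEE 754 binary64; the helper name `survives_double_round_trip` is hypothetical:

```cpp
#include <cstdint>

// Hypothetical helper: reports whether v is unchanged by the conversion
// int64_t -> double -> int64_t.
bool survives_double_round_trip(std::int64_t v)
{
    const double d = static_cast<double>(v);
    // Values close to INT64_MAX round up to 2^63, and converting a double
    // that is >= 2^63 back to int64_t is undefined behavior, so reject
    // that case before casting. The literal below is exactly 2^63.
    if (d >= 9223372036854775808.0) {
        return false;
    }
    return static_cast<std::int64_t>(d) == v;
}
```

With the value from the question, `survives_double_round_trip(9007199258935295)` returns false, while the neighboring even value 9007199258935296 passes the check.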

You have these options:

  • Only use values that are representable as double. 9007199258935295 is not representable as a 64-bit binary floating-point number (IEEE-754) [1]. All 32-bit integers are representable.
  • Use long double instead, where the implementation provides a wider format. Both the x86 80-bit extended floating-point format and the 128-bit IEEE-754 floating-point format can represent all 64-bit integers. On some implementations, long double is no wider than double.
  • Instead of finite precision, use arbitrary-precision arithmetic, in which case you don't need to be concerned about a lack of precision. The C++ standard library does not provide an implementation of arbitrary-precision arithmetic.

[1] Although the IEEE-754 standard is ubiquitous, technically the precision of no floating-point type is defined by the C++ language. It is defined by the language implementation.
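Whether the long double option helps depends on the implementation, so it is worth checking rather than assuming. A minimal sketch; the names `long_double_holds_int64` and `through_long_double` are hypothetical and exist only for illustration:

```cpp
#include <cstdint>
#include <limits>

// True when long double has enough binary significand digits for every
// int64_t. int64_t has 63 value bits; INT64_MIN is -2^63, a power of two,
// so it needs no extra precision.
constexpr bool long_double_holds_int64 =
    std::numeric_limits<long double>::radix == 2 &&
    std::numeric_limits<long double>::digits >=
        std::numeric_limits<std::int64_t>::digits;

// Hypothetical helper: converts v to long double and back.
// Only call this for arbitrary inputs when long_double_holds_int64 is true;
// otherwise values near INT64_MAX may round up to 2^63, and converting
// that back to int64_t is undefined behavior.
std::int64_t through_long_double(std::int64_t v)
{
    return static_cast<std::int64_t>(static_cast<long double>(v));
}
```

On x86 with the 80-bit extended format the significand has 64 bits, so the flag is true and the value from the question survives the round trip. Where long double shares the format of double, the flag is false and the result is the same rounding as before.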

eerorika