Why uint64_t cannot show pow(2, 64) - 1 properly?

Question

I'm trying to understand why uint64_t type can not show pow(2,64)-1 properly. The cplusplus standard is 199711L.

I checked the pow() function under C++98 standard which is

double pow (double base     , double exponent);
float pow (float base      , float exponent);
long double pow (long double base, long double exponent);
double pow (double base     , int exponent);
long double pow (long double base, int exponent);

So I wrote the following snippet

double max1 = (pow(2, 64) - 1);
cout << max1 << endl;

uint64_t max2 = (pow(2, 64) - 1);
cout << max2 << endl;

uint64_t max3 = -1;
cout << max3 << endl;

The outputs are:

max1: 1.84467e+019
max2: 9223372036854775808
max3: 18446744073709551615

related: https://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e — NathanOliver, Mar 01 '19 at 16:17
To compute 2^64 - 1, 2^64 must fit in a register before subtracting 1. Granted, the result fits in a uint64_t, but not the intermediate result. — AlexG, Mar 01 '19 at 16:19
@AlexG, Thanks. I thought it is the case. However, it can not explain why double max1 = (pow(2, 64) - 1); shows correctly. — Diame, Mar 01 '19 at 16:21
Are you certain it does use the integer version of the pow function? — AlexG, Mar 01 '19 at 16:22
One thing to be noted: IEEE 754 (which double *usually* follows) defines 52 bit for the mantissa - which limits double's precision (in normalized mode, you have one *implicit* bit more, but still). So you won't ever be able to represent `2^64-1` correctly in double either, but actually suffer from rounding. — Aconcagua, Mar 01 '19 at 16:24
@Diame `double` doesn't, but you aren't checking down to the digits that are wrong — Caleth, Mar 01 '19 at 16:26
@Justin you lack `#include ` and (possibly) `std::` from the call site — Caleth, Mar 01 '19 at 16:29
@Caleth No, I declared the functions. I could've renamed it to `foo(...)`. I was just showing that the stated overload set can't compile with the given expression. It probably would have made more sense to do it that way, though. — Justin, Mar 01 '19 at 16:30
`uint64_t max2b = pow(2.0, 64.0) - 2048.0;` will do closer to what you want, since 2048.0 is epsilon at that range. (Assuming IEEE 754.) — Eljay, Mar 01 '19 at 16:36

eerorika · Accepted Answer · 2019-03-01T16:52:52.037

Floating point numbers have finite precision.

On your system (and typically, assuming binary64 IEEE-754 format) 18446744073709551615 is not a number that has a representation in the double format. The closest number that does have a representation happens to be 18446744073709551616.

Subtracting (and adding) together two floating point numbers of wildly different magnitudes usually produces an error. This error can be significant in relation to the smaller operand. In the case of 18446744073709551616. - 1. -> 18446744073709551616. the error of the subtraction is 1, which is in fact the same value as the smaller operand.

When a floating point value is converted to an integer type, and the value cannot fit in the integer type, the behaviour of the program is undefined - even when the integer type is unsigned.

phuclv · Answer 2 · 2020-10-25T13:01:31.897

TL;DR: It's not that uint64_t type cannot show pow(2,64)-1 properly but the reverse: double can't store precisely 2⁶⁴ - 1 due to the lack of significand bits. You can only do that with types with 64 bits of precision or more (like long double on many platforms). Try std::pow(2.0L, 64) - 1.0L (note the L suffix) or powl(2.0L, 64) - 1.0L; and see

Anyway you shouldn't use a floating-point type for integer math right from the beginning. Not only it's far slower to calculate pow(2, x) than 1ULL << x, it'll also cause the issue you saw due to the limited precision of double. Use uint64_t max2 = -1 instead, or ((unsigned __int128)1ULL << 64) - 1 if the compiler supports __int128

pow(2, 64) - 1 is a double expression, not int, as pow doesn't have any overload that returns an integral type. The integer 1 will be promoted to the same rank as the result of pow

However because IEEE-754 double precision is only 64-bit long, you can never store values that have 64 significant bits or more like 2⁶⁴-1

So pow(2, 64) - 1 will be rounded to the closest representable value, which is pow(2, 64) itself, and pow(2, 64) - 1 == pow(2, 64) will result in 1. The largest value that's smaller than it is 18446744073709549568 = 2⁶⁴ - 2048. You can check that with std::nextafter

On some platforms (notably x86, except on MSVC) long double does have 64 bits of significand, so you'll get the correct value in that case. The following snippet

double max1 = pow(2, 64) - 1;
std::cout << "pow(2, 64) - 1 = " << std::fixed << max1 << '\n';
std::cout << "Previous representable value: " << std::nextafter(max1, 0) << '\n';
std::cout << (pow(2, 64) - 1 == pow(2, 64)) << '\n';

long double max2 = pow(2.0L, 64) - 1.0L;
std::cout << std::fixed << max2 << '\n';

prints out

pow(2, 64) - 1 = 18446744073709551616.000000
Previous representable value: 18446744073709549568.000000
1
18446744073709551615.000000

You can clearly see long double can store the correct value as expected

On many other platforms double may be IEEE-754 quadruple-precision or double-double. Both have more than 64 bits of significand so you can do the same thing. But of course the overhead will be higher

Why uint64_t cannot show pow(2, 64) - 1 properly?

2 Answers2

Linked

Related