Why does casting to uint64_t change the results in this code?

Question

In the following code why do the sum and cast_sum results diverge at 2^53? The results of pow(2, 53) and (uint64_6) pow(2, 53) appear the same but when I sum them I get different results. Overall goal was to sum the results of 2^0 through 2^63. I just don't understand why using pow(2, i) fails and (uint64_t) pow(2, i) works. Or why the results of the two differ starting at 2^53.

#include <math.h>
#include <stdio.h>

int main() {
  uint64_t sum = 0;
  uint64_t cast_sum = 0;
  for (int i = 0; i < 65; ++i) {
    sum += pow(2,i);
    cast_sum += (uint64_t) pow(2,i);
    printf("i: %d, 2^%d = %lf, sum: %lu, cast_sum:%lu\n", i, i, pow(2, i), sum, cast_sum);
  }
}

i: 52, 2^52 = 4503599627370496.000000, cast of 2^52: 4503599627370496, sum: 9007199254740991, cast_sum:9007199254740991
i: 53, 2^53 = 9007199254740992.000000, cast of 2^53: 9007199254740992, sum: 18014398509481984, cast_sum:18014398509481983
i: 54, 2^54 = 18014398509481984.000000, cast of 2^54: 18014398509481984, sum: 36028797018963968, cast_sum:36028797018963967```

The 2^53 is no coincidence. 2^53 is the limit for which integers 64-bit floats (which have 52 mantissa bits plus one implicit bit) can accurately represent. `pow(2, i)` is a `double`, so `sum += pow(2, i)` will be doing a double addition, casting `sum` to a double and only casting it back afterwards. — Luatic, Jul 26 '23 at 18:15
The `double` has only 53 significant bits. And `pow()` is inaccurate. If you must use `pow()` for an integer result please use `round(pow())`. But for integer powers of 2 don't even use `pow()` use bit shifting. So `sum += (uint64_t)1 << i;` — Weather Vane, Jul 26 '23 at 18:16
As mentioned, you're mixing integer and floating point operations, and the implicit casts do not occur where you were expecting them to. But for something like this, you shouldn't be calling `pow` at all. Normally you'd change `pow(2, i)` to `((uint64_t) 1) << i`, but in this case `i` goes up to `64`, so you're going to overflow as soon as you try to force it to a `uint64_t`, regardless of whether you get the value from `pow` or from a left shift. — Tom Karzes, Jul 26 '23 at 18:22
@Luatic That makes sense, thank you. I didn't realize that sum would be cast to a double and then cast back to a uint64_t. I though that the results of pow would be cast to a uint64_t. If you make that an answer I will accept. Thanks again. — Gregory Arenius, Jul 26 '23 at 18:22
@TomKarzes The actual code I wrote only goes up to i = 63 and doesn't wrap around. I upped it a bit while trying to figure things out. You are correct that I didn't understand where the implicit casts were occurring. — Gregory Arenius, Jul 26 '23 at 18:25
@GregoryArenius Ok, so use a left shift. As a general rule, if you should avoid using `pow` when both arguments are integers. When raising `2` to an integer power, it's especially easy since you can use a left shift applied to `1`. Just make sure the `1` has the same type as the desired integer result type. — Tom Karzes, Jul 26 '23 at 18:26
And for integer powers of integers other than 2, you should have [exponentiation](https://stackoverflow.com/questions/101439/the-most-efficient-way-to-implement-an-integer-based-power-function-powint-int) in your toolbox. — Weather Vane, Jul 26 '23 at 18:30
@TomKarzes I'm coming from a Python background and just starting to learn c. Digging in where I don't understand something. A shift wasn't on my radar but it is now, as well as exponentiation. Thank you all for your help. — Gregory Arenius, Jul 26 '23 at 18:35
@GregoryArenius Python has shift operators as well. You don't need to use them to raise an integer `2` to some power in Python, but they're available, and you don't have to worry about integer overflow in Python. Try the following in Python: `2**100 == 1 << 100`. You should get `True`. — Tom Karzes, Jul 26 '23 at 18:52

score 4 · Accepted Answer · answered Jul 26 '23 at 18:24

See C's implicit conversions:

Otherwise, if one operand is double, double complex, or double imaginary (since C99), the other operand is implicitly converted as follows: integer or real floating type to double

pow(2, i) is a double. sum += pow(2,i); thus converts sum to a double before adding; it is roughly equivalent to sum = (uint64_t) (((double) sum) + pow(2, i));.

The 2^53 is no coincidence. 2^53 is the limit for which integers 64-bit floats (which have 52 mantissa bits plus one implicit bit) can accurately represent. When sum uses more than 53 significant bits and is casted to a double, some of the less significant bits will be lost in the conversion.

If instead you cast before adding, all is fine since you're working in the realm of 64-bit integer addition. Note that floats, using an exponent-mantissa representation, can exactly represent even rather large powers of two. So pow(2, i) can be exact for your small i. This can and will then be exactly converted into the appropriate 64-bit unsigned integer.

You should use bit shifts 1ULL << i instead of pow(2, i) though. Depending on how pow is implemented, it may not always be exact. Bitshifts will always be more reliable and more efficient.

If you want the bit pattern where all bits are ones, simply use ~0ULL.

Why does casting to uint64_t change the results in this code?

1 Answers1