C++ Addition of very large unsigned long and double

Question

Consider the following code:

#include <iostream>
#include <limits>

int main(int argc, char **argv) {
  unsigned long n = 10ul;
  unsigned long ul = std::numeric_limits<unsigned long>::max() - n;
  double d = 1.;
  ul += d;
  std::cout << ul << std::endl;
}

One might expect the output to be std::numeric_limits<unsigned long>::max() - 9. However, the output of this code is 0 for all values n < 1024. Why?

Some observations on what I think I understand so far:

We do not exceed std::numeric_limits<unsigned long>::max() so no overflow of ul should happen (mathematically speaking).
Casting d while adding it results in the expected value for ul (change ul += d; to ul += static_cast<unsigned long>(d);).
My guess at what happens:
- ul += d is resolved to ul = (double)ul + d
- This addition is executed as a 64-bit floating-point operation
- The resulting value cannot be represented exactly by double and turns out to be std::numeric_limits<unsigned long>::max() + 1.
- This result is then cast back to unsigned long, which overflows/wraps around to 0.

EDIT

Some testing seems to support my guess above.

double x = std::numeric_limits<unsigned long>::max() results in x holding the value std::numeric_limits<unsigned long>::max() + 1.
Yes, my unsigned long is 64-bit.
The question is not why double is not precise. I understand the concept of floating-point numbers. The question is: what are the exact C++ rules that determine in which data format an expression is evaluated, leading to this unfortunate result?

`double` can only approximate large 64bit integers so the result of your calculation is rounded before assigning back to `ul` — Alan Birtles, Apr 20 '21 at 12:23
https://stackoverflow.com/questions/759201/representing-integers-in-doubles — , Apr 20 '21 at 12:25
@Seriously Did you check the size of your `unsigned long`? 32 bits or 64 bits? — Damien, Apr 20 '21 at 12:25
@AlanBirtles is `ul = (double)ul + d;` what is going on? Not sure if the dupe alone is sufficient as answer. — 463035818_is_not_an_ai, Apr 20 '21 at 12:30
Your guess looks correct, except that casting back to integer type is undefined behavior if it isn't in the integer type's range. — interjay, Apr 20 '21 at 12:32
Side note: This question will work (as in give the mathematically correct result) on LLP64 systems (e.g. Windows), as the maximum value of an unsigned long in that case is 2^32-1, and a double can perfectly represent integers in that range. — Mike Vine, Apr 20 '21 at 12:36
I think you can find the answer somewhere in here https://en.cppreference.com/w/c/language/conversion (that + the duplicate) — 463035818_is_not_an_ai, Apr 20 '21 at 12:37
@largest_prime_is_463035818 yep, it's the cast to double and back that causes the problem: https://godbolt.org/z/GTvGqhT5q, note that enabling optimisations changes the result — Alan Birtles, Apr 20 '21 at 12:38
Note that `std::cout << (unsigned long) (double) ul - ul << std::endl; ` gives 11. At least on my PC (UB !) — Damien, Apr 20 '21 at 12:45

KamilCuk · Accepted Answer · 2021-05-02T17:38:08.750

The question is: what are the exact C++ rules that determine in which data format an expression is evaluated, leading to this unfortunate result?

Let's inspect the line:

ul += d;

Where d has type double and ul has type unsigned long.

From 7.6.19 Assignment and compound assignment operators:

The behavior of an expression of the form E1 op= E2 is equivalent to E1 = E1 op E2 except that E1 is evaluated only once

So ul += d is equivalent to ul = ul + d.

From 7.6.6 Additive operators:

The additive operators + and - group left-to-right. The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

So the usual arithmetic conversions are applied to ul and d in ul + d.

From 7.4 Usual arithmetic conversions:

[...] This pattern is called the usual arithmetic conversions, which are defined as follows:

[...]

Otherwise, if either operand is double, the other shall be converted to double.

[...]

So ul is converted to double in ul + d.

From 7.3.11 Floating-integral conversions (emphasis mine):

A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating-point type. The result is exact if possible. If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.

If the value being converted is outside the range of values that can be represented, the behavior is undefined.

So if the value of ul cannot be represented exactly in double, it is implementation-defined whether the next lower or the next higher representable value is used.

And then, after the calculation, the double result is converted back to unsigned long in the assignment to ul. Again from Floating-integral conversions (emphasis mine):

A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

The output of this code is 0 for all values n < 1024. Why?

GCC documents that it follows C99 Annex F when converting between floating-point and integer types (see the GCC 11.1.0 docs, implementation-defined behavior, 4.6 Floating point). As far as I can see, C99 Annex F leaves the resulting value unspecified in this case, but requires a floating-point exception to be raised. The following code, with a function copied from cppreference feexceptflag,

#include <cfenv>
#include <cstdio>
#include <iostream>
#include <limits>

// Prints every floating-point exception flag that is currently set.
void show_fe_exceptions(void)
{
    std::printf("current exceptions raised: ");
    if(std::fetestexcept(FE_DIVBYZERO))     std::printf(" FE_DIVBYZERO");
    if(std::fetestexcept(FE_INEXACT))       std::printf(" FE_INEXACT");
    if(std::fetestexcept(FE_INVALID))       std::printf(" FE_INVALID");
    if(std::fetestexcept(FE_OVERFLOW))      std::printf(" FE_OVERFLOW");
    if(std::fetestexcept(FE_UNDERFLOW))     std::printf(" FE_UNDERFLOW");
    if(std::fetestexcept(FE_ALL_EXCEPT)==0) std::printf(" none");
    std::printf("\n");
}

int main() {
  unsigned long n = 10ul;
  unsigned long ul = std::numeric_limits<unsigned long>::max() - n;
  double d = 1.;
  show_fe_exceptions();  // flags before the addition
  ul += d;               // unsigned long -> double -> unsigned long
  show_fe_exceptions();  // flags after the addition
  std::cout << ul << std::endl;
}

outputs on godbolt and confirms the exception is raised:

current exceptions raised:  none
current exceptions raised:  FE_INEXACT FE_INVALID
0

So a summary and answer to the question about the output being 0 would be "because it is unspecified behavior"? — Seriously, May 03 '21 at 17:29
