Why does the conversion (unsigned long long)DBL_MAX (or FLT_MAX) also raise FE_INEXACT?

Question

Code (t1.c):

#include <stdio.h>
#include <float.h>
#include <fenv.h>

#if _MSC_VER
#pragma fenv_access (on)
#else
#pragma STDC FENV_ACCESS ON
#endif


void print_fpe(void)
{
    int fpe = fetestexcept(FE_ALL_EXCEPT);
    printf("current exceptions raised:");
    if (fpe & FE_DIVBYZERO)       printf(" FE_DIVBYZERO");
    if (fpe & FE_INEXACT)         printf(" FE_INEXACT");
    if (fpe & FE_INVALID)         printf(" FE_INVALID");
    if (fpe & FE_OVERFLOW)        printf(" FE_OVERFLOW");
    if (fpe & FE_UNDERFLOW)       printf(" FE_UNDERFLOW");
    if ((fpe & FE_ALL_EXCEPT)==0) printf(" none");
}

volatile double d = DBL_MAX;
volatile float f = FLT_MAX;
volatile signed long long ll;
volatile signed long l;
volatile signed int i;
volatile signed short s;
volatile signed char c;
volatile unsigned long long ull;
volatile unsigned long ul;
volatile unsigned int ui;
volatile unsigned short us;
volatile unsigned char uc;

#define TEST(dst, type, src)         \
    feclearexcept(FE_ALL_EXCEPT);    \
    dst = (type)(src);               \
    print_fpe();                     \
    printf(" line %u\n", __LINE__);

int main(void)
{
    TEST(ll, signed long long, d);
    TEST(l, signed long, d);
    TEST(i, signed int, d);
    TEST(s, signed short, d);
    TEST(c, signed char, d);
    TEST(ll, signed long long, f);
    TEST(l, signed long, f);
    TEST(i, signed int, f);
    TEST(s, signed short, f);
    TEST(c, signed char, f);
    TEST(ull, unsigned long long, d); // line 55
    TEST(ul, unsigned long, d);
    TEST(ui, unsigned int, d);
    TEST(us, unsigned short, d);
    TEST(uc, unsigned char, d);
    TEST(ull, unsigned long long, f); // line 60
    TEST(ul, unsigned long, f);
    TEST(ui, unsigned int, f);
    TEST(us, unsigned short, f);
    TEST(uc, unsigned char, f);
    return 0;
}

Invocations and results:

$ cl t1.c && t1
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

$ clang t1.c && ./a.exe
t1.c:8:14: warning: pragma STDC FENV_ACCESS ON is not supported, ignoring pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS ON
             ^
1 warning generated.
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INEXACT FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INEXACT FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

$ gcc t1.c && ./a.exe
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INEXACT FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INEXACT FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

Question: why does the conversion (unsigned long long)DBL_MAX (or FLT_MAX) raise FE_INEXACT as well as FE_INVALID?

Type `unsigned long long` cannot hold either the integral part of `DBL_MAX` (or `FLT_MAX`), or their fractional part. — Weather Vane, Mar 01 '21 at 16:55
@WeatherVane: Neither can `unsigned long` or `unsigned` represent the integer part. Why the differences? (By the way, any integer type can represent the fractional part of `DBL_MAX` or `FLT_MAX`, as they are zero.) — Eric Postpischil, Mar 01 '21 at 17:05

Nate Eldredge · Accepted Answer · 2021-03-01T23:40:43.420

3

I suppose you're testing this on x86, since that's where I see the behavior you describe. Example. Here's the low-level explanation.

On x86-64, gcc, at least, does most floating-point to integer conversions with the cvttsd2si instruction, which converts a double-precision floating-point number to a 32- or 64-bit signed integer, raising an "invalid" exception if the result is out of range. This instruction can be used to convert to any signed integer type, and also to unsigned integer types of 32 bits or narrower. For instance, a conversion to unsigned 32-bit can be done by converting to signed 64-bit and discarding the high bits.
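The narrowing trick described above can be sketched in C. This is only an illustration of the idea, not the compiler's actual output; the name to_u32_via_i64 is made up here, and the result is only meaningful for inputs whose truncated value already fits in 32 unsigned bits, since out-of-range conversions are undefined in C.

```c
#include <stdint.h>

/* Hypothetical helper: convert a double to uint32_t using only a
   signed 64-bit conversion (the kind cvttsd2si provides with a
   64-bit destination), then keep the low 32 bits.
   Assumes the truncated value lies in [0, 2^32); other inputs are
   outside the scope of this sketch. */
uint32_t to_u32_via_i64(double x)
{
    int64_t wide = (int64_t)x;   /* signed conversion, truncates toward zero */
    return (uint32_t)wide;       /* discard the high 32 bits */
}
```

Values such as 3000000000.0 do not fit in a signed 32-bit integer but do fit in a signed 64-bit one, which is why going through the wider signed type is enough here.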

But this does not work for conversion to unsigned 64-bit, since the input might be a number that doesn't fit in signed 64-bit but would fit in unsigned 64-bit, and x86 has no instruction to make that conversion directly. As such, some extra arithmetic is needed, and it's these additional instructions that produce the "inexact" exception. (Specifically, it does a subsd to subtract (double)LLONG_MAX from the input, which does indeed result in a loss of precision when the input is DBL_MAX.)

See "Unsigned 64-bit to double conversion: why this algorithm from g++" for an example of the sorts of gymnastics that gcc does to do this as efficiently as possible.

Note that on x86-64 you actually see FE_INEXACT with conversion to unsigned long as well, since it has the same width as unsigned long long there. I get the exact behavior you observe on x86-32, where unsigned long long is the only 64-bit type to which this applies. The code in that case is a bit more complicated and I would leave it to you to read through the assembly if you are really interested.
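Whether unsigned long falls into the same case depends on the target's data model, not only on the CPU. A minimal check, with a made-up function name, could look like this:

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if unsigned long has the same range
   as unsigned long long on this target (e.g. LP64 systems), and 0
   otherwise (e.g. 64-bit Windows, where long is 32 bits). */
int ulong_matches_ullong(void)
{
    return ULONG_MAX == ULLONG_MAX;
}
```

When this returns 1, a conversion to unsigned long would be expected to go through the same 64-bit unsigned path as unsigned long long.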

By contrast, when I run this code on AArch64, all lines simply give FE_INVALID. That's because AArch64 does have a dedicated instruction to convert floating point to unsigned 64-bit (fcvtzu) and so there's no further arithmetic that could involve an inexact result.

edited Mar 01 '21 at 23:40

answered Mar 01 '21 at 23:07

Nate Eldredge


Thanks! 1) _these additional instructions that produce the "inexact" exception_: which exactly _additional instructions_? 2) FYI: The hardware I'm dealing with has `float-to-integer` instruction, which raises inexact exception only if `inexact && ! invalid`. I.e. it is not possible to have both `FE_INEXACT` and `FE_INVALID` raised. – pmor Mar 01 '21 at 23:19
@pmor: (1) It's a little simpler on x86-64, where the instruction raising "inexact" is a `subsd` of `(double)ULLONG_MAX` from `DBL_MAX`. See https://godbolt.org/z/GrfGdT, line 120 of the assembly. I'm afraid I somewhat mixed up x86-32 and x86-64 in my answer; the code on x86-32 is more complicated, due to the lack of 64-bit integer instructions, and uses x87 instructions. x86-64 behaves as I described but will also show `FE_INEXACT` on conversion to `unsigned long` (since it's unsigned 64-bit just like `unsigned long long`). – Nate Eldredge Mar 01 '21 at 23:26
@pmor: (2) What hardware is that exactly? I agree on x86 that the float-to-integer `cvttsd2si` instruction raises `invalid` and not `inexact`, but it is *followed* by `subsd` which raises the `inexact`, and since the exceptions weren't cleared in between, you see them both. – Nate Eldredge Mar 01 '21 at 23:28
@pmor: Incidentally, gcc has a `-fno-fp-int-builtin-inexact` option that claims to prevent an "inexact" exception on functions like `ceil` and `round`, but it doesn't affect the behavior on casts. – Nate Eldredge Mar 01 '21 at 23:38
@pmor: Sorry, it's actually `(double)LLONG_MAX`, I think. – Nate Eldredge Mar 01 '21 at 23:40
Thanks for the answers. (2) _What hardware is that exactly_: HW powered by Infineon TriCore. (3) Is raising of `FE_INEXACT` a defect of implementation(s) (gcc, clang, msvc)? Reason: the end user expects only `FE_INVALID` and is surprised by seeing unexpected `FE_INEXACT` (in addition to expected `FE_INVALID`). – pmor Mar 02 '21 at 21:26
@pmor: (2) Okay, so you should probably unaccept this answer as it is really irrelevant to your non-x86 machine. Maybe there is some similar phenomenon going on under the hood on TriCore, but if you want to know what it is, you'll have to ask someone else (state the architecture in the question, as well as your compiler/library with versions, and tag accordingly!) or else just read the disassembled code for yourself. – Nate Eldredge Mar 02 '21 at 22:30
@pmor (3) I'm not enough of a language lawyer to answer that. On a practical level, at least on a machine like x86, I suspect it would be very hard to come up with an implementation that *didn't* raise it, other than by drastically increasing the overhead. – Nate Eldredge Mar 02 '21 at 22:30

score 1 · Answer 2 · answered Mar 01 '21 at 22:48


The code (unsigned long long)DBL_MAX has undefined behaviour, as per C11 6.3.1.4:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined

Since the behaviour is undefined, "anything can happen", i.e. the behaviour is not covered by the standard.
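One way to stay within defined behaviour is to test the range before converting. The following is a minimal sketch, assuming a binary floating-point double; the function names are hypothetical and not part of any standard library.

```c
#include <limits.h>

/* Hypothetical guard: returns 1 if (unsigned long long)x is defined,
   i.e. the truncated value lies in [0, ULLONG_MAX].
   NaN and infinities fail the comparisons and yield 0. */
int fits_ull(double x)
{
    /* ULLONG_MAX + 1 as a double: ULLONG_MAX/2 + 1 is a power of two,
       so both the conversion and the doubling are exact. */
    const double limit = (double)(ULLONG_MAX / 2 + 1) * 2.0;
    return x > -1.0 && x < limit;
}

/* Converts only when the result is defined; returns 0 on failure and
   leaves *out untouched. */
int checked_to_ull(double x, unsigned long long *out)
{
    if (!fits_ull(x))
        return 0;
    *out = (unsigned long long)x;
    return 1;
}
```

With such a guard, DBL_MAX is rejected up front, so the question of which exceptions the out-of-range conversion raises does not arise.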

answered Mar 01 '21 at 22:48

M.M


Also in C17 same section. – chux - Reinstate Monica Mar 01 '21 at 22:56

When the C standard does not define a behavior, the behavior may be defined, partially or completely, by other standards (https://stackoverflow.com/a/65107366/1778275). In this case it is IEEE 754. However, IEEE 754 says: _When a numeric operand would convert to an integer outside the range of the destination format, the invalid operation exception shall be signaled if this situation cannot otherwise be indicated_. I.e. it mentions only the _invalid operation exception_ and not the _inexact exception_. Where does this `FE_INEXACT` come from? Why only for `unsigned long long`? – pmor Mar 01 '21 at 23:06
