Can the floating-point status flag FE_UNDERFLOW set when the result is not sub-normal?

Question

While investigating floating-point exception status flags, I came across the curious case of a status flag FE_UNDERFLOW set when not expected.

This is similar to When does underflow occur? yet goes into a corner case that may be a C specification issue or FP hardware defect.

// pseudo code
//                       s bias_expo implied "mantissa"
w = smallest_normal;  // 0 000...001 (1)     000...000
x = w * 2;            // 0 000...010 (1)     000...000
y = next_smaller(x);  // 0 000...001 (1)     111...111
round_mode(FE_TONEAREST);
clear_status_flags();
z = y/2;              // 0 000...001 (1)     000...000

FE_UNDERFLOW is set!?

I did not expect FE_UNDERFLOW to be set as z above is normal, not a sub-normal.
I expect FE_UNDERFLOW when the result of an earlier floating-point operation was subnormal with a loss of precision. In this case there is a loss of precision.

I tried this with my float and long double and had the same result.

After much investigation, I noted that __STDC_IEC_559__ is not defined.

Questions

If __STDC_IEC_559__ is defined, what is the correct state of underflow in this case?
With lack of a defined __STDC_IEC_559__ am I stuck with "Implementations that do not define __STDC_IEC_559__ are not required to conform to these specifications." C11 or is there some C specification that indicates this result is incorrect?
Since this is certainly a result of my hardware (processor), your result may differ and that would be interesting to know.

Follows is some test code that demos this. At first I suspected it is because FLT_EVAL_METHOD = 2 on my machine, yet then I tried similar code with long double and the same result.

// These 2 includes missing in original post, yet in my true test code
#include <float.h>
#include <math.h>

#include <fenv.h>
#include <stdio.h>
#include <stdint.h>

#define N (sizeof excepts/sizeof excepts[0])
void Report_IEC_FP_exception_status_flags(const char *s) {
  printf("%s", s);
  int excepts[] = { //
      FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, };
  const char *excepts_str[N] = { //
      "FE_DIVBYZERO", "FE_INEXACT", "FE_INVALID", "FE_OVERFLOW", "FE_UNDERFLOW", };
  int excepts_val[N];

  for (unsigned i = 0; i < N; i++) {
    excepts_val[i] = fetestexcept(excepts[i]);
  }
  for (unsigned i = 0; i < N; i++) {
    if (excepts_val[i]) printf(" %s", excepts_str[i]);
  }
  printf("\n");
  fflush(stdout);
}
#undef N

void test2(float f, int round_mode, const char *name) {
  union {
    float f;
    uint32_t u32;
    } x = { .f = f};

  printf("x:%+17a %08lX normal:%c round_mode:%d %s\n", //
  f, (unsigned long) x.u32, isnormal(f) ? 'Y' : 'n', round_mode, name);
  if (feclearexcept(FE_ALL_EXCEPT)) puts("Clear Fail");
  Report_IEC_FP_exception_status_flags("Before:");

  f /= 2;

  Report_IEC_FP_exception_status_flags("After :");
  printf("y:%+17a %08lX normal:%c\n\n", 
      f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
}

Driver

// In same file as above
int main(void) {
  #ifdef __STDC_IEC_559__
    printf("__STDC_IEC_559__ = %d\n", __STDC_IEC_559__);
  #else
    printf("__STDC_IEC_559__ = not define\n");
  #endif

  float f = FLT_MIN;
  printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
  printf("FLT_MIN:%+17a\n", f);
  f *= 2.0f;
  test2(f, FE_TONEAREST, "FE_TONEAREST");
  f = nextafterf(f, 0);
  test2(f, FE_TONEAREST, "FE_TONEAREST");   // *** problem? ***
  f = nextafterf(f, 0);
  test2(f, FE_TONEAREST, "FE_TONEAREST");
}

Output

__STDC_IEC_559__ = not define
FLT_EVAL_METHOD = 2
FLT_MIN:        +0x1p-126
x:        +0x1p-125 01000000 normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y:        +0x1p-126 01000000 normal:Y

x: +0x1.fffffep-126 00FFFFFF normal:Y round_mode:0 FE_TONEAREST
Before:
After : FE_INEXACT FE_UNDERFLOW                *** Why FE_UNDERFLOW? ***
y:        +0x1p-126 00FFFFFF normal:Y          *** Result is normal  ***

x: +0x1.fffffcp-126 00FFFFFE normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y: +0x1.fffffcp-127 00FFFFFE normal:n

Ref

IEEE_754

Implementation notes:

GNU C11 (GCC) version 6.4.0 (i686-pc-cygwin) compiled by GNU C version 6.4.0, GMP version 6.1.2, MPFR version 3.1.5-p10, MPC version 1.0.3, isl version 0.14 or 0.13

glibc 2.26 released.

Intel Xeon W3530, 64-bit OS (Windows 7)

[Minor Update] The illustrative print of the quotient as an 32-bit hex number should have used y.u32. This does not change the function under test

// printf("y:%+17a %08lX normal:%c\n\n", 
//    f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
union {
  float f;
  uint32_t u32;
} y = { .f = f};
printf("y:%+17a %08lX normal:%c\n\n", 
    f,(unsigned long) y.u32, isnormal(f) ? 'Y' : 'n');
//                    ^^^^^

@EOF notes appended to post. Sorry just one platform so far. Yet a good idea for others - I'll try another later. — chux - Reinstate Monica, Oct 06 '17 at 18:19
As far as I am aware, if `__STDC_IEC_559__` is not defined then you are indeed stuck without any specifications for details of FP behavior, except any that your implementation itself may provide. — John Bollinger, Oct 06 '17 at 18:27
I get this on my Raspberry Pi: `__STDC_IEC_559__ = 1 FLT_EVAL_METHOD = 0 FLT_MIN: +0x1p-126 x: +0x1p-125 01000000 normal:Y round_mode:0 FE_TONEAREST Before: After : y: +0x1p-126 01000000 normal:Y x: +0x1.fffffep-126 00FFFFFF normal:Y round_mode:0 FE_TONEAREST Before: After : FE_INEXACT FE_UNDERFLOW y: +0x1p-126 00FFFFFF normal:Y x: +0x1.fffffcp-126 00FFFFFE normal:Y round_mode:0 FE_TONEAREST Before: After : y: +0x1.fffffcp-127 00FFFFFE normal:n` So the same behavior *with* `__STDC_IEC_559__`. — EOF, Oct 06 '17 at 18:27
@eof, thanks- glad you could run it. Yours reports appears to only differ in `__STDC_IEC_559__ = 1 FLT_EVAL_METHOD = 0` - hmmmm. — chux - Reinstate Monica, Oct 06 '17 at 18:34
Interesting. To compile your code successfully and without warnings on gcc 4.8.5 (Linux / x86_64), I need to add headers `float.h` and `math.h`. Having done so, I find that `__STDC_IEC_559__` **is** defined (to 1), and that `FLT_EVAL_METHOD` is defined to 0, according to the program's output. Otherwise, however, I observe the same output presented in the question. — John Bollinger, Oct 06 '17 at 18:39
@EOF Consider trying `long double` math by adding `long double fld = (1.0L - LDBL_EPSILON/2)*(LDBL_MIN * 2.0L);` at the beginning of `test2()` and replace `f /= 2;` with `fld /= 2;` Same `FE_UNDERFLOW`? — chux - Reinstate Monica, Oct 06 '17 at 18:41
@JohnBollinger I also did have those 2 `#include #include ` and omitted in post -post updated. — chux - Reinstate Monica, Oct 06 '17 at 18:43
It is conceivable that your more recent GCC and/or glibc declines to define `__STDC_IEC_559__` precisely because of small variances from the requirements of Annex F such as the behavior you observe seems to be. However, having only a Wikipedia summary of IEEE-754 to go by, I cannot be certain whether the behavior you observe in fact fails to conform. — John Bollinger, Oct 06 '17 at 18:44
@JohnBollinger I also suspect many platforms are nearly _IEEE-754_, yet not quite 100%. — chux - Reinstate Monica, Oct 06 '17 at 18:47
@chux You're also missing `#include`s for `main()`. Also, `long double` seems to report `FE_INEXACT` and `FE_UNDERFLOW` too (assuming no optimizations to avoid the computation alltogether), but I haven't checked whether `long double` is different from `double` on the Raspian gcc. — EOF, Oct 06 '17 at 18:48
My `long double` is an [80-bit](https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format) one. — chux - Reinstate Monica, Oct 06 '17 at 18:53
@chux I've just checked, `gcc version 4.9.2 (Raspbian 4.9.2-10)` for ARMv7 with vfpv4 and NEON has 8-byte `long double`, same as `double`. — EOF, Oct 06 '17 at 18:56
Perhaps footnote 371 of C 2011 is relevant: "IEC 60559 allows different definitions of underflow. They all result in the same values, but differ on when the floating-point exception is raised." — John Bollinger, Oct 06 '17 at 19:14
It looks like the implementation can choose to [detect underflow before or after rounding](http://grouper.ieee.org/groups/754/email/msg04221.html), so the result is correct. — nwellnhof, Oct 06 '17 at 19:16
@nwellnhof "The implementor may choose how these events are detected, but shall detect these events in the same way for all operations." looks reasonable, but is that choice of before/after underflow detection traceable to an IEC 60559 specification as reported by C11 or a sage interpretation? It seems to be in "The relevant clause in the 2008 standard is 7.5." - Guess I'll need to buy the spec. — chux - Reinstate Monica, Oct 06 '17 at 21:17
@nwellnhof [Two definitions were allowed for the determination of the 'tiny' condition: before or after rounding ...](https://en.wikipedia.org/wiki/IEEE_754_revision) supports your [good reference](https://stackoverflow.com/questions/46611633/can-the-floating-point-status-flag-fe-underflow-set-when-the-result-is-not-sub-n#comment80177104_46611633) and goes further with "_recommended_ that only tininess after rounding..." — chux - Reinstate Monica, Oct 06 '17 at 21:26

chux - Reinstate Monica · Accepted Answer · 2021-03-23T22:38:40.227

Although not intended as a self answer, input from various commenters @John Bollinger, @nwellnhof and further research leads to:

Can the floating-point status flag FE_UNDERFLOW set when the result is not sub-normal?

Yes, in narrow situations - when the math answer lies between sub-normals and normals. See following.

"Underflow" occurs when:

The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type. C11 7.12.1 Treatment of error conditions

The z = y/2; above is 1) inexact (due to rounding) and 2) maybe considered "too small".

Math

The z = y/2; can be thought of going through 2 stages: dividing and rounding. The mathematical quotient, with unlimited precision, is less than the smallest normal number FLT_MIN and more than the greatest sub-normal number nextafterf(FLT_MIN,0). Depending on rounding mode, the final answer is either one of those two. With FE_TONEAREST, z is assigned FLT_MIN, a normal number.

Spec

The C spec below and to IEC 60559 indicate

The "underflow" floating-point exception is raised whenever a result is tiny (essentially subnormal or zero) and suffers loss of accuracy.³⁵⁸ C11 §F.10 7.
³⁵⁸ IEC 60559 allows different definitions of underflow. They all result in the same values, but differ on when the floating-point exception is raised.

and

Two definitions were allowed for the determination of the 'tiny' condition: before or after rounding the infinitely precise result to working precision, with unbounded exponent.

Annex U of 754r recommended that only tininess after rounding and inexact as loss of accuracy be a cause for underflow signal. wiki reference

(My emphasis)

Q & A

If STDC_IEC_559 is defined, what is the correct state of underflow in this case?

The underflow flag may be set or left alone in this case. Either complies. There is a preference though, for not setting the underflow flag.

2 With lack of a defined STDC_IEC_559 am I stuck with "Implementations that do not define STDC_IEC_559 are not required to conform to these specifications." C11 or is there some C specification that indicates this result is incorrect?

The setting of the underflow flag result in not incorrect. The FP spec allows this behavior. It also allows to not set the underflow flag.

3 Since this is certainly a result of my hardware (processor), your result may differ and that would be interesting to know.

On another platform, where __STDC_IEC_559__ = not define and FLT_EVAL_METHOD = 0, the FE_INEXACT FE_UNDERFLOW flags were both set, just like in the above tst case. The issue applies to float, double, long double.

If the mathematical answer lies in the grey "Between" zone below, it will get rounding down to a sub-normal double or up to the normal double DBL_MIN depending on its value and rounding mode. If rounded down, then FE_UNDERFLOW is certainly set. If rounded up, then FE_UNDERFLOW may be set or not depending on when determination of the 'tiny' condition is applied.

I look forward to other answers, especially one that cites the floating point spec. — chux - Reinstate Monica, Oct 07 '17 at 12:24
Relevant, worth reading: http://www.jhauser.us/arithmetic/SoftFloat-3/doc/SoftFloat-FAQ.html (scroll down to "I discovered a case where, even though tininess..."). Extra (for better understanding): https://stackoverflow.com/a/67127805/1778275. Also: I think that for subnormals "more dynamic range" is not worth of all the implementation issues w.r.t. subnormal numbers. Maybe it was better to use `±(1.F) × 2^-127` format over `±(0.F) × 2^-126` format? — pmor, Jun 23 '21 at 11:34
@pmor Hmmm, " to use ±(1.F) × 2^-127 format over ±(0.F) × 2^-126 format" --> appears to lose the property of of 2 small FP values `a, b` where `a != b`, then `a-b` does not round 0.0 using ±(0.F) × 2^-126. Yet may round to 0.0 with ±(1.F) × 2^-127. That is a prime reason for sub-normals. — chux - Reinstate Monica, Jun 23 '21 at 11:49
Thanks! Yes, I remember that William Kahan once wrote about such `a != b => a-b != 0.0` case for small FP values. Because of that the rationale of using ±(0.F) × 2^-126 format (more dynamic range despite of less precision) is clear. — pmor, Jun 23 '21 at 16:33

Can the floating-point status flag FE_UNDERFLOW set when the result is not sub-normal?

1 Answers1

Linked