What exactly does (1.0e300 + pow(2.0, -30.0) > 1.0) do in STDC?

Question

I have come across a function which computes atan(x) (the source is here). Reducing it to the core of my question and slightly reformatting it, they have something like that:

static const double one   = 1.0,
                   huge   = 1.0e300;

double atan(double x)
{
  /* A lot of uninteresting stuff here */

  if (ix < 0x3fdc0000) {              /* |x| < 0.4375 */

    if (ix < 0x3e200000) {            /* |x| < 2^-29 */
      if ((huge + x) > one) return x; /* raise inexact */
    }

    id = -1;
  }

  /* A lot of more uninteresting stuff */
}

I would be very interested in what the line if ((huge + x) ... is supposed to do and how it works.

According to the comments, if the absolute value of x is smaller than 2^-29, the expression or the comparison raises an inexact error.

My first problem is that I currently don't get why this is done that way: If computing the arctan using that function leads to imprecise results if the absolute value of x is too small, why don't they just use something like if (fabs(x) < [some_value_here]) ...? I suspect that this is just because the inexact warning wouldn't be raised that way in their hardware / library, but I'd like to know for sure.

Assuming that I am right, my second problem is that I don't understand why the comparison is needed. I think that the key point here is that a very small number is added to a very large one, so that this addition does not change the large one sufficiently or even not at all. Hence, it is the addition which will raise the inexact warning, not the comparison. So I am asking myself what the comparison should do. Is this just to force the compiler to actually compute (huge + x), which might be optimized out otherwise?

Finally, I would be grateful if somebody could explain the math a little bit. Choosing the value 1.0e300 for huge seems a pretty arbitrary choice. But this is only a bonus question since I admit I haven't done the math part of my homework yet (I am not a total newbie regarding double values and their IEEE754 representation, but understanding the mathematical aspects of this code will take me some time, unless somebody gives a short explanation).

EDIT 1

Just seen by accident:

The float32 version of that function, including the weird line discussed above, is nearly literally still in glibc 2.19! Since glibc is meant to be portable, that code should be as well. It is in the sub-directory sysdeps\ieee754\flt-32, so I suppose this is the software emulation of the float32 functions, where portability is no issue because hardware-dependent oddities won't show up (I think the software emulation raises those exceptions exactly as defined in IEEE754).

The title isn't consistent with the code. `2^-30` in the title suggests bitwise exclusive-or of 2 and -30 (though that's not the case because of C++'s oft-surprising operator precedence). :-) — Adrian McCarthy, Feb 19 '19 at 19:48

score 8 · Accepted Answer · answered Feb 19 '19 at 19:10

The intent of if ((huge + x) > one) return x; is to generate a floating-point inexact exception and then return from the routine.

A floating-point exception is not a trap or processor exception. It just means something unusual has happened in a floating-point operation. What happens then depends on the circumstances of the operation. In particular, the floating-point environment might be set so that an inexact exception merely raises a flag in a special register and continues with the operation, delivering a numerical result. Or it might be set so that an inexact exception causes a trap, and program control is redirected to a trap handler.

This code implementing atan does not know how the floating-point environment is set. Maybe it could fetch the settings, but it does not want to bother with that. Given that it has decided the arctangent function cannot be computed exactly, the easiest way for it to trigger a floating-point inexact exception is merely to perform a simple addition that has an inexact result. This inexact addition will have the same behavior that is desired for an inexact arctangent—it will either just raise the flag or cause a trap, depending on the settings.

As to why comparisons are made with ix < 0x3e200000, that is not clear. For one thing, ix has been adjusted to reflect the absolute value, whereas x has not, so why not use the already prepared ix rather than using another operation to produce fabs(x)? Additionally, an integer compare typically takes less processor resources than a floating-point compare, especially in the processors of the time when this code was written. Or it may be the author simply happened to use one over the author, perhaps writing most of their code using ix to operate on the floating-point encoding and not x to operate on the floating-point value, and they did not want to switch back and forth unnecessarily. It could also be because the code was written before the hexadecimal floating-point notation was available (so we could write x < 0x1p-29f), and compilers were not good about converting decimal numerals to floating-point values, so they did not want to write out the floating-point value in the source code.

This sort of code is problematic and is highly dependent on the C implementation it is written for. In general, there may not be a guarantee from the compiler that (huge + x) > one will be evaluated during program execution. The compiler might evaluate it at compile time. Presumably, though, if this code is written for a particular C implementation, they know the compiler will either evaluate it at compile time or will ensure the same result is achieved, including raising of the floating-point inexact exception.

On the face of it, (huge + x) > one does not seem to do anything useful that huge + x alone does not do, but perhaps the author knew something about the compiler that we do not.

huge does not need to be 1.0e300. Any value so large that the sum of huge and x cannot be exact would suffice.

Thanks for the explanation so far! One thing: *[...] Additionally, an integer compare typically takes less processor resources than a floating-point compare [...]* Yes, I've got that they first check against `2^-29`, and I understand the way they do it. But the following code puzzles me. Obviously, they build and compare `huge + x` only if `x < 2^-29` (which I eventually will understand some day), but also obviously, for any given `x`, the choice of `huge` will influence whether `inexact` is triggered or not (e.g. `1.0e300` vs. `1.0e307`). So how did they derive the value `1.0e300`? — Binarus, Feb 19 '19 at 19:42
@Binarus: Since `x` is less than 2^-29, its highest bit is less than 2^-30 or lower. The `double` format this code was written for has only 53 bits in the significand. So if the highest bit of a number is 2^23, the lowest bit in its significand could be at most 2^(23-52) = 2^-29. Therefore, if any number 2^23 or greater is added to a positive number less than 2^-29, the result cannot be exactly represented, so an inexact exception occurs. However, at this point, x could be negative, in which case the sum could come down slightly from 2^23 and have another bit available. Using 2^24 avoids that. — Eric Postpischil, Feb 19 '19 at 19:59
One more interesting thing (just saw it by accident): The `float32` version of that function, including the trick discussed in my question, is nearly literally still in glibc 2.19! Since glibc is meant to be portable, the code should be as well. It is in the sub-directory `sysdeps\ieee754\flt-32`, so I suppose this is the software emulation of the `float32` functions, where portability is no issue (I think the software emulation raises those exceptions exactly as defined in IEEE754). I'll add this finding to my question to emphasize that this trick is still used. — Binarus, Feb 19 '19 at 19:59
So, adding any number 2^24 or greater to x would cause an inexact exception. It does not need to be 1e300. — Eric Postpischil, Feb 19 '19 at 19:59
Fantastic explanation in your comments, thank you very much! Accepted and +1. — Binarus, Feb 19 '19 at 20:01

What exactly does (1.0e300 + pow(2.0, -30.0) > 1.0) do in STDC?

1 Answers1