NAN propagation and IEEE 754 standard

Question

I am designing a new microprocessor instruction set (www.forwardcom.info) and I want to use NAN propagation for tracing errors. However, there are a number of oddities in the IEEE 754 floating point standard that prevent this.

First, the reason why I want to use NAN propagation rather than error trapping is that I have vector registers with variable length. If, for example, I have a float vector with 8 elements and I have 1/0 in the first element and 0/0 in the sixth element, then I get only one trap, but if I run the same program on a computer with half the vector length then I get two traps: one for infinity and one for NAN. I want the result to be independent of the vector length so I need to rely on the propagation of NAN and INF rather than trapping. The NAN and INF values will propagate through the calculations so that they can be checked in the final result. The NAN representation contains some bits called payload that can be used for information about the source of error.

However, there are two problems in the IEEE 754 floating point standard that prevent reliable propagation of NAN values.

The first problem is that the combination of two NANs with different payloads is just one of the two values. For example NAN1 + NAN2 gives NAN1. This violates the fundamental principle that a+b = b+a. The compiler can swap the operands so that you get different results on different compilers or with different optimization options. I prefer to get the bitwise OR combination of the two payloads. This will work if you have one bit for each error condition, but of course not if the payload contains more complex information (such as NAN boxing in languages with dynamic types). The standards committee actually discussed the OR'ing solution (see http://grouper.ieee.org/groups/754/email/msg01094.html). I don't know why they rejected this proposal.

The second problem is that the min and max functions do not propagate the NAN if only one of the inputs is a NAN. In other words, min(1,NAN) = 1. Reliable NAN propagation would of course require that min(1,NAN) = NAN. I have no idea why the standard says this.

In the new microprocessor system, named ForwardCom, I want to avoid these unfortunate quirks and specify that NAN1 + NAN2 = NAN1 | NAN2, and min(1,NAN) = NAN.
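The proposed addition rule can be modelled in software. The following is only a sketch for binary32, assuming the top fraction bit is the quiet bit and the remaining 22 bits are payload; `make_qnan`, `nan_payload` and `add_or_payload` are hypothetical helper names, not part of ForwardCom or any library.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* binary32: 1 sign bit, 8 exponent bits, 23 fraction bits. Assumption:
   the top fraction bit is the quiet bit and the low 22 bits are payload. */
#define F32_QNAN_BITS    0x7FC00000u  /* exponent all ones + quiet bit */
#define F32_PAYLOAD_MASK 0x003FFFFFu

/* Hypothetical helper: build a quiet NaN carrying the given payload. */
static float make_qnan(uint32_t payload)
{
    uint32_t bits;
    float x;
    assert((payload & ~F32_PAYLOAD_MASK) == 0); /* must fit in 22 bits */
    bits = F32_QNAN_BITS | payload;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: extract the payload bits of a NaN. */
static uint32_t nan_payload(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits & F32_PAYLOAD_MASK;
}

/* Software model of the proposed addition: NaN payloads are OR'ed, so
   the result does not depend on operand order. The sign bit of a NaN
   input is dropped in this sketch. */
static float add_or_payload(float a, float b)
{
    uint32_t payload = 0;
    if (!isnan(a) && !isnan(b))
        return a + b;  /* ordinary addition; may itself create a NaN */
    if (isnan(a)) payload |= nan_payload(a);
    if (isnan(b)) payload |= nan_payload(b);
    return make_qnan(payload);
}
```

With this model, `add_or_payload(make_qnan(1), make_qnan(4))` and the same call with the operands swapped both carry payload 5.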

And now to my questions: First, do I need an option switch to change between strict IEEE conformance and the reliable NAN propagation? Quoting the standard:

Quiet NaNs should, by means left to the implementer’s discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results. To facilitate propagation of diagnostic information contained in NaNs, as much of that information as possible should be preserved in NaN results of operations.

Note that the standard says "should" here, where it has "shall" elsewhere. Does that mean that my deviation from the recommendation is permissible?

And the second question: I cannot find any examples where NAN propagation is actually used for tracing errors. Maybe this is because of the weaknesses in the standard. I want to define different payload bits for different error conditions, for example:

0/0, 0*∞, ∞/∞, modulo(1,0), modulo(∞,1), ∞-∞, and other errors involving infinity and division by zero.
sqrt(-1), log(-1), pow(-1,0.1), and other errors deriving from logarithms and powers.
asin(2) and other mathematical functions.
explicit assignment. This can be useful when a variable is initialized to a NAN.

There are plenty of vacant bits for user-defined error codes.
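One possible encoding of the four categories above, written as a sketch. The bit positions and all names (`nan_error`, `nan_with_errors`, `nan_errors`, `nan_error_name`) are assumptions chosen for illustration, and a 22-bit binary32 payload is assumed.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define F32_QNAN_BITS    0x7FC00000u  /* exponent all ones + quiet bit */
#define F32_PAYLOAD_MASK 0x003FFFFFu  /* assumed 22 payload bits */

/* Hypothetical bit assignment, one bit per error category. */
enum nan_error {
    NAN_ERR_INF_ZERO = 1u << 0, /* 0/0, 0*inf, inf/inf, inf-inf, modulo(1,0) */
    NAN_ERR_LOG_POW  = 1u << 1, /* sqrt(-1), log(-1), pow(-1,0.1) */
    NAN_ERR_MATH_FN  = 1u << 2, /* asin(2) and other functions */
    NAN_ERR_EXPLICIT = 1u << 3  /* variable explicitly set to NaN */
};

/* Build a quiet NaN whose payload records the given error bits. */
static float nan_with_errors(uint32_t errors)
{
    uint32_t bits;
    float x;
    assert((errors & ~F32_PAYLOAD_MASK) == 0);
    bits = F32_QNAN_BITS | errors;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Return the recorded error bits, or 0 if x is not a NaN. */
static uint32_t nan_errors(float x)
{
    uint32_t bits;
    if (!isnan(x))
        return 0;
    memcpy(&bits, &x, sizeof bits);
    return bits & F32_PAYLOAD_MASK;
}

/* Human-readable name for a single error bit. */
static const char *nan_error_name(uint32_t bit)
{
    switch (bit) {
    case NAN_ERR_INF_ZERO: return "infinity or division by zero";
    case NAN_ERR_LOG_POW:  return "logarithm or power";
    case NAN_ERR_MATH_FN:  return "other mathematical function";
    case NAN_ERR_EXPLICIT: return "explicit assignment";
    default:               return "user-defined or unknown";
    }
}
```

A final result can then be tested with `nan_errors(result) & NAN_ERR_LOG_POW`, for example, to see whether a logarithm or power error occurred anywhere in the calculation.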

Has this been done before, or do I have to invent everything from scratch? Are there any problems that I have to consider (other than NAN boxing in certain languages)?

So, you want NaN to act like a number, even though it's Not a Number? — Pete Becker, Feb 27 '18 at 14:49
Some related links: https://stackoverflow.com/questions/45174949/can-we-use-any-value-in-floating-point-for-customized-flags https://stackoverflow.com/questions/33967804/what-uses-do-floating-point-nan-payloads-have https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values — A Fog, Feb 27 '18 at 14:50
@AFog: "*I don't know why they rejected this proposal.*" Why would they accept it? NaN doesn't exist to transmit "more complex information"; it exists to say one simple thing: that didn't work. Generally speaking, users are not trying to use NaN to carry error codes. — Nicol Bolas, Feb 27 '18 at 16:25
@AFog: "*The second problem is that the min and max functions do not propagate the NAN if only one of the inputs is a NAN.*" How could they be expected to do so? If you do `if(x > y) return x; else return y;`, if `x` is 1 and `y` is NaN, the result will be `x`. That is what `min` does, so there's no reason to expect it to propagate NaN. It would be strange indeed if the spelled-out version had different behavior from the intrinsic. — Nicol Bolas, Feb 27 '18 at 16:27
Any comparison involving NAN returns false. Even x==x returns false when x is NAN. min(x,y) can be implemented as min(x,y) = x < y ? x : y. This will return y if any of the inputs is NAN, so min(1,NAN) = NAN, and min(NAN,1) = 1. This is illogical. We would expect min(x,y) and min(y,x) to be the same. — A Fog, Feb 27 '18 at 16:38
@NicolBolas: NaN does not exist solely to record an invalid operation. Payloads are used to convey information by some users, and the IEEE 754 did and does consider that in its deliberations. — Eric Postpischil, Feb 27 '18 at 17:11
@AFog Apple's [SANE](https://en.wikipedia.org/wiki/Standard_Apple_Numerics_Environment) used NaN payloads to indicate the specific invalid operation from which the NaN originated. I *think* (not sure) the original ARM floating-point unit did as well. Having worked on floating-point processors and libraries in a professional capacity for many years, I'd say that NaN payloads are rarely, if ever, used in practice. Therefore IEEE-754 allows for other solutions, such as the canonical NaN used for single-precision arithmetic on NVIDIA GPUs. Use of a canonical NaN simplifies hardware and software. — njuffa, Feb 27 '18 at 17:19
@NicolBolas: It is not generally a goal of the IEEE 754 committee to make an “intrinsic” out of “spelled out” code. It is certainly a goal to standardize useful operations in beneficial ways. However, in this case, the “spelled out” code is not informative as `min` might be formed from `x < y ? x : y` or `x > y ? y : x`. These would produce different results for x = 1, y = NaN, and choosing one would be arbitrary. It would be preferable if the operation were commutative, making it not subject to changes when order of computation changes. — Eric Postpischil, Feb 27 '18 at 17:29
@AFog: The current draft for the next IEEE 754 revision contains both NaN-favoring `minimum` and `maximum` and number-favoring `minimumNumber` and `maximumNumber`. This means an application would be able to choose what suits it, but your instruction set would have to support both if you intend it to provide conformance. — Eric Postpischil, Feb 27 '18 at 17:32
The first step would be to define how to print the payload, which, IIRC, is not defined. — chux - Reinstate Monica, Feb 27 '18 at 19:32
"Reliable NAN propagation would of course require that min(1,NAN) = NAN" --> Do you also expect `min(NAN, 1) --> NAN` (communicative?) and what to expect for `min(NAN1, NAN2)`? IMO, `foo(x,y)` should nominally return `x` if `isNAN(x) && !isNAN(y)`, return `y` if `isNAN(y) && !isNAN(x)`, and return `NANxy` when both are same `NANxy` and return TBD when both are NAN, yet differ. — chux - Reinstate Monica, Feb 27 '18 at 19:54
@AFog: The working drafts are [here](http://754r.ucbtest.org/drafts/). — Eric Postpischil, Feb 28 '18 at 22:24
NaNs are definitely better than undefs in this aspect. In Haskell, for instance, with all its elaborate type system, `undefined && False` is an error, while `False && undefined` is False. — bipll, Feb 28 '18 at 23:31

score 6 · Answer 1 · edited Oct 08 '20 at 11:17

Yes, you are allowed to deviate from the "should"s. From the spec (§1.6):

― may indicates a course of action permissible within the limits of the standard with no implied preference (“may” means “is permitted to”)

― shall indicates mandatory requirements strictly to be followed in order to conform to the standard and from which no deviation is permitted (“shall” means “is required to”)

― should indicates that among several possibilities, one is recommended as particularly suitable, without mentioning or excluding others; or that a certain course of action is preferred but not necessarily required; or that (in the negative form) a certain course of action is deprecated but not prohibited (“should” means “is recommended to”).

Regarding the behaviour of min, the Intel implementation also differs from the IEEE spec. From the Intel instruction set reference for MINSD:

If a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

In other words, it corresponds to x < y ? x : y. (See "Argument order to std::min changes compiler output for floating-point" for more details: this matches C++ std::min, not the C math library fmin, which returns the non-NaN operand when exactly one input is a NaN, as the IEEE 754-2008 minNum operation does.)
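A small C sketch of that select, to make the order dependence concrete. `min_select` and `check_min_select` are hypothetical names; the function only mirrors the expression above and is not Intel's definition of the instruction.

```c
#include <assert.h>
#include <math.h>

/* Select-style minimum: whenever the comparison is false, which
   includes every case with a NaN input, the second operand is
   returned. */
static double min_select(double x, double y)
{
    return x < y ? x : y;
}

/* Checks showing that the result depends on operand order. */
static void check_min_select(void)
{
    assert(min_select(2.0, 1.0) == 1.0);
    assert(isnan(min_select(1.0, NAN)));  /* NaN second: NaN returned */
    assert(min_select(NAN, 1.0) == 1.0);  /* NaN first: NaN is lost   */
}
```

So a NaN survives only when it happens to be the second operand, which is exactly the asymmetry the question objects to.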

I'm not actually sure what particular sequence they have in mind, but there is an alternative approach suggested here https://github.com/JuliaLang/julia/issues/7866#issuecomment-51845730.

Thanks for the reference to the Julia discussion. It has many relevant points — A Fog, Mar 01 '18 at 07:09
My pleasure! Thanks for the excellent resources you provide on your website, I have found them very valuable over the years. — Simon Byrne, Mar 01 '18 at 17:25
For people used to C and/or x86 terminology, IEEE `minimum` (NaN-propagating) is like C math library `fmin`. x86 `minsd` and so on implement C++ `std::min`, both of which exactly implement `x — Peter Cordes, Oct 08 '20 at 11:14

score 4 · Answer 2 · answered Feb 28 '18 at 23:12

Some thoughts:

The second problem is that the min and max functions do not propagate the NAN if only one of the inputs is a NAN. In other words, min(1,NAN) = 1. Reliable NAN propagation would of course require that min(1,NAN) = NAN. I have no idea why the standard says this.

The current draft for the next IEEE 754 revision contains both NaN-favoring minimum and maximum and number-favoring minimumNumber and maximumNumber. This means an application would be able to choose what suits it, but your instruction set would have to support both if you intend it to provide conformance. (Note “support” rather than “implement.” An instruction set does not need to directly implement IEEE 754 operations in individual instructions in order to enable a computing platform to conform to IEEE 754—it just needs to provide instructions from which a conforming platform can be constructed. It is okay if an IEEE 754 operation requires multiple instructions or support from the operating system or libraries.)
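The difference between the two families can be sketched in C as follows. This is a simplified model with hypothetical function names; it ignores signaling NaNs and any special treatment of signed zeros, so it is not a conforming implementation of the draft operations.

```c
#include <assert.h>
#include <math.h>

/* NaN-favoring: a NaN in either operand is returned. */
static double minimum_sketch(double x, double y)
{
    if (isnan(x)) return x;
    if (isnan(y)) return y;
    return x < y ? x : y;
}

/* Number-favoring: a single NaN operand is ignored. */
static double minimum_number_sketch(double x, double y)
{
    if (isnan(x)) return y;  /* y may also be NaN; then NaN is returned */
    if (isnan(y)) return x;
    return x < y ? x : y;
}

/* Both variants give the same answer regardless of operand order. */
static void check_minimum_variants(void)
{
    assert(isnan(minimum_sketch(1.0, NAN)));
    assert(isnan(minimum_sketch(NAN, 1.0)));
    assert(minimum_number_sketch(1.0, NAN) == 1.0);
    assert(minimum_number_sketch(NAN, 1.0) == 1.0);
}
```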

And now to my questions: First, do I need an option switch to change between strict IEEE conformance and the reliable NAN propagation?

Since what NaN you return is only a “should” in the standard, you do not need to return the recommended NaN to claim conformance. However, minimum(1, NaN) must return a NaN.

Of course, you do not have to do that via a switch, and environmental state is disfavored due to its drag on performance. Selecting between behaviors could be done with different instructions or different inputs to the instructions via an additional register or additional bits accompanying the normal register contents.

And the second question: I cannot find any examples where NAN propagation is actually used for tracing errors.

I recall at least one IEEE 754 committee member making use of NaN payloads, but I do not recall who or the details.

score 2 · Answer 3 · answered Mar 02 '18 at 08:52

Regarding the addition of two NANs: when you add two NANs with different payloads you just get one of them, usually the first one. This makes a+b different from b+a, which is unacceptable because the compiler may swap the operands. Above, I proposed to return the bitwise OR combination of the two payloads. Thinking about it, there is another possible solution: return the biggest of the two payloads.

The 'OR' solution has the advantage that it is simple. The disadvantage is that it limits the useful information you can have in the payload to one bit for each possible error condition. It would still be quite useful, though, because the number of different events that can generate a NaN is less than the number of payload bits.

The second solution where you return the biggest of the two payloads requires slightly more hardware. The advantage is that you can have more detailed information in the payload, perhaps including information about where the fault occurred. The disadvantage is that you only propagate information about the worst of two faults. This solution is completely compatible with the current standard. New processors can implement this without needing a switch for backward compatibility.
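A software sketch of the biggest-payload rule for binary32, assuming a 22-bit payload below the quiet bit. The helper names are hypothetical, and the tie-breaking and handling of the sign bit are choices made only for this example.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

#define F32_QNAN_BITS    0x7FC00000u  /* exponent all ones + quiet bit */
#define F32_PAYLOAD_MASK 0x003FFFFFu  /* assumed 22 payload bits */

/* Hypothetical helper: build a quiet NaN carrying the given payload. */
static float make_qnan(uint32_t payload)
{
    uint32_t bits;
    float x;
    assert((payload & ~F32_PAYLOAD_MASK) == 0);
    bits = F32_QNAN_BITS | payload;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: extract the payload bits of a NaN. */
static uint32_t nan_payload(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits & F32_PAYLOAD_MASK;
}

/* Addition that returns the NaN input with the larger payload, so the
   result is one of the inputs and does not depend on operand order. */
static float add_max_payload(float a, float b)
{
    int a_nan = isnan(a);
    int b_nan = isnan(b);
    if (a_nan && b_nan)
        return nan_payload(a) >= nan_payload(b) ? a : b;
    if (a_nan) return a;
    if (b_nan) return b;
    return a + b;
}
```

Because the result is always one of the input NaNs, a payload here could hold a numeric code or an address-like value rather than a bit mask.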

I like the idea of returning the biggest payload as a way of making it symmetric. However, some applications may still prefer the bitwise OR, or return the first one (asymmetric). Perhaps it should be configurable, just like the rounding modes. — Yakov Galka, May 13 '22 at 16:01

score 1 · Answer 4 · answered Mar 02 '18 at 13:35

Just to add to this discussion, the IEEE standard explicitly allows for that error-encoding flexibility in NaN, but indicates that it is to be done by programming language implementations rather than at the hardware layer. That said, I do like the point about supporting NaN-poisoning semantics with bitwise OR at the hardware level. I've been exploring adding the same semantics to the GHC Haskell compiler.

I do think the trapping/signaling semantics would still be useful to provide. In many programming languages and programs, the set of enabled traps can be treated as abortive exceptions in the underlying computation. This means the platform-varying issue of whether one or two errors were reported in tandem doesn't change the “meaning” of the local computation. (In fact, it could be argued that many high-level programming languages would benefit from support for treating signaling NaNs as exceptions, which seems to be largely lacking.)
