
Using `double fma(double x, double y, double z);`, I'd expect a non-zero `d` in the output lines below marked with '?'. It appears that my `fma()` internally uses only long double precision, rather than the infinite precision the standard specifies.

"The fma functions compute (x × y) + z, rounded as one ternary operation: they compute the value (as if) to *infinite precision* and round once to the result format, according to the current rounding mode." (C11 §7.12.13.1 ¶2, my emphasis)

So is my `fma()` broken, or am I using it incorrectly in code or compile options?

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
  // Invoking: Cygwin C Compiler
  // gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 
  //   -v -MMD -MP -MF"x.d" -MT"x.o" -o "x.o" "../x.c"

  printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
  for (unsigned i = 20; i < 55; i++) {
    volatile double a = 1.0 + 1.0 / pow(2, i);
    volatile double b = a;
    volatile double c = a * b;
    volatile double d = fma(a, b, -c);
    // "?" flags rows where d disagrees with expectation: d should be
    // non-zero exactly when a*b is inexact in double (i >= 27) and a != 1.0.
    volatile char *nz = ((i >= 27 && a != 1.0) == !d) ? "?" : "";
    printf("i:%2u a:%21.13a c:%21.13a d:%10a %s\n", i, a, c, d, nz);
  }
  return 0;
}

Output

FLT_EVAL_METHOD 2
i:20 a: 0x1.0000100000000p+0 c: 0x1.0000200001000p+0 d:    0x0p+0 
i:21 a: 0x1.0000080000000p+0 c: 0x1.0000100000400p+0 d:    0x0p+0 
i:22 a: 0x1.0000040000000p+0 c: 0x1.0000080000100p+0 d:    0x0p+0 
i:23 a: 0x1.0000020000000p+0 c: 0x1.0000040000040p+0 d:    0x0p+0 
i:24 a: 0x1.0000010000000p+0 c: 0x1.0000020000010p+0 d:    0x0p+0 
i:25 a: 0x1.0000008000000p+0 c: 0x1.0000010000004p+0 d:    0x0p+0 
i:26 a: 0x1.0000004000000p+0 c: 0x1.0000008000001p+0 d:    0x0p+0 
i:27 a: 0x1.0000002000000p+0 c: 0x1.0000004000000p+0 d:   0x1p-54 
i:28 a: 0x1.0000001000000p+0 c: 0x1.0000002000000p+0 d:   0x1p-56 
i:29 a: 0x1.0000000800000p+0 c: 0x1.0000001000000p+0 d:   0x1p-58 
i:30 a: 0x1.0000000400000p+0 c: 0x1.0000000800000p+0 d:   0x1p-60 
i:31 a: 0x1.0000000200000p+0 c: 0x1.0000000400000p+0 d:   0x1p-62 
i:32 a: 0x1.0000000100000p+0 c: 0x1.0000000200000p+0 d:    0x0p+0 ?
i:33 a: 0x1.0000000080000p+0 c: 0x1.0000000100000p+0 d:    0x0p+0 ?
i:34 a: 0x1.0000000040000p+0 c: 0x1.0000000080000p+0 d:    0x0p+0 ?
...
i:51 a: 0x1.0000000000002p+0 c: 0x1.0000000000004p+0 d:    0x0p+0 ?
i:52 a: 0x1.0000000000001p+0 c: 0x1.0000000000002p+0 d:    0x0p+0 ?
i:53 a: 0x1.0000000000000p+0 c: 0x1.0000000000000p+0 d:    0x0p+0 
i:54 a: 0x1.0000000000000p+0 c: 0x1.0000000000000p+0 d:    0x0p+0 

Version info

gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-pc-cygwin/5.3.0/lto-wrapper.exe
Target: i686-pc-cygwin
Configured with: /cygdrive/i/szsz/tmpp/gcc/gcc-5.3.0-5.i686/src/gcc-5.3.0/configure --srcdir=/cygdrive/i/szsz/tmpp/gcc/gcc-5.3.0-5.i686/src/gcc-5.3.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=i686-pc-cygwin --host=i686-pc-cygwin --target=i686-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --libexecdir=/usr/lib --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --enable-__cxa_atexit --with-dwarf2 --with-arch=i686 --with-tune=generic --disable-sjlj-exceptions --enable-languages=ada,c,c++,fortran,java,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libcilkrts --enable-libgomp --enable-libitm --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-libada --enable-libjava --enable-libgcj-sublibs --disable-java-awt --disable-symvers --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --enable-linker-build-id --with-default-libstdcxx-abi=gcc4-compatible
Thread model: posix
gcc version 5.3.0 (GCC) 
chux - Reinstate Monica
  • If it means anything at all, this comes up rainbows and butterflies using the clang 3.8 dist on my mac. No `?` to be found. – WhozCraig Feb 10 '17 at 18:51
  • @WhozCraig Seems I am experiencing point #4 of [this related answer](http://stackoverflow.com/a/28631091/2410359) unlike your good platform. – chux - Reinstate Monica Feb 10 '17 at 18:56
  • Well, that *sucks*. (like I had to tell you that). Sry, man. – WhozCraig Feb 10 '17 at 18:58
  • @chux Using gcc 6.3.1, all looks fine! No `?` to be seen. – PeCosta Feb 10 '17 at 19:54
  • Sounds like it's a problem with your math library: what does `ldd` on the resulting executable give? – Simon Byrne Feb 10 '17 at 21:51
  • @SimonByrne "ntdll.dll => /cygdrive/c/Windows/SysWOW64/ntdll.dll (0x779f0000) kernel32.dll => /cygdrive/c/Windows/syswow64/kernel32.dll (0x77010000) KERNELBASE.dll => /cygdrive/c/Windows/syswow64/KERNELBASE.dll (0x755d0000) cygwin1.dll => /usr/bin/cygwin1.dll (0x61000000) ", but this may not reflect the .exe as posted above as I have been updating various tools. – chux - Reinstate Monica Feb 10 '17 at 21:54
  • Hmmm Updating to gcc version 5.4.0 did not help. – chux - Reinstate Monica Feb 10 '17 at 22:10
  • Why the `volatile` qualifiers? Not that they should affect the result, but why add them at all? – Jonathan Leffler Feb 10 '17 at 22:14
  • @JonathanLeffler I was trying to ensure the math was done and the arguments _passed_ to `printf()` as `double` and not `long double`, considering `FLT_EVAL_METHOD 2`. – chux - Reinstate Monica Feb 10 '17 at 22:39
  • From these results, I would guess your `fma` is implemented via `fma(a,b,c) = a*(long double)b+c`. Which is, of course, bogus. – tmyklebu Feb 11 '17 at 01:42
  • What if you feed the compiler `-mfpmath=sse2`? Still garbage or does it work? – tmyklebu Feb 11 '17 at 01:44
  • @tmyklebu I will need to try your good idea later. – chux - Reinstate Monica Feb 11 '17 at 02:11
  • Why trust the C++11 standard? The standard makes a lot of assumptions that are mathematically correct but often unrealistic for hardware implementation and efficiency. I guess most compilers won't follow this crazy C++11 fma() standard, which implies a 128-bit multiplication is necessary. – DU Jiaen Feb 12 '18 at 08:07
  • @DUJiaen Unlike many standard library functions, `fma()` has specified precision: "... compute the value (as if) to infinite precision and round once to the result format ...". It may be inefficient to implement on various platforms, yet that is also true with many math.h functions. A library should be expected to meet the spec or it is not compliant. What is assumed is my library's choice to have fast access to an underlying fma-like processor instruction rather than meet the spec. The spec has no speed requirement, just functionality. – chux - Reinstate Monica Feb 12 '18 at 15:25
  • @chux, nowadays there are often communication gaps and misunderstandings between the standard C/C++ committee and the developers who actually implement the compilers. And since the standard says "This operation is commonly implemented in hardware as fused multiply-add CPU instruction" (http://en.cppreference.com/w/c/numeric/math/fma), the compiler developers can simply blame the CPU designers for not providing infinite precision in the FMA instruction. I personally think this is just politics rather than a question of who is right or wrong. – DU Jiaen Feb 13 '18 at 02:50

1 Answer


It is Cygwin's fault; or more correctly, that of the newlib C library it uses. Newlib explicitly says it does not even try to get `fma()` emulation correct.

The GNU C library has had correct emulation for almost all `fma()` variants since 2015. For details, and the patches used to implement this, see sourceware bug 13304.

If efficiency is not a problem, then I would simply use e.g.

#if defined(__CYGWIN__) && !defined(__FMA__) && !defined(__FMA3__) && !defined(__FMA4__)
double fma_emulation(double x, double y, double z)
{
    /* One of the implementations linked above */
}

#define fma(x, y, z)  fma_emulation((x), (y), (z))
#endif
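
For concreteness, here is one possible shape for that body: a minimal sketch built from Dekker's TwoProduct and Knuth's TwoSum (my illustration, not the glibc code). It assumes strict IEEE-754 double evaluation (FLT_EVAL_METHOD == 0, so not the x87 setup in the question unless you compile with SSE math, and no -ffast-math), assumes no intermediate overflow, and its final addition is not guaranteed to round correctly in every case:

#include <math.h>

/* Sketch only: error-free product (Dekker) + error-free sum (Knuth).
   Requires FLT_EVAL_METHOD == 0 and no intermediate overflow; the final
   addition may still round incorrectly in rare cases, so this is NOT a
   fully correct fma() replacement. */
double fma_emulation(double x, double y, double z)
{
    /* Veltkamp split of x and y into high/low halves using 2^27 + 1. */
    const double split = 0x1p27 + 1.0;
    double xh = split * x;  xh -= xh - x;  double xl = x - xh;
    double yh = split * y;  yh -= yh - y;  double yl = y - yh;

    /* Dekker TwoProduct: hi + lo == x*y exactly. */
    double hi = x * y;
    double lo = ((xh * yh - hi) + xh * yl + xl * yh) + xl * yl;

    /* Knuth TwoSum: s + t == hi + z exactly. */
    double s = hi + z;
    double v = s - hi;
    double t = (hi - (s - v)) + (z - v);

    return s + (t + lo);  /* single, possibly imperfect, final rounding */
}

Under those assumptions, fma_emulation(a, a, -c) reproduces the question's expected non-zero d values, since lo captures exactly the residual of the rounded product.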

I do not personally use Windows at all, but if anyone does (uses Windows and needs the fma emulation), I'd suggest they try to push a patch upstream, with a link to the GNU C library discussion on correct fma emulation.


What I am wondering is whether it would be possible to examine just the low M bits of the result (those discarded in the rounding) to determine the correct value of the ULP in the result, and to adjust the result obtained using the straightforward a×b+c operation accordingly with nextafter(), rather than using multiprecision arithmetic to implement the entire operation.

Edit: No, because the addition may overflow, dropping an extra bit as the MSB of the discarded part. For that reason alone, we do need to do the entire operation. Another reason is that if a×b and c have different signs, then instead of an addition we do a subtraction (smaller in magnitude from larger in magnitude, the result taking the larger's sign), which may clear several high bits of the larger value, and that affects which bits of the entire result are dropped in the rounding.

However, for IEEE-754 Binary64 double on the x86 and x86-64 architectures, I do believe fma emulation using 64-bit (integer) registers and a 128-bit product is still quite feasible. I shall experiment with a representation where the low 2 bits of a 64-bit register are used as the rounding decision bits (the LSB being a logical OR of all the dropped bits), 53 bits are used for the mantissa, and one for a carry, leaving the 8 high bits unused and ignored. The rounding is performed when the unsigned integer mantissa is converted to a (64-bit) double. If these experiments yield anything useful, I shall describe them here.


Initial findings: fma() emulation on a 32-bit system is slow. The 80-bit stuff on the 387 FPU is basically useless here, and implementing the 53×53-bit multiplication (and bit shifting) on a 32-bit system is simply not worth the effort. The glibc fma() emulation code linked to above is good enough in my opinion.

Additional findings: Handling non-finite values is nasty. (Subnormals are only slightly annoying, requiring special handling, since the implicit MSB of the mantissa is then zero.) If any of the three arguments is non-finite (an infinity or some form of NaN), then returning a*b + c (not fused) is the only sane option. Handling these cases requires additional branching, which slows down the emulation.
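
As a sketch of that branching (my own illustration; fma_finite is a hypothetical helper for the finite path), including one corner case worth handling separately: when z is infinite while x and y are finite, the unfused product could overflow to the opposite infinity and yield a spurious NaN, so that case should return z directly:

#include <math.h>

double fma_finite(double x, double y, double z);  /* hypothetical finite path */

double fma_emulated(double x, double y, double z)
{
    /* NaNs, or an infinite x or y: the unfused expression already
       follows the IEEE-754 special-value rules, so just use it. */
    if (isnan(x) || isnan(y) || isnan(z) || isinf(x) || isinf(y))
        return x * y + z;

    /* x and y finite, z infinite: the exact product x*y is finite, so
       the result is z; x*y + z could instead round the product to the
       opposite infinity and produce a spurious NaN. */
    if (isinf(z))
        return z;

    return fma_finite(x, y, z);
}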

Final decision: The number of cases to handle in an optimized manner (rather than using the multiprecision "limb" approach as in the glibc emulation) is large enough to make this approach not worth the effort. If each limb is 64 bits, each of a, b, and c is spread over at most 2 limbs, and a×b over at most three limbs. (With 32-bit limbs, that is 3 and 5 limbs, respectively.) Depending on whether a×b and c have the same or different signs, there are only two fundamentally different cases to handle: with different signs, the addition turns into a subtraction (smaller in magnitude from larger, the result getting the same sign as the larger value).
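
As an illustration of the limb arithmetic involved (my sketch, not the glibc code), the exact a×b mantissa product is a 64×64 → 128-bit multiply, which can be built portably from 32-bit halves:

#include <stdint.h>

/* Portable 64x64 -> 128-bit product in two 64-bit limbs.  With 53-bit
   mantissa inputs, the high limb holds the top 42 bits of the 106-bit
   product. */
static void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al = a & 0xFFFFFFFFu, ah = a >> 32;
    uint64_t bl = b & 0xFFFFFFFFu, bh = b >> 32;

    uint64_t t0 = al * bl;                       /* low  x low      */
    uint64_t t1 = ah * bl + (t0 >> 32);          /* cannot overflow */
    uint64_t t2 = al * bh + (t1 & 0xFFFFFFFFu);  /* cannot overflow */

    *lo = (t0 & 0xFFFFFFFFu) | (t2 << 32);
    *hi = ah * bh + (t1 >> 32) + (t2 >> 32);
}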

In short, the multiprecision approach is better. The actual precision needed is very well bounded, and does not even need dynamic allocation. If the product of the mantissas of a and b can be calculated efficiently, the multiprecision part is limited to holding the product and handling the addition/subtraction. The final rounding can be done by converting the result to a 53-bit mantissa, an exponent, and two extra low bits (the higher being the most significant bit lost in the rounding, and the lower being an OR of the rest of the lost bits). Essentially, the key operations can be done using integer (or SSE/AVX) registers, and the final conversion from a 55-bit mantissa to double handles the rounding according to the current rules.
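
A tiny demonstration of that final-rounding trick (my example; the values are chosen to land exactly on a tie, which must round up to the even mantissa):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t mant   = (1ull << 53) - 1;  /* 53 one-bits: an odd mantissa  */
    unsigned rnd    = 1;                 /* most significant dropped bit  */
    unsigned sticky = 0;                 /* OR of all lower dropped bits  */

    /* 55-bit value: mantissa bits, then rnd, then sticky; here it is
       exactly halfway between two representable doubles. */
    uint64_t m55 = (mant << 2) | (uint64_t)(rnd << 1) | sticky;

    /* The integer-to-double conversion rounds 55 -> 53 bits under the
       current rounding mode; ties-to-even bumps the odd mantissa up.
       Scaling by 0x1p-2 is exact (a power of two). */
    double d = (double)m55 * 0x1p-2;
    printf("%a\n", d);                   /* 0x1p+53 (format may vary)    */
    return 0;
}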

Nominal Animal
  • "possible to examine just the low M bits of the result (discarded in the rounding)" --> I would not say so. The bits discarded include the lower `M` bits of the `2M` product and the `M` bits of `c` which may be far to the "right" and contribute to the round. As I see it, the precise sum may have a bit width up to (2*M + |Ae + Be - 2*Ce|) (or something that depends heavily of the exponent difference of AB vs C. (Assume bit[0] is MSBit) The sum's bit[M-1], bit[M] and the "or" of all less significant bits contribute to the round. – chux - Reinstate Monica Feb 11 '17 at 17:24
  • I feel like Snoopy shaking his fist at the [cygwin fma/Red-Barron](https://s-media-cache-ak0.pinimg.com/736x/66/e6/95/66e69592a0f577ceacbdc3e845d617df.jpg) – chux - Reinstate Monica Feb 11 '17 at 17:28
  • @chux: No, don't be fooled by the "infinite precision" analog. We only need to consider the bits that *affect the rounding*. There are six basic cases: ab and c have the same exponents and same signs; different signs; ab has a greater exponent, but the same sign as c; different signs; ab has a smaller exponent than c, but the same sign; and different signs. For the four IEEE-754 rounding modes, we only need to know the most significant bit of the part that is dropped at rounding, and we can do that in each of the six cases with 2M-bit registers. Should I elaborate on this? (Is it useful?) – Nominal Animal Feb 11 '17 at 18:08
  • For default IEEE rounding [ties to even](https://en.wikipedia.org/wiki/Rounding#Round_half_to_even), the MSBit of the dropped part is needed (we agree on that) _and_ also whether any other dropped bits are 1 (on this we appear to disagree). If any of the other dropped bits is 1, then the value is _above_ halfway and rounds away from 0. If all the other dropped bits are 0, the value is a _tie_ and rounds to even. All dropped bits can affect the result in that rounding mode. – chux - Reinstate Monica Feb 11 '17 at 19:29
  • @chux: Yes, I do agree on both, but I do not think it invalidates my proposed approach. Because there is a tiny theoretical possibility of this leading to more efficient `fma()` emulation, I shall elaborate on my logic. :) – Nominal Animal Feb 12 '17 at 05:31
  • @chux: So, my idea did not pan out. However, your note about the effect of the dropped bits on the rounding reminded me that if we pack the 53 most significant bits of the mantissa, the most significant bit of the dropped bits, and an extra bit that is the logical OR of the rest of the dropped bits, we get a 55-bit unsigned integer that gets rounded correctly when converted to a double. This means that a multiprecision approach using a small number (fixed upper limit) of limbs should handle the emulation efficiently. The rounding approach *might* be novel, but I doubt it. – Nominal Animal Feb 12 '17 at 11:42
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/135503/discussion-between-chux-and-nominal-animal). – chux - Reinstate Monica Feb 12 '17 at 14:41