
According to the documentation, there is a fma() function in math.h. That is very nice, and I know how FMA works and what to use it for. However, I am not so certain how this is implemented in practice. I'm mostly interested in the x86 and x86_64 architectures.

Is there a floating-point (non-vector) instruction for FMA, perhaps as defined by IEEE-754 2008?

Is the FMA3 or FMA4 instruction used?

Is there an intrinsic to make sure that a real FMA is used, when the precision is relied upon?

  • On x86 and x86_64, gcc emits fma instructions if it's told it's allowed to (`-mfma` or `-mfma4` or `-march=something` where `something` is an FMA-supporting processor). On Linux, you might look at `sysdeps/ieee754/dbl-64/s_fma.c` in glibc to get an idea of what the library function fallback looks like. – tmyklebu Feb 20 '15 at 15:17

3 Answers


The actual implementation varies from platform to platform, but speaking very broadly:

  • If you tell your compiler to target a machine with hardware FMA instructions (PowerPC, ARM with VFPv4 or AArch64, Intel Haswell or AMD Bulldozer and onwards), the compiler may replace calls to fma( ) by just dropping the appropriate instruction into your code. This is not guaranteed, but is generally good practice. Otherwise you will get a call to the math library, and:

  • When running on a processor that has hardware FMA, those instructions should be used to implement the function. However, if you have an older version of your operating system, or an older version of the math library, it may not take advantage of those instructions.

  • If you are running on a processor that does not have hardware FMA, or you are using an older (or just not very good) math library, then a software implementation of FMA will be used instead. This might be implemented using clever extended-precision floating-point tricks, or with integer arithmetic.

  • The result of the fma( ) function should always be correctly rounded (i.e. a "real fma"). If it is not, that's a bug in your system's math library. Unfortunately, fma( ) is one of the more difficult math library functions to implement correctly, so many implementations have bugs. Please report them to your library vendor so they get fixed!
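
As a small self-contained illustration of what correct rounding means here (the constants are arbitrary values chosen so that the two results visibly differ; compile with something like gcc -O2 test.c -lm, with or without -mfma):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 0x1p-30;           /* exactly representable */
    double separate = a * a - 1.0;      /* the product is rounded before the subtraction */
    double fused    = fma(a, a, -1.0);  /* one rounding of the exact a*a - 1 */
    /* The exact product is 1 + 2^-29 + 2^-60; the separate multiply drops the
       2^-60 term, while a correctly rounded fma keeps it, so the two differ. */
    printf("separate: %a\nfused:    %a\n", separate, fused);
    return 0;
}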

Is there an intrinsic to make sure that a real FMA is used, when the precision is relied upon?

Given a good compiler, this shouldn't be necessary; it should suffice to use the fma( ) function and tell the compiler what architecture you are targeting. However, compilers are not perfect, so you may need to use the _mm_fmadd_sd( ) and related intrinsics on x86 (but report the bug to your compiler vendor!)
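
For example, a minimal sketch of going through the intrinsic directly (this assumes an FMA3-capable target and a flag such as -mfma; the wrapper name is made up for illustration):

#include <immintrin.h>

/* Force a fused a*b + c through the scalar FMA3 intrinsic instead of relying
   on the compiler to lower fma() to the instruction. */
double fma_via_intrinsic(double a, double b, double c) {
    __m128d va = _mm_set_sd(a), vb = _mm_set_sd(b), vc = _mm_set_sd(c);
    return _mm_cvtsd_f64(_mm_fmadd_sd(va, vb, vc));
}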

Stephen Canon
  • “An opportunity to explain round-to-odd is like the tour de France: one waits for it for a long time and it passes quickly.” – Pascal Cuoq Feb 20 '15 at 15:00
  • @PascalCuoq IEEE-754 uses round to even by default, if I'm not mistaken. Why is round to odd relevant in this context? I'm currently implementing a multiple-precision library so I'm a little bit familiar with the inner workings, but I've not heard of round to odd being particularly important. Very poetic btw, well done! – the swine Feb 20 '15 at 17:18
  • @theswine If you have a format with twice the width of the FMA you aim for, you can do the multiplication without error. Say that you are implementing `fmaf` with a double-precision `double`. You are left with the problem of adding a `double` value of `(double)a*(double)b` and a `float c` and round this addition to the nearest `float`. This operation is not generally available but can be implemented as a `double` addition in round-to-odd followed by rounding from `double` to `float` in round-to-nearest. Not using round-to-odd for the intermediate result causes double-rounding issues. – Pascal Cuoq Feb 20 '15 at 17:23
  • @theswine See it in code: http://permalink.gmane.org/gmane.comp.lib.glibc.alpha/15546 – Pascal Cuoq Feb 20 '15 at 17:26
  • @PascalCuoq Oh, I see. That is rather elegant. If I was implementing fma in software, I would go in the way of calculating the approximate result and the correction, as an expansion. This can then be rounded to the nearest representable number. It has the advantage that it can be done with double as well (without the need for long double support), but for floats it would be likely slower in terms of operation count. However, changing the x86 rounding mode is quite slow, so in the end it might be comparable to double and round to odd – the swine Feb 20 '15 at 17:29
  • @PascalCuoq did you write that patch by any chance? – the swine Feb 20 '15 at 17:29
  • I didn't write the patch I linked to, but I did write a correct (AFAICT) `fmaf` using the naïve approach: (assuming a, b, c ≥ 0) http://ideone.com/kx7MXE . And you should also look at this implementation if you are interested in the subject: http://www.opensource.apple.com/source/Libm/Libm-315/Source/Intel/xmm_fma.c – Pascal Cuoq Feb 20 '15 at 17:33
  • @PascalCuoq are you sure your `fmaf` is correct? If I understand it correctly, you are multiplying / adding exactly representable integers, yielding an exactly representable result. While that tests that it indeed calculates `a + b * c`, I believe it does not test the rounding, not to mention overflow handling. I could be wrong, though, and I'm certainly not claiming the routine is flawed - just the unit test. – the swine Feb 21 '15 at 14:00
  • @theswine I think both the implementation and the test are pretty good for a comment in reply to a tangential remark on a blog post (the context in which this function was written). I do not understand what you are saying about rounding. Are you saying that there is no rounding in `float truefma = (long long) a * b + c;`? – Pascal Cuoq Feb 21 '15 at 14:15
  • @theswine I admit that the test is unnecessarily limited: all these masks with `0xfffff` should be masks with `0xffffff`. I will see if ideone lets me change it. – Pascal Cuoq Feb 21 '15 at 14:28
  • @PascalCuoq sorry, I did not mean to criticize and I think your code is great, but you wrote it for some purpose, not because of me. And I'm grateful for a bunch of interesting papers about round-to-odd. I was just wondering if the test could be improved somehow. I think if you multiply slightly longer numbers as int64, and then shift to 23 fraction bits (since it is float), you will be able to simulate and check the rounding. – the swine Feb 22 '15 at 11:23
  • @theswine rounding in the conversion of `a`, `b` or `c` to `float` is unnecessary and indeed a nuisance that would have to be controlled if it happened. As long as there is rounding in the conversion of the `long long` value `(long long) a * b + c` to `float` (and such a conversion happens implicitly in the assignment `float truefma = …`), the test is testing the `myfma`'s behavior with respect to rounding. Rounding happens during this assignment because of the values of the various masks applied to the results of `rand()`. Therefore, the test is testing `myfma`'s behavior with rounding. – Pascal Cuoq Feb 22 '15 at 11:38
  • @theswine You understand what `(rand() & 0xFFFFFF) << (rand() & 7)` and `(long long)(rand() & 0xFFFFFF) << (rand() & 31)` do, right? – Pascal Cuoq Feb 22 '15 at 11:42
  • @PascalCuoq It generates two (up to) 24-bit numbers, which are both exactly representable as float (one implicit bit + 23 fraction ones). Their product is going to be twice the width so it would round off and the sum likewise. I see, I thought you would need to explicitly round the integer to nearest even, I did not realize it happens automatically in the conversion to float - I expected to see some bit operations there. It will not check if FPU mode is e.g. round towards zero (but then again, your routine would likely not work anyway in such case, you just never know if it happens). – the swine Feb 22 '15 at 13:44
  • @PascalCuoq However, you may want to make sure that RAND_MAX is a power of two greater than `0xffffff`, on some platforms it is not. Apologies for lengthy comments. If you just hate the discussion at this point, we don't have to continue. I'm just trying to contribute. – the swine Feb 22 '15 at 13:47
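
To make the double-rounding issue from the comment thread above concrete: the obvious way to build fmaf on top of double looks like the sketch below, and it is not a correct FMA precisely because the sum is rounded twice (once to double, once to float), which is what the round-to-odd trick avoids.

float fmaf_naive(float a, float b, float c) {
    /* exact: a 24-bit by 24-bit product always fits in double's 53-bit significand */
    double product = (double)a * (double)b;
    /* two roundings: to double here, then to float on return, so the result
       can differ from a single correctly rounded fmaf() */
    return (float)(product + (double)c);
}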

One way to implement FMA in software is by splitting the significand into high and low bits. I use Dekker's algorithm

typedef struct { float hi; float lo; } doublefloat;  
doublefloat split(float a) {
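    /* (1<<12)+1 = 4097 is the Veltkamp splitting constant for a 24-bit significand */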
    float t = ((1<<12)+1)*a;
    float hi = t - (t - a);
    float lo = a - hi;
    return (doublefloat){hi, lo};
}

Once you split the float you can calculate a*b-c with a single rounding like this

float fmsub(float a, float b, float c) {
    doublefloat as = split(a), bs = split(b);
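    /* sum the four partial products of (hi+lo)*(hi+lo), folding -c into the largest term first */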
    return ((as.hi*bs.hi - c) + as.hi*bs.lo + as.lo*bs.hi) + as.lo*bs.lo;
}

This basically subtracts c from (ahi,alo)*(bhi,blo) = (ahi*bhi + ahi*blo + alo*bhi + alo*blo).

I got this idea from the twoProd function in the paper Extended-Precision Floating-Point Numbers for GPU Computation and from the mul_sub_x function in Agner Fog's vector class library. He uses a different splitting function for vectors of floats. I tried to reproduce a scalar version of it here

typedef union {float f; int i;} u;
doublefloat split2(float a) {
    u lo, hi = {a};
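    /* zero the low 12 mantissa bits so hi keeps only the upper bits of the significand */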
    hi.i &= -(1<<12);
    lo.f = a - hi.f;
    return (doublefloat){hi.f,lo.f};
}

In any case using split or split2 in fmsub agrees well with fma(a,b,-c) from the math library in glibc. For whatever reason my version is significantly faster than fma except on a machine that has hardware fma (in which case I use _mm_fmsub_ss anyway).
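
For what it's worth, one quick way to check that agreement is to compare fmsub against the correctly rounded fmaf on random inputs. This is just a sketch: it assumes the split/fmsub definitions above are in scope, that float arithmetic really happens in single precision (SSE rather than x87), and the randf helper is only there to build 24 random significand bits without assuming RAND_MAX is large (a concern raised in the comments above).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static float randf(void) {
    int bits = ((rand() & 0xFFF) << 12) | (rand() & 0xFFF);  /* 24 random bits */
    return bits / 16777216.0f + 0.5f;                        /* value in [0.5, 1.5) */
}

int main(void) {
    srand(1);
    int mismatches = 0;
    for (int i = 0; i < 1000000; i++) {
        float a = randf(), b = randf(), c = randf();
        if (fmsub(a, b, c) != fmaf(a, b, -c))
            mismatches++;
    }
    printf("%d mismatches in 1000000 trials\n", mismatches);
    return 0;
}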

Z boson
  • Nice references. I'm aware of Shewchuk's and Priest's work. In this question I was more interested in what instructions there are in the current instruction sets. I guess `_mm_fmadd_ss` pretty much sums it up. – the swine May 12 '15 at 12:49
  • Your version might be faster since it does not handle special numbers (especially infinities). I might be wrong but it seems that multiply / add with infinity will cause Dekker's algorithm to generate NaNs. I'd expect the runtime to behave correctly there, hence the speed penalty. – the swine May 12 '15 at 12:51
  • There is far more than `_mm_fmadd_ss` for the x86 set (and `_mm_fmadd_ps` is more interesting to me anyway) if you want to see all of them go to [IntrinsicsGuide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) and under Technologies select FMA. – Z boson May 12 '15 at 13:11
  • @theswine, good point about the special numbers. That may explain the speed penalty with `fma` for glibc. – Z boson May 13 '15 at 08:18
  • What about computing `a*b+c`? – plasmacel Jun 19 '18 at 18:44
  • @plasmacel, I think you change `(as.hi*bs.hi - c)` to `(as.hi*bs.hi + c)`. – Z boson Jun 20 '18 at 10:17

Z boson's FMA suggestion based on Dekker's algorithm is unfortunately incorrect. Unlike in Dekker's twoProduct, in the more general FMA case the magnitude of c is not known relative to the product terms, and hence wrong cancellations can occur.

So, while Dekker's twoProduct can be greatly accelerated with a hardware FMA, the error term computation of Dekker's twoProduct is not a robust FMA implementation.

A correct implementation would need to either use a summation algorithm with higher than double precision, or add the terms in decreasing order of magnitude.

aki
  • Note that he is doing `fmsub`. Assuming that the quantities are positive, I'd say his implementation works. Anyway, a bright comment from someone with 11 xp, good work. – the swine Jan 14 '17 at 00:27
  • Yeah, no, you're right. If `c` is very small then it is swamped by roundoff when subtracted from `ahi*bhi` and it does not help at all. He'd need to form a longer expansion and start adding from the smallest element using essentially what is known as Kahan's summation. Even though the result is rounded to float, this ordering still matters as it can affect the roundoff direction. – the swine Jan 14 '17 at 00:30
  • I wrote a quick remark about the Kahan summation not quite being sufficient here, and then realized you really meant doing _both_, sorting the input by magnitude and then adding with Kahan summation. I completely agree that the combination will produce a correctly rounded FMA result. – aki Jan 17 '17 at 23:01