
fma(a,b,c) is equivalent to a*b+c except it doesn't round the intermediate result.

Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding?

It's not obvious, as the rounding after multiplication (which fma avoids) tends to be less problematic than the rounding after addition (which it doesn't avoid).
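To make the difference concrete, here is a minimal self-contained demonstration (my own illustration with made-up values, not from any library); compile with contraction disabled (e.g. -ffp-contract=off on GCC/Clang) so the compiler doesn't fuse the plain expression on its own:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 0x1p-27;   /* 1 + 2^-27 */
    double b = 1.0 - 0x1p-27;   /* 1 - 2^-27 */
    double c = -1.0;
    /* Exact a*b = 1 - 2^-54, which rounds to 1.0 in double, so the
     * two-step version loses the entire answer; fma keeps it. */
    printf("a*b + c    = %g\n", a * b + c);      /* prints 0 */
    printf("fma(a,b,c) = %g\n", fma(a, b, c));   /* prints -2^-54 ~ -5.55e-17 */
    return 0;
}
```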

Z boson
taw

6 Answers


The only thing I've found so far is "error-free transformations". For any floating-point numbers, the rounding errors of a+b, a-b, and a*b are themselves exactly representable as floating-point numbers (in round-to-nearest mode, assuming no overflow/underflow etc.).

The addition (and obviously subtraction) error is easy to compute: if abs(a) >= abs(b), the error is exactly b-((a+b)-a) (2 flops, or 4-5 if we don't know which operand is bigger). The multiplication error is trivial to compute with fma: it is simply fma(a,b,-a*b). Without fma it takes 16 flops of rather nasty code, and a fully generic emulation of correctly rounded fma is even slower than that.
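As a rough sketch of what these error-free transformations look like in code (my own rendering of the textbook TwoSum/TwoProd algorithms, assuming round-to-nearest and no overflow; the function names are mine):

```c
#include <math.h>
#include <stdio.h>

/* a + b = s + err exactly (Knuth's branch-free TwoSum, 6 flops; the
 * 2-flop version mentioned above applies when |a| >= |b| is known). */
static void two_sum(double a, double b, double *s, double *err) {
    *s = a + b;
    double z = *s - a;
    *err = (a - (*s - z)) + (b - z);
}

/* a * b = p + err exactly; fma computes a*b without intermediate
 * rounding, so subtracting the rounded product leaves the exact error. */
static void two_prod(double a, double b, double *p, double *err) {
    *p = a * b;
    *err = fma(a, b, -*p);
}

int main(void) {
    double v, e;
    two_prod(1.0 + 0x1p-30, 1.0 + 0x1p-30, &v, &e);
    printf("product %.17g with exact error %.17g\n", v, e);
    two_sum(1.0, 0x1p-60, &v, &e);
    printf("sum     %.17g with exact error %.17g\n", v, e);
    return 0;
}
```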

An extra 16 flops of error tracking per flop of real computation would be huge overkill, but with just 1-5 pipeline-friendly flops it's quite reasonable. For many algorithms built on these transformations, a 50%-200% overhead of error tracking and compensation gives a result as accurate as if all calculations had been done in twice the number of bits, avoiding ill-conditioning in many cases.

Interestingly, fma isn't ever used in these algorithms to compute results, just to find errors, because finding the error of an fma is as slow as finding the error of a multiplication was without fma.

Relevant keywords to search for would be "compensated Horner scheme" and "compensated dot product", with the Horner scheme benefiting a lot more (a sketch follows below).
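For a flavour of how the pieces combine, here is a sketch of a compensated Horner evaluation in the style of the Graillat/Langlois/Louvet papers (my own rendering; the coefficient layout with c[n] as the leading coefficient is an assumption of the illustration):

```c
#include <math.h>
#include <stddef.h>

/* Evaluate c[n]*x^n + ... + c[1]*x + c[0], tracking the exact rounding
 * error of every product and sum and folding it back in at the end. */
double compensated_horner(const double *c, size_t n, double x) {
    double p = c[n];   /* ordinary Horner value   */
    double e = 0.0;    /* accumulated error terms */
    for (size_t i = n; i-- > 0; ) {
        /* Error-free product: p*x = ph + pe exactly (needs fma). */
        double ph = p * x;
        double pe = fma(p, x, -ph);
        /* Error-free sum: ph + c[i] = s + se exactly (TwoSum). */
        double s  = ph + c[i];
        double z  = s - ph;
        double se = (ph - (s - z)) + (c[i] - z);
        /* Run the exact errors through their own Horner recurrence. */
        e = fma(e, x, pe + se);
        p = s;
    }
    return p + e;  /* roughly as accurate as twice-the-precision Horner */
}
```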

taw
  • I wonder how the hardware cost of FMA on `float` values would compare with the hardware cost of an operation which added the full-precision product of two `float` values to a `double`. By my understanding, the hardware cost of a `double` multiply is more than four times that of an equally fast `float` multiply yielding a full-precision result, and for many operations like dot product it's necessary to maintain intermediate values with more precision than the operands or final result. Using a multiply and fma together might work, but using an f*f+d operation would seem twice as fast. – supercat May 11 '15 at 23:15

taw hit on one important example; more generally, FMA allows library writers to efficiently implement many other floating-point operations with correct rounding.

For example, a platform that has an FMA can use it to implement correctly rounded divide and square root (PPC and Itanium took this approach), which lets the FPU be basically a single-purpose FMA machine. Peter Tang and John Harrison (Intel), and Peter Markstein (HP) have some papers that explain this use if you're curious.
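As a hedged sketch of the core idea in those papers (not a complete correctly rounded divide, which needs a careful final-rounding argument; the function name and calling convention are mine):

```c
#include <math.h>

/* One Newton/Markstein-style refinement step for a ~ b*q.  q0 ~ a/b and
 * y0 ~ 1/b are assumed to come from earlier iterations or a table. */
static double refine_quotient(double a, double b, double q0, double y0) {
    double r = fma(-b, q0, a);  /* residual a - b*q0 with a single rounding */
    return fma(r, y0, q0);      /* corrected quotient                       */
}
```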

The example taw gave is more broadly useful than just in tracking error bounds. It allows you to represent the product of two floating point numbers as a sum of two floating point numbers without any rounding error; this is quite useful in implementing correctly-rounded floating-point library functions. Jean-Michel Muller's book or the papers on crlibm would be good starting places to learn more about these uses.

FMA is also broadly useful in argument reduction in math-library-style routines for certain types of arguments. During argument reduction, the goal of the computation is often a term of the form (x - a*b), where (a*b) is very nearly equal to x itself; in particular, the result is often on the order of the rounding error in the (a*b) term if that term is computed without an FMA. I believe that Muller has also written some about this in his book.
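A hedged sketch of that (x - a*b) pattern for reducing an argument modulo pi/2; the constants and the helper name are illustrative only, and real libraries handle large arguments far more carefully:

```c
#include <math.h>

/* Reduce x to r = x - n*(pi/2) for moderate |x|; n's low two bits give
 * the quadrant.  The name and constants are for illustration only. */
static double reduce_pio2(double x, int *quadrant) {
    static const double two_over_pi = 0.6366197723675814;   /* ~2/pi   */
    static const double pio2_hi = 1.5707963267948966;       /* hi part */
    static const double pio2_lo = 6.123233995736766e-17;    /* lo part */
    double n = nearbyint(x * two_over_pi);
    /* fma(-n, pio2_hi, x) forms x - n*pio2_hi with a single rounding;
     * without fma, the rounding of n*pio2_hi alone would be comparable
     * to the heavily cancelled result. */
    double r = fma(-n, pio2_hi, x);
    r = fma(-n, pio2_lo, r);
    *quadrant = (int)n & 3;
    return r;
}
```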

Stephen Canon

The primary benefit of FMA is that it can be twice as fast. Rather than taking one cycle for the multiply and then one cycle for the add, the FPU can issue both operations in the same cycle. Obviously, most algorithms will benefit from faster operations.

Gabe
  • Question is about the impact of rounding, not about this. Your answer is also incorrect, as fma requires a 3-input floating-point unit instead of the standard 2 inputs, an extra port in the floating-point register file, and wider floating-point adders. This isn't free; it's a trade-off of fma support at the cost of some other hardware. – taw Aug 28 '10 at 13:26
  • taw: You asked which algorithms benefit from FMA, and for some examples where the rounding is a non-trivial benefit. I answered the first part, which is that most algorithms will benefit. – Gabe Aug 28 '10 at 16:17

Some examples: Vector dot products. Fourier transforms. Digital signal processing. Polynomials. All sorts of things.

It's a question of optimization and hardware exploitation more than anything else. A sum of products is a very common requirement in numerical methods, and this way you can give an explicit instruction to the compiler about how to do the thing fast and perhaps with a little more precision. Unless I'm mistaken, the compiler is free to replace a=b*c+d with an FMA instruction, but it's also free not to (unless the standard calls for rounding, but real-world compilers routinely violate standards in small ways).
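For what it's worth, a minimal sketch of the two choices in C99 (the function names are mine; #pragma STDC FP_CONTRACT and GCC/Clang's -ffp-contract flag govern whether the plain expression may be fused):

```c
#include <math.h>

double axpy_maybe_fused(double a, double x, double y) {
    /* The compiler may contract this into one fma (two roundings become
     * one) depending on FP_CONTRACT / -ffp-contract settings. */
    return a * x + y;
}

double axpy_fused(double a, double x, double y) {
    /* Explicit request: always a single rounding, fast where there is
     * hardware FMA, possibly a slow library call where there isn't. */
    return fma(a, x, y);
}
```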

Ian
  • The compiler can't legally replace b*c+d with an FMA unless you specifically tell the compiler that it is OK (with -ffast-math or something similar), because it does perturb results. – Stephen Lin Aug 14 '13 at 17:11
  • @StephenLin: Assuming that the evaluation of `b`, `c`, and `d` does not mutate state or have other side effects, how can such a hardware optimization "perturb results"? – stakx - no longer contributing Jul 16 '14 at 06:29
  • @stakx: Many of the composite instructions in a floating-point instruction set are there because the rounding error would swamp the result. Example: if you take e^(close-to-zero) the result is close to one, but that limits your precision greatly. If you have one instruction representing e^epsilon-1, then the hardware can give much greater precision. Any given high-level language can be defined as to offer access to the more precise instruction or to rewrite the expression tree under recognizable circumstances. The former is more predictable. – Ian Jul 23 '14 at 04:57
  • @stakx FMA(b,c,d) only has one rounding step whereas b*c+d has two, so in any case where b*c is not exactly representable as a floating-point number, the two results will be different. – Stephen Lin Aug 17 '14 at 06:46

It is explained pretty well on the Wikipedia entry for FMA that algorithms which involve the accumulation of products benefit most from using FMA:

A fast FMA can speed up and improve the accuracy of 
many computations that involve the accumulation of products:

 * Dot product
 * Matrix multiplication
 * Polynomial evaluation (e.g., with Horner's rule)
 * Newton's method for evaluating functions.
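As a simple illustration of the accumulation-of-products case from that list (a plain sketch, ignoring vectorization and blocked summation), accumulating a dot product through fma performs one rounding per term instead of two:

```c
#include <math.h>
#include <stddef.h>

double dot_fma(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum = fma(a[i], b[i], sum);   /* one rounding per term */
    return sum;
}
```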
syntagma

Off the top of my head: matrix multiplication, Newton's method, polynomial evaluation, numerical methods.

WOPR