
I came across this page and found that there are odd floating-point multiply-add functions, fma and fmaf. It says that the result is something like:

 (x * y) + z             #fma(x,y,z)

The value is computed to infinite precision and rounded once to the result format.

However, AFAICT I've never seen such a ternary operation before, so I'm wondering what the customary usage of this function is.

Hongxu Chen

  • 'MAC' is frequently used in digital signal processing: http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation – ArjunShankar Nov 08 '12 at 15:28

2 Answers


The important aspect of the fused multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction. It helps because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when that level of precision is really what the programmer is after.

Example: comparing a * b to 1.0

Suppose that it is crucial to an algorithm to determine where the product of two double-precision numbers a and b lies with respect to a nonzero constant (we'll use 1.0). The numbers a and b both have significands that use all of their binary digits. If you compute a*b as a double, the result may be 1.0, but that does not tell you whether the actual mathematical product was slightly below 1.0 and rounded up to exactly 1.0, or slightly above 1.0 and rounded down. Without FMA, your options are:

  1. Compute a*b as a quad-precision number. Quad precision is not implemented in hardware, but there are software emulation libraries. In quad precision, the mathematical result of the product is exactly representable and you can then compare it to 1.0.

  2. Compute a*b in double precision in round-upward mode and in round-downward mode. If both results are 1.0, a*b is exactly 1.0. If RU(a * b) is greater than 1.0, the mathematical product is higher than 1.0, and if RD(a * b) is below 1.0, the mathematical product is lower than 1.0. On most processors, this approach means changing the rounding mode three times, and each change is expensive (it involves flushing the CPU pipeline). A sketch of this option appears after this list.
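
For illustration, here is a minimal C99 sketch of option 2 (the function name product_vs_one is made up; real code must also make sure the compiler honors the rounding-mode changes, hence the pragma and the volatile variables):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Bracket the exact product with directed rounding.
   Returns -1, 0 or 1 according to the position of a*b relative to 1.0. */
int product_vs_one(double a, double b)
{
    int saved = fegetround();
    volatile double ru, rd;
    fesetround(FE_UPWARD);
    ru = a * b;               /* smallest double >= the exact product */
    fesetround(FE_DOWNWARD);
    rd = a * b;               /* largest double <= the exact product */
    fesetround(saved);        /* restore the caller's rounding mode */
    if (ru > 1.0) return 1;   /* exact product is above 1.0 */
    if (rd < 1.0) return -1;  /* exact product is below 1.0 */
    return 0;                 /* ru == rd == 1.0: exactly 1.0 */
}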

With an FMA instruction, one can compute fma(a, b, -1.0) and compare the result to 0.0. Since floating-point numbers are denser around zero, and since the intermediate product is not rounded in the computation, we can be certain that fma(a, b, -1.0) > 0 means the mathematical product of a and b is greater than 1.0, and so on.
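
The same test as a minimal C sketch (the function name is again made up; link with -lm on POSIX systems):

#include <math.h>

/* Sign of the exact mathematical product a*b relative to 1.0. */
int product_vs_one_fma(double a, double b)
{
    double r = fma(a, b, -1.0); /* exact a*b - 1.0, rounded only once */
    if (r > 0.0) return 1;      /* mathematical product above 1.0 */
    if (r < 0.0) return -1;     /* mathematical product below 1.0 */
    return 0;                   /* mathematical product is exactly 1.0 */
}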

Example: Veltkamp/Dekker multiplication

The double-double format is an efficient representation of numbers as the sum of two double-precision floating-point numbers. It is nearly as precise as quad-precision but takes advantage of existing double-precision hardware.

Consider a function Mul12(a, b) that takes two double-precision numbers a and b and computes their product as a double-double number. An algorithm due to Veltkamp and Dekker computes this function with only double-precision additions and multiplications (reference). It takes 6 multiplications (one inside each Split() plus four in the main body of the algorithm) and plenty of additions.

If an FMA instruction is available, Mul12 can be implemented in two operations: one multiplication and one FMA.

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */
/* now the real product of a and b is available as the sum of high and low */
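
Wrapped up as a self-contained example (the wrapper name mul12 and the test value are mine; 1 + 2^-27 is chosen so that the exact square does not fit in 53 bits):

#include <math.h>
#include <stdio.h>

/* The two-operation Mul12 from above, as a function. */
static void mul12(double a, double b, double *high, double *low)
{
    *high = a * b;             /* double-precision approximation */
    *low = fma(a, b, -*high);  /* exact rounding error of a * b */
}

int main(void)
{
    double high, low;
    double a = 1.0 + 0x1p-27;  /* 1 + 2^-27 */
    mul12(a, a, &high, &low);
    printf("high = %.17g\n", high);
    printf("low  = %.17g\n", low); /* 2^-54 here, lost by a plain a * a */
    return 0;
}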

More examples

Examples where FMA is used for its precision, and not only as an instruction that does a multiplication and an addition, are the computation of square root and division. According to the IEEE 754 standard, these operations have to be correctly rounded (the result must be the floating-point number nearest the mathematical result), and both can be implemented efficiently when a hardware FMA instruction is available. This aspect is typically hidden by the compilation chain: the IA-64 instruction set (Itanium) did not have an instruction for division; instead, correctly rounded division was obtained by a sequence of instructions (typically generated by the compiler) involving FMA.
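
To give the flavor of these sequences, here is one refinement step in the style used on such processors (the function name and parameters are mine; this is only a sketch, not the full correctly rounded sequence, which takes several such steps and a careful final rounding):

#include <math.h>

/* One FMA-based refinement step for division: given a quotient
   estimate q close to a/b and a reciprocal estimate y close to 1/b,
   produce an improved quotient. */
double refine_quotient(double a, double b, double q, double y)
{
    double r = fma(-q, b, a);  /* residual a - q*b, rounded only once */
    return fma(r, y, q);       /* corrected quotient q + r*y */
}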

Pascal Cuoq
  • @Zboson Thanks, I have replaced the reference by another one that looks right after a quick scan. – Pascal Cuoq Mar 24 '15 at 12:07
  • I did not really understand the virtues of FMA until your answer. The performance benefit from FMA is not as great as I had expected, but that is because I was not interested in additional precision. In cases where additional precision matters, such as in implementing `double-double`, I can see it is very helpful. Thanks! – Z boson Mar 24 '15 at 12:17
  • [this](https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble) may interest you. – Z boson Jun 01 '15 at 12:27
  • In your second example, why can't fma(a, b, 0) simply be returned? – Royce Jan 02 '16 at 21:42
  • @Royce Because the end result of the fused multiply-add is a double-precision floating-point number. Only the intermediate result has (virtually) infinite precision (note: “virtually” because in practice, since it is known that this intermediate result is only intended to be added to a double to return a double, not all bits may be represented at the same time in the processor). If you compute `fma(a, b, 0)`, you just get the double nearest the mathematical product a*b (same as if you had computed `a*b`). To see the digits that come after these, cancel the first ones with `fma(a, b, -a*b)`. – Pascal Cuoq Jan 02 '16 at 23:53

It's usually used as an optimization. Most floating-point units have an fma instruction, so the calculation can be performed in a single instruction instead of two or more. So for performance-critical floating-point code, it's a helpful function to have.
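
For instance, with GCC or Clang on an x86-64 target that has FMA, compiling something like the function below with cc -O2 -mfma -ffp-contract=fast will typically contract it into a single fused instruction (exact flags and behavior vary by compiler and target):

/* A candidate for contraction into one FMA instruction. */
double mac(double a, double b, double c)
{
    return a * b + c;
}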

Adam Rosenfield
  • The “one instruction instead of two” aspect is the least important aspect of FMA. The number of cycles an FMA instruction takes, when it exists, does not have to be less than the number of cycles taken by a multiplication and an addition. It is at the very least as expensive as the multiplication, with a somewhat negligible addition that may or may not be thrown in for free. I'll write an answer. – Pascal Cuoq Aug 14 '13 at 18:13