Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

A fused multiply-add (FMA, sometimes called a fused multiply-accumulate) computes a multiplication followed by an addition or subtraction as a single operation, with only one rounding at the end.

For example:

x = a * b + c

Without fused multiply-add, this would normally be computed with two roundings: one after a * b and one after adding c.

Fused multiply-add combines the two operations into one: the exact product a * b is added to c and only the final sum is rounded, which usually makes the computed result more accurate.

Supported architectures include:

PowerPC
Intel x86 (via the FMA3 instruction set, Haswell and later)
AMD x86 (via the FMA4 instruction set on Bulldozer-family CPUs, and FMA3 from Piledriver onward)

82 questions


2 answers

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how to do this best in code and I also want to know how it's done internally in the…

asked Apr 10 '13 at 18:02

user2088790


1 answer

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors: float triad(float *x, float *y, float *z, const int n) { float k = 3.14159f; for(int i=0; i…

c memory assembly nasm fma

asked Sep 17 '14 at 20:03

Z boson


2 answers

Significant FMA performance anomaly experienced in the Intel Broadwell processor

Code1: vzeroall mov rcx, 1000000 startLabel1: vfmadd231ps ymm0, ymm0, ymm0 vfmadd231ps ymm1, ymm1, ymm1 vfmadd231ps ymm2, ymm2, ymm2 vfmadd231ps ymm3, ymm3, ymm3 vfmadd231ps ymm4, ymm4, ymm4 vfmadd231ps ymm5,…

performance assembly x86 intel fma

asked Dec 16 '15 at 10:35

User9973


2 answers

FMA3 in GCC: how to enable

I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code in GCC 4.8.1 on Linux which I wrote. Below is a list of three different ways I compile. SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp AVX:…

c++ gcc intel avx fma

asked Jan 08 '14 at 16:37

Z boson


1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…

c++ simd avx2 dot-product fma

asked Dec 27 '19 at 00:23

cyrusbehr


5 answers

How to get data out of AVX registers?

Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrinsics would make this rather…

c++ visual-c++ avx fma

asked Jun 03 '16 at 10:51

MSalters


6 answers

Which algorithms benefit most from fused multiply add?

fma(a,b,c) is equivalent to a*b+c except it doesn't round intermediate result. Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding? It's not obvious, as rounding after multiplications which we avoid…

floating-point fma

asked Aug 28 '10 at 04:04

taw


2 answers

Fused multiply add and default rounding modes

With GCC 5.3 the following code compiled with -O3 -mfma: float mul_add(float a, float b, float c) { return a*b + c; } produces the following assembly: vfmadd132ss %xmm1, %xmm2, %xmm0 ret. I noticed GCC doing this with -O3 already in GCC…

c gcc clang ieee-754 fma

asked Dec 23 '15 at 12:57

Z boson


3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…

floating-point x86 simd avx2 fma

asked Dec 30 '16 at 22:54

BeeOnRope


2 answers

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd? To make things simple, what is the difference between…

assembly x86 simd instruction-set fma

asked Apr 03 '16 at 21:57

Zheyuan Li


2 answers

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

How can I disable auto-vectorization with AVX and FMA instructions? I would still prefer the compiler to employ SSE and SSE2 automatically, but not FMA and AVX. My code that uses AVX checks for its availability, but GCC doesn't do it when…

c++ gcc vectorization avx fma

asked Sep 18 '13 at 09:14

Violet Giraffe


2 answers

Automatically generate FMA instructions in MSVC

MSVC has supported AVX/AVX2 instructions for years now and, according to this msdn blog post, it can automatically generate fused-multiply-add (FMA) instructions. Yet neither of the following functions compiles to an FMA instruction: float func1(float x,…

c++ visual-c++ x86 avx fma

asked Dec 14 '15 at 11:32

plasmacel


3 answers

Optimize for fast multiplication but slow addition: FMA and doubledouble

When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this: int n = 0; for(int32_t i=0; i…

assembly x86 floating-point fma double-double-arithmetic

asked Jun 01 '15 at 12:25

Z boson


1 answer

Do FMA (fused multiply-add) instructions always produce the same result as a mul then add instruction?

I have this assembly (AT&T syntax): mulsd %xmm0, %xmm1 addsd %xmm1, %xmm2 I want to replace it with: vfmadd231sd %xmm0, %xmm1, %xmm2 Will this transformation always leave equivalent state in all involved registers and flags? Or will the result…

assembly floating-point x86 fma

asked Mar 16 '15 at 20:29

Daryl


2 answers

Is floating point expression contraction allowed in C++?

Floating point expressions can sometimes be contracted on the processing hardware, e.g. using fused multiply-and-add as a single hardware operation. Apparently, using these isn't merely an implementation detail but governed by programming…

c++ floating-point fma

asked Mar 14 '18 at 12:44

einpoklum
