Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

The Fused Multiply Add (also known as Multiply Accumulate) operation performs a multiplication followed by an addition or subtraction as a single operation, with only one rounding at the end.

For example:

x = a * b + c

Without Fused Multiply Add, this would normally be done using two roundings: one after a * b and one after a * b + c.

Fused Multiply Add combines the two operations into a single operation, which removes the intermediate rounding and thereby increases the accuracy of the computed result.

Supported Architectures include:

  • PowerPC
  • Intel x86 (via FMA3 instruction set, Haswell and later)
  • AMD x86 (via FMA4 instruction set on Bulldozer-family CPUs, and FMA3 from Piledriver onward)
82 questions
48
votes
2 answers

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how to do this best in code and I also want to know how it's done internally in the…
user2088790
47
votes
1 answer

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors float triad(float *x, float *y, float *z, const int n) { float k = 3.14159f; for(int i=0; i
Z boson
  • 32,619
  • 11
  • 123
  • 226
36
votes
2 answers

Significant FMA performance anomaly experienced in the Intel Broadwell processor

Code1: vzeroall mov rcx, 1000000 startLabel1: vfmadd231ps ymm0, ymm0, ymm0 vfmadd231ps ymm1, ymm1, ymm1 vfmadd231ps ymm2, ymm2, ymm2 vfmadd231ps ymm3, ymm3, ymm3 vfmadd231ps ymm4, ymm4, ymm4 vfmadd231ps ymm5,…
User9973
  • 559
  • 5
  • 8
23
votes
2 answers

FMA3 in GCC: how to enable

I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code in GCC 4.8.1 on Linux which I wrote. Below is a list of three different ways I compile. SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp AVX:…
Z boson
  • 32,619
  • 11
  • 123
  • 226
19
votes
1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…
cyrusbehr
  • 1,100
  • 1
  • 12
  • 32
16
votes
5 answers

How to get data out of AVX registers?

Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrinsics would make this rather…
MSalters
  • 173,980
  • 10
  • 155
  • 350
16
votes
6 answers

Which algorithms benefit most from fused multiply add?

fma(a,b,c) is equivalent to a*b+c except it doesn't round the intermediate result. Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding? It's not obvious, as rounding after multiplications which we avoid…
taw
  • 18,110
  • 15
  • 57
  • 76
16
votes
2 answers

Fused multiply add and default rounding modes

With GCC 5.3 the following code compiled with -O3 -mfma float mul_add(float a, float b, float c) { return a*b + c; } produces the following assembly vfmadd132ss %xmm1, %xmm2, %xmm0 ret I noticed GCC doing this with -O3 already in GCC…
Z boson
  • 32,619
  • 11
  • 123
  • 226
15
votes
3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
14
votes
2 answers

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd? To make things simple, what is the difference between…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
11
votes
2 answers

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

How can I disable auto-vectorization with AVX and FMA instructions? I would still prefer the compiler to employ SSE and SSE2 automatically, but not FMA and AVX. My code that uses AVX checks for its availability, but GCC doesn't do it when…
Violet Giraffe
  • 32,368
  • 48
  • 194
  • 335
10
votes
2 answers

Automatically generate FMA instructions in MSVC

MSVC has supported AVX/AVX2 instructions for years now and, according to this msdn blog post, it can automatically generate fused-multiply-add (FMA) instructions. Yet neither of the following functions compiles to an FMA instruction: float func1(float x,…
plasmacel
  • 8,183
  • 7
  • 53
  • 101
10
votes
3 answers

Optimize for fast multiplication but slow addition: FMA and doubledouble

When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this: int n = 0; for(int32_t i=0; i
Z boson
  • 32,619
  • 11
  • 123
  • 226
10
votes
1 answer

Do FMA (fused multiply-add) instructions always produce the same result as a mul then add instruction?

I have this assembly (AT&T syntax): mulsd %xmm0, %xmm1 addsd %xmm1, %xmm2 I want to replace it with: vfmadd231sd %xmm0, %xmm1, %xmm2 Will this transformation always leave equivalent state in all involved registers and flags? Or will the result…
Daryl
  • 3,253
  • 4
  • 29
  • 39
9
votes
2 answers

Is floating point expression contraction allowed in C++?

Floating point expressions can sometimes be contracted on the processing hardware, e.g. using fused multiply-and-add as a single hardware operation. Apparently, using these isn't merely an implementation detail but governed by programming…
einpoklum
  • 118,144
  • 57
  • 340
  • 684