
I'm trying to implement the convolution algorithm described in this paper. The authors state that the number of independent elements processed by FMA instructions is lower-bounded by the latency of the FMA instruction and upper-bounded by the number and width of the vector registers, as follows:

N_vec * N_fma * L_fma < X < N_reg * N_vec

Where:

  • N_vec: Number of elements contained in a vector register
  • N_reg: Number of vector registers
  • N_fma: Number of FMA units
  • L_fma: Latency of one FMA instruction
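
To make the role of X concrete, here is a minimal sketch (my own illustration with hypothetical names, not code from the paper) of a register-blocked FMA loop: X = N_ACC * N_vec independent single-precision elements are kept in N_ACC YMM accumulators, each forming its own dependency chain. The lower bound says N_ACC must cover N_fma * L_fma in-flight FMAs so the FMA units never stall on a previous result; the upper bound says the accumulators must fit in the N_reg architectural registers.

```c
#include <immintrin.h>

/* Hypothetical kernel: X = N_ACC * 8 independent elements in flight.
 * Lower bound: N_ACC >= N_fma * L_fma = 2 * 4 = 8   (X >= 64)
 * Upper bound: N_ACC <= N_reg         = 16          (X <= 128)
 * Assumes n is a multiple of N_ACC * 8; compile with -mavx2 -mfma. */
#define N_ACC 8

void fma_block(const float *a, const float *b, float *out, int n)
{
    __m256 acc[N_ACC];
    for (int k = 0; k < N_ACC; k++)
        acc[k] = _mm256_loadu_ps(out + 8 * k);   /* accumulate into the output block */

    for (int i = 0; i < n; i += N_ACC * 8) {
        for (int k = 0; k < N_ACC; k++) {
            __m256 va = _mm256_loadu_ps(a + i + 8 * k);
            __m256 vb = _mm256_loadu_ps(b + i + 8 * k);
            /* each acc[k] is an independent dependency chain */
            acc[k] = _mm256_fmadd_ps(va, vb, acc[k]);
        }
    }

    for (int k = 0; k < N_ACC; k++)
        _mm256_storeu_ps(out + 8 * k, acc[k]);
}
```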

I'm using an Intel Core i7-10510U, and I set the parameters as follows:

| Name  | Value |
|-------|-------|
| N_vec | 8     |
| N_reg | 16    |
| N_fma | 2     |
| L_fma | 4     |

This is motivated by the following reasoning: I'm targeting 256-bit YMM registers, of which AVX2 provides 16 (N_reg = 16). With 4-byte single-precision floats, each register holds 256/32 = 8 elements (N_vec = 8). My question is about the latency and the number of FMA units. From the Intel Intrinsics Guide I see that on my architecture (i.e., Skylake) the _mm256_fmadd_ps instruction has a throughput of 0.5 cycles/instruction and a latency of 4 cycles. For this reason I assumed there are 2 FMA units (N_fma = 2, L_fma = 4).
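
One common way to sanity-check L_fma and N_fma (a rough sketch of the usual microbenchmark idea, not something taken from the paper) is to time a loop in which every FMA depends on the previous one, and compare it with a variant using many independent accumulators. On Skylake the dependent loop should settle at about 4 cycles per FMA (latency-bound, L_fma = 4), while the independent version approaches 0.5 cycles per FMA (throughput-bound), which is only possible with 2 FMA units (N_fma = 2).

```c
#include <immintrin.h>

/* Latency-bound loop: each FMA must wait for the previous result, so it
 * runs at ~L_fma cycles per iteration (~4 on Skylake). Timing (e.g. with
 * __rdtsc) and preventing the compiler from removing the loop are omitted
 * for brevity; compile with -mavx2 -mfma. */
float fma_dependent_chain(long iters)
{
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 a = _mm256_set1_ps(1.000001f);
    const __m256 b = _mm256_set1_ps(1e-7f);
    for (long i = 0; i < iters; i++)
        acc = _mm256_fmadd_ps(a, acc, b);   /* serial dependency through acc */
    return _mm_cvtss_f32(_mm256_castps256_ps128(acc));
}
```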

By doing so I obtain the following bounds for X:

(8 * 2 * 4) < X < (16 * 8) => 64 < X < 128

However, by running some experiments, I see that the execution time is shorter when I use X=256 or X=512.

Am I getting any of the above parameters wrong, especially N_fma and L_fma?

  • How much shorter is your execution time? Using only just barely the number of parallel operations to hide FP latency means any imperfect scheduling loses a cycle on a dependency chain, and the latency bottleneck means you can't catch back up. See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) for some experiments with number of accumulators for an FP dot-product on Skylake. – Peter Cordes Apr 11 '22 at 17:04
  • Note also that saturating FMA leaves for example only one port for jump & branch. Any extraneous code at all can easily knock things sideways. – Useless Apr 11 '22 at 17:07
  • @PeterCordes Thanks a lot for the useful answer and references. The execution time in such cases is about 10% shorter. After reading your references I removed the constraints on X and applied 8-way loop unrolling (the Intel compiler applies 4-way unrolling by itself). Now performance is much better, but this contradicts what is said in the [paper](https://arxiv.org/pdf/1809.10170.pdf). – Mirco Mannino Apr 12 '22 at 07:14
  • Tuning source-code is compiler-dependent. I haven't read that paper, but maybe they were experimenting with a different one. Have you checked the resulting asm to see if there are a lot of store/reload of vectors to stack space (which would be a sign of running out of registers)? If the asm isn't too much of a mess, you might be able to figure out how many times the loop is unrolled (perhaps from finding a pointer-increment instruction) – Peter Cordes Apr 12 '22 at 07:21
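
For reference, here is a minimal sketch of the multiple-accumulator unrolling discussed in the comments above (and in the linked answer) applied to a plain dot product; this is a generic illustration with made-up names, not the convolution kernel from the paper:

```c
#include <immintrin.h>

/* Dot product with 8 independent accumulators: 8 FMA chains in flight,
 * enough to cover N_fma * L_fma = 2 * 4 = 8 on Skylake.
 * Assumes n is a multiple of 64; compile with -mavx2 -mfma. */
float dot_unrolled8(const float *a, const float *b, long n)
{
    __m256 acc[8];
    for (int k = 0; k < 8; k++)
        acc[k] = _mm256_setzero_ps();

    for (long i = 0; i < n; i += 64)
        for (int k = 0; k < 8; k++)
            acc[k] = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8 * k),
                                     _mm256_loadu_ps(b + i + 8 * k),
                                     acc[k]);

    /* reduce the 8 accumulators to a single scalar */
    for (int k = 1; k < 8; k++)
        acc[0] = _mm256_add_ps(acc[0], acc[k]);
    __m128 lo = _mm256_castps256_ps128(acc[0]);
    __m128 hi = _mm256_extractf128_ps(acc[0], 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```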

0 Answers