I'm using the Xbyak library with AVX-512 instructions to calculate the dot product of two vectors.

The first unrolling strategy: load 16*4 float32 values from each vector, then do the FMA operations.

// A and B are the registers that hold the base addresses of vectors A and B
// zmm0-zmm3 load 16*4 float32 values from A
// zmm4-zmm7 load 16*4 float32 values from B
// zmm8 holds the running accumulation
vmovups(zmm0, ptr[A]);
vmovups(zmm1, ptr[A+64]);
vmovups(zmm2, ptr[A+64*2]);
vmovups(zmm3, ptr[A+64*3]);

vmovups(zmm4, ptr[B]);
vmovups(zmm5, ptr[B+64]);
vmovups(zmm6, ptr[B+64*2]);
vmovups(zmm7, ptr[B+64*3]);

vfmadd231ps(zmm8, zmm0, zmm4);
vfmadd231ps(zmm8, zmm1, zmm5);
vfmadd231ps(zmm8, zmm2, zmm6);
vfmadd231ps(zmm8, zmm3, zmm7);

The second unrolling strategy: load 16 float32 values from each vector, do the FMA operation, and repeat.

vmovups(zmm0, ptr[A]);
vmovups(zmm4, ptr[B]);
vfmadd231ps(zmm8, zmm0, zmm4);

vmovups(zmm1, ptr[A+64]);
vmovups(zmm5, ptr[B+64]);
vfmadd231ps(zmm8, zmm1, zmm5);

vmovups(zmm2, ptr[A+64*2]);
vmovups(zmm6, ptr[B+64*2]);
vfmadd231ps(zmm8, zmm2, zmm6);

vmovups(zmm3, ptr[A+64*3]);
vmovups(zmm7, ptr[B+64*3]);
vfmadd231ps(zmm8, zmm3, zmm7);

Do the two unrolling strategies make a difference, or does out-of-order execution make the two versions perform the same? Which is preferred? If neither strategy is optimal, what is the optimal way to do this?


I also tried using zmm8, zmm9, zmm10, and zmm11 to hold the temporary accumulations and, at the end of the loop, using vaddps to add the 4 registers together. Can this make the code run faster?
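To make the multiple-accumulator idea concrete, here is a minimal portable C++ sketch, not Xbyak code. Each scalar partial sum stands in for one of the zmm8..zmm11 accumulators, and the final addition stands in for the vaddps step. The function name `dot4` and the use of `std::vector<float>` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with four independent partial sums (hypothetical helper).
// Each sum is its own dependency chain, so the update of s1 does not have
// to wait for the previous update of s0 to complete.
float dot4(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums once, after the loop (the vaddps step).
    float sum = (s0 + s1) + (s2 + s3);
    // Scalar tail for lengths that are not a multiple of 4.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

Splitting the sum across several accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.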

I'm going to write some benchmarks to see how the two pieces of code perform in the real world. But I think there must be a preferred way to write such code. I did some Google searching but did not find much information about this question.

haipeng
  • 2
    OoO exec means scheduling doesn't make much difference. You want memory-source FMA to save front-end bandwidth, not two separate `vmovups` instructions. – Peter Cordes Dec 29 '22 at 11:11
  • 2
    Unrolling but accumulating into a single register won't help (significantly), this only wastes registers -- also, there is no benefit in having a second `vmovups` instead of using a memory source in `vfmadd`. Having multiple accumulators will help, since it hides the latency of `vfmadd`. – chtz Dec 29 '22 at 11:12
  • 2
    You definitely want at least 4 different accumulator registers, preferably 8 or more (4 cycle latency, 0.5 c throughput = 8 in flight on Skylake-avx512 and most later Intel, although 2 loads per FMA bottlenecks on load throughput so 4 accumulators would be just barely enough). See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) - it has some real benchmarks on Skylake with YMM registers, which is very similar if data's hot in L1d cache. – Peter Cordes Dec 29 '22 at 11:12
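
The arithmetic in the last comment can be written out as a small calculation. This is only a sketch: the helper name `accumulators_needed` is hypothetical, and the figures passed to it are the ones quoted in the comment, not measurements.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: number of independent accumulators needed to keep
// the FMA units busy = FMA latency (cycles) * FMAs sustained per cycle.
int accumulators_needed(int fma_latency_cycles, double fmas_per_cycle) {
    assert(fma_latency_cycles > 0 && fmas_per_cycle > 0.0);
    return static_cast<int>(std::ceil(fma_latency_cycles * fmas_per_cycle));
}
```

With the quoted 4-cycle latency and 0.5-cycle throughput (2 FMAs per cycle), `accumulators_needed(4, 2.0)` gives 8. If each FMA needs two loads and the core sustains two loads per cycle (the assumption implied by the comment's load-throughput remark), the FMA rate drops to 1 per cycle and `accumulators_needed(4, 1.0)` gives 4.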
