I'm using the Xbyak library and AVX-512 instructions to compute the dot product of two float32 vectors.
The first unrolling strategy: load 16*4 float32 values from each vector first, then do all four FMA operations.
// A and B are registers holding the base addresses of vectors A and B
// zmm0..zmm3 are loaded with 16*4 float32 values from A
// zmm4..zmm7 are loaded with 16*4 float32 values from B
// zmm8 holds the running accumulation
vmovups(zmm0, ptr[A]);
vmovups(zmm1, ptr[A+64]);
vmovups(zmm2, ptr[A+64*2]);
vmovups(zmm3, ptr[A+64*3]);
vmovups(zmm4, ptr[B]);
vmovups(zmm5, ptr[B+64]);
vmovups(zmm6, ptr[B+64*2]);
vmovups(zmm7, ptr[B+64*3]);
vfmadd231ps(zmm8, zmm0, zmm4);
vfmadd231ps(zmm8, zmm1, zmm5);
vfmadd231ps(zmm8, zmm2, zmm6);
vfmadd231ps(zmm8, zmm3, zmm7);
The second unrolling strategy: load 16 float32 values from each vector, do the FMA right away, and repeat this four times.
vmovups(zmm0, ptr[A]);
vmovups(zmm4, ptr[B]);
vfmadd231ps(zmm8, zmm0, zmm4);
vmovups(zmm1, ptr[A+64]);
vmovups(zmm5, ptr[B+64]);
vfmadd231ps(zmm8, zmm1, zmm5);
vmovups(zmm2, ptr[A+64*2]);
vmovups(zmm6, ptr[B+64*2]);
vfmadd231ps(zmm8, zmm2, zmm6);
vmovups(zmm3, ptr[A+64*3]);
vmovups(zmm7, ptr[B+64*3]);
vfmadd231ps(zmm8, zmm3, zmm7);
Do the two unrolling strategies make any difference, or does the out-of-order execution unit make them perform the same? Which one is preferred? And if neither is optimal, what is the optimal way to do this?
I also tried using zmm8, zmm9, zmm10 and zmm11 to hold four separate accumulations and, at the end of the loop, adding the four registers together with vaddps. Can that make the code run faster?
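To make the four-accumulator idea concrete, here is a minimal sketch of it as a portable scalar C++ model rather than Xbyak code. Each inner array stands in for one of zmm8..zmm11, and the final loop stands in for the vaddps combine plus a horizontal sum. The names dot_four_acc, kLanes and kAccs are made up for illustration, and the sketch assumes the length is a multiple of 64 floats. It only shows the arithmetic, so it says nothing about speed. The one structural difference it does make visible is that each accumulator is updated once per iteration, whereas in both snippets above every vfmadd231ps reads and writes the same zmm8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 16;  // float32 lanes in one zmm register
constexpr std::size_t kAccs = 4;    // models zmm8, zmm9, zmm10, zmm11

// Scalar model of a dot product with four independent accumulators.
// Hypothetical helper: assumes a.size() == b.size() and that the size
// is a multiple of kLanes * kAccs (64 floats), so there is no tail loop.
float dot_four_acc(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    assert(a.size() % (kLanes * kAccs) == 0);

    // acc[r] plays the role of accumulator register zmm(8 + r); all lanes start at 0.
    std::array<std::array<float, kLanes>, kAccs> acc{};

    for (std::size_t i = 0; i < a.size(); i += kLanes * kAccs) {
        for (std::size_t r = 0; r < kAccs; ++r) {
            // Models one vfmadd231ps into accumulator r for the 16 floats
            // starting at element offset i + r * 16.
            for (std::size_t l = 0; l < kLanes; ++l) {
                const std::size_t k = i + r * kLanes + l;
                acc[r][l] += a[k] * b[k];
            }
        }
    }

    // Models the vaddps combine after the loop, then a horizontal sum of the lanes.
    float total = 0.0f;
    for (std::size_t l = 0; l < kLanes; ++l) {
        total += (acc[0][l] + acc[1][l]) + (acc[2][l] + acc[3][l]);
    }
    return total;
}
```

Because the additions happen in a different order than in a single-accumulator loop, the float32 result can differ from it in the last bits for general inputs.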
I'm going to write some benchmarks to see how the two pieces of code perform in the real world, but I think there must be a preferred way to write such code. I did some searching on Google but didn't find much information about this question.
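For the benchmarks, a timing helper along these lines is roughly what I have in mind. This is only a sketch: best_time_ns is a made-up name, the kernel argument would be a call into the generated Xbyak function, and taking the minimum over several repeats is just one common way to reduce noise, not necessarily the best one.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Runs `kernel` `repeats` times and returns the shortest wall-clock time
// of a single run, in nanoseconds. Hypothetical helper for illustration.
template <typename F>
std::int64_t best_time_ns(F&& kernel, int repeats) {
    assert(repeats > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best) best = ns;
    }
    return best;
}
```

Each variant would be timed on the same input, for example best_time_ns([&] { result = kernel_fn(a, b, n); }, 100), where kernel_fn stands for whatever function pointer the code generator returns. The result should be used afterwards so the call is not optimized away.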