Given code of this nature:
void foo(double *restrict A, double *restrict x,
         double *restrict y) {
    y[5] += A[4] * x[5];
    y[5] += A[5] * x[1452];
    y[5] += A[6] * x[3373];
}
The result of compiling it with gcc 10.2 and the flags -O3 -mfma -mavx2 -fvect-cost-model=unlimited (Compiler Explorer) is:
foo(double*, double*, double*):
vmovsd xmm1, QWORD PTR [rdx+40]
vmovsd xmm0, QWORD PTR [rdi+32]
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi+40]
vmovsd xmm2, QWORD PTR [rdi+40]
vfmadd231sd xmm0, xmm2, QWORD PTR [rsi+11616]
vmovsd xmm3, QWORD PTR [rdi+48]
vfmadd231sd xmm0, xmm3, QWORD PTR [rsi+26984]
vmovsd QWORD PTR [rdx+40], xmm0
ret
It does not pack any data together (4 vmovsd for loading and 1 for storing), performing 3 vfmaddXXXsd. Nonetheless, my motivation to vectorize this is that it could be done using only one vfmadd231pd. My "cleanest" attempt to write this code using AVX2 intrinsics is:
#include <immintrin.h>

void foo_intrin(double *restrict A, double *restrict x,
                double *restrict y) {
    __m256d __vop0, __vop1, __vop2;
    __m128d __lo256, __hi256;

    // THE ISSUE: packing the non-contiguous operands
    __vop0 = _mm256_maskload_pd(&A[4], _mm256_set_epi64x(0, -1, -1, -1));
    __vop1 = _mm256_mask_i64gather_pd(_mm256_setzero_pd(), &x[5],
                                      _mm256_set_epi64x(0, 3368, 1447, 0),
                                      _mm256_set_pd(0, -1, -1, -1), 8);

    // 1 vs 3 FMADD, "the gain" (accumulator starts at zero)
    __vop2 = _mm256_setzero_pd();
    __vop2 = _mm256_fmadd_pd(__vop0, __vop1, __vop2);

    // reducing 4 double elements:
    // Peter Cordes' answer https://stackoverflow.com/a/49943540/2856041
    __lo256 = _mm256_castpd256_pd128(__vop2);
    __hi256 = _mm256_extractf128_pd(__vop2, 0x1);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    // question: could a shuffle be used here instead?
    // __hi256 = _mm_shuffle_pd(__lo256, __lo256, 0x1);
    __hi256 = _mm_unpackhi_pd(__lo256, __lo256);
    __lo256 = _mm_add_pd(__lo256, __hi256);

    y[5] += __lo256[0];
}
This generates the following ASM:
foo_intrin(double*, double*, double*):
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vmovapd ymm3, YMMWORD PTR .LC2[rip]
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vmaskmovpd ymm1, ymm0, YMMWORD PTR [rdi+32]
vxorpd xmm0, xmm0, xmm0
vgatherqpd ymm0, QWORD PTR [rsi+40+ymm2*8], ymm3
vxorpd xmm2, xmm2, xmm2
vfmadd132pd ymm0, ymm2, ymm1
vmovapd xmm1, xmm0
vextractf128 xmm0, ymm0, 0x1
vaddpd xmm0, xmm0, xmm1
vunpckhpd xmm1, xmm0, xmm0
vaddpd xmm0, xmm0, xmm1
vaddsd xmm0, xmm0, QWORD PTR [rdx+40]
vmovsd QWORD PTR [rdx+40], xmm0
vzeroupper
ret
.LC0:
.quad -1
.quad -1
.quad -1
.quad 0
.LC1:
.quad 0
.quad 1447
.quad 3368
.quad 0
.LC2:
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long -1074790400
.long 0
.long 0
I am deeply sorry if this gives anyone an anxiety attack. Let's break this down:
- I guess those vxorpd are for cleaning registers, but icc generates just one, not two.
- According to Agner Fog, VCL does not use maskload in AVX2 since "masked instructions are very slow in instruction sets prior to AVX512". However, uops.info reports, for Skylake ("regular", no AVX-512), that:
  - VMOVAPD (YMM, M256), e.g. _mm256_load_pd, has latencies [≤5;≤8] and a throughput of 0.5.
  - VMASKMOVPD (YMM, YMM, M256), e.g. _mm256_maskload_pd, has latencies [1;≤9] and a throughput of 0.5 as well, but decodes into two uops instead of one.

  Is this difference really so huge? Is it better to pack the data in a different manner? (One alternative packing for A is sketched after the final paragraph.)
- Regarding the mask_gather-style instructions: as far as I understand from all the documentation above, they deliver the same performance whether a mask is used or not; is this correct? Both uops.info and the Intel Intrinsics Guide report the same performance and ASM form; it is probable that I am missing something.
  - Is a gather better than a "simple" set in all cases, speaking in intrinsics terms? I am aware that set generates vmov-type instructions depending on the type of data (e.g. if the data are constants it may just load from an address, as with .LC0, .LC1 and .LC2). The set-based variant I mean is sketched right after this list.
- According to the Intel Intrinsics Guide, _mm256_shuffle_pd and _mm256_unpackhi_pd have the same latency and throughput; the first one generates a vpermilpd and the second a vunpckhpd, and uops.info also reports the same values. Is there any difference?
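For reference, this is the set-based packing of the x operand I have in mind; it is only a sketch (the helper name is mine and I have not benchmarked it), but it replaces the vgatherqpd with plain scalar loads that the compiler packs itself:

#include <immintrin.h>

// Sketch only: build the x operand with _mm256_set_pd instead of a gather.
// Lane order in _mm256_set_pd is (e3, e2, e1, e0), so this matches the
// masked gather above: lanes {x[5], x[1452], x[3373], 0.0}.
static inline __m256d pack_x_with_set(const double *restrict x) {
    return _mm256_set_pd(0.0, x[3373], x[1452], x[5]);
}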
Last but not least, is this ad-hoc vectorization worth it? I do not just mean my intrinsics code, but the concept of vectorizing code like this in general. I suspect there are too many data movements compared with the clean code that compilers generally produce, so my concern is to improve the way of packing this non-contiguous data.
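Regarding the packing itself, one alternative I have been considering for the A operand (again only a sketch with a made-up helper name, not measured): since A[4..6] are contiguous, two ordinary loads plus an insert avoid the vmaskmovpd entirely.

#include <immintrin.h>

// Sketch: pack A[4], A[5], A[6] without vmaskmovpd.
// _mm_load_sd zeroes its upper element, so lane 3 of the result is 0.0,
// just like with the masked load above.
static inline __m256d pack_A_without_maskload(const double *restrict A) {
    __m128d lo = _mm_loadu_pd(&A[4]);  // A[4], A[5]
    __m128d hi = _mm_load_sd(&A[6]);   // A[6], 0.0
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}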