Please do not say this is premature micro-optimization. I want to understand, as much as possible given my limited knowledge, how the described SB feature and the assembly work, and make sure that my code actually makes use of this architectural feature. Thank you for understanding.
I started learning intrinsics only a few days ago, so the answer may seem obvious to some, but I don't have a reliable source of information to figure this out.
I need to optimize some code for a Sandy Bridge CPU (this is a requirement). I know it can issue one AVX multiply and one AVX add per cycle, and I have read this paper:
http://research.colfaxinternational.com/file.axd?file=2012%2F7%2FColfax_CPI.pdf
which shows how that can be achieved in C++. The problem is that my code does not get auto-vectorized by Intel's compiler (using it is another requirement of the task), so I decided to implement the loop manually with intrinsics like this:
__m256d __sum1, __sum2, __sum3, __aa1, __aa2, __aa3, __bb1, __bb2, __bb3;

/* three independent accumulator chains, one per unrolled step */
__sum1 = _mm256_setzero_pd();
__sum2 = _mm256_setzero_pd();
__sum3 = _mm256_setzero_pd();
sum = 0;
for (kk = k; kk < k + BS && kk < aW; kk += 12)
{
    const double *a_addr = &A[i * aW + kk];
    const double *b_addr = &newB[jj * aW + kk];

    /* step 1: aligned 32-byte loads, multiply, accumulate */
    __aa1 = _mm256_load_pd(a_addr);
    __bb1 = _mm256_load_pd(b_addr);
    __sum1 = _mm256_add_pd(__sum1, _mm256_mul_pd(__aa1, __bb1));

    /* step 2 */
    __aa2 = _mm256_load_pd(a_addr + 4);
    __bb2 = _mm256_load_pd(b_addr + 4);
    __sum2 = _mm256_add_pd(__sum2, _mm256_mul_pd(__aa2, __bb2));

    /* step 3 */
    __aa3 = _mm256_load_pd(a_addr + 8);
    __bb3 = _mm256_load_pd(b_addr + 8);
    __sum3 = _mm256_add_pd(__sum3, _mm256_mul_pd(__aa3, __bb3));
}
/* combine the three partial sums and store the 4-lane vector result */
__sum1 = _mm256_add_pd(__sum1, _mm256_add_pd(__sum2, __sum3));
_mm256_store_pd(&vsum[0], __sum1);
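For completeness, the scalar result is produced right after that store. A minimal sketch of how I finish the reduction (it assumes vsum is a 32-byte-aligned double[4], and that any leftover elements when the trip count is not a multiple of 12 are handled separately):

/* horizontal reduction of the four vector lanes into the scalar accumulator */
sum += vsum[0] + vsum[1] + vsum[2] + vsum[3];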
The reason I manually unroll the loop like this is explained here:
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
They say you need to unroll by a factor of 3 to achieve the best performance on Sandy Bridge; as I understand it, vaddpd has a 3-cycle latency but a throughput of one per cycle there, so three independent accumulator chains are needed to keep the add port busy. My naive testing confirms that this indeed runs better than no unrolling or 4-fold unrolling.
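For reference, the non-unrolled baseline I compared against looks roughly like this (same variable names as above, a single accumulator chain advancing 4 doubles per iteration):

__sum1 = _mm256_setzero_pd();
for (kk = k; kk < k + BS && kk < aW; kk += 4)
{
    const double *a_addr = &A[i * aW + kk];
    const double *b_addr = &newB[jj * aW + kk];
    /* each vaddpd depends on the previous one, so the 3-cycle
       add latency on Sandy Bridge becomes the bottleneck */
    __sum1 = _mm256_add_pd(__sum1,
             _mm256_mul_pd(_mm256_load_pd(a_addr), _mm256_load_pd(b_addr)));
}
_mm256_store_pd(&vsum[0], __sum1);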
OK, so here is the problem. The icl compiler from Intel Parallel Studio 15 generates this:
$LN149:
movsxd r14, r14d ;78.49
$LN150:
vmovupd ymm3, YMMWORD PTR [r11+r14*8] ;80.48
$LN151:
vmovupd ymm5, YMMWORD PTR [32+r11+r14*8] ;84.49
$LN152:
vmulpd ymm4, ymm3, YMMWORD PTR [r8+r14*8] ;82.56
$LN153:
vmovupd ymm3, YMMWORD PTR [64+r11+r14*8] ;88.49
$LN154:
vmulpd ymm15, ymm5, YMMWORD PTR [32+r8+r14*8] ;86.56
$LN155:
vaddpd ymm2, ymm2, ymm4 ;82.34
$LN156:
vmulpd ymm4, ymm3, YMMWORD PTR [64+r8+r14*8] ;90.56
$LN157:
vaddpd ymm0, ymm0, ymm15 ;86.34
$LN158:
vaddpd ymm1, ymm1, ymm4 ;90.34
$LN159:
add r14d, 12 ;76.57
$LN160:
cmp r14d, ebx ;76.42
$LN161:
jb .B1.19 ; Prob 82% ;76.42
To me this looks like a mess, in which the order I expected (each add right next to the multiply it consumes, which I thought was required to exploit the handy SB feature) is broken.
Question:
Will this assembly code leverage the Sandy Bridge feature I am referring to?
If not, what do I need to do in order to utilize the feature and prevent the code from becoming "tangled" like this?
Also, when there is only one loop iteration, the order is nice and clean, i.e. load, multiply, add, as it should be.