> Why? Is `_mm_setr_pd` expensive?
Somewhat; it takes at least a shuffle. More importantly in this case, computing each scalar operand is expensive, and as @spectras' answer shows, gcc at least fails to auto-vectorize that into `paddd` / `cvtdq2pd`. Instead it re-computes each operand from a scalar integer, doing the `int` -> `double` conversion separately, then shuffles those together.
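For reference, the `paddd` / `cvtdq2pd` pattern the compiler fails to find looks like this when written by hand with intrinsics. A minimal sketch: the function name and the `base` / `scale` parameters are stand-ins (not from the question), and it assumes an even `n`:

```c
#include <emmintrin.h>

void fill_converted(double *out, int n, double base, double scale) {
    __m128i vi     = _mm_setr_epi32(0, 1, 0, 0);  // element indices in the low two lanes
    __m128i vinc   = _mm_set1_epi32(2);
    __m128d vbase  = _mm_set1_pd(base);
    __m128d vscale = _mm_set1_pd(scale);
    for (int i = 0; i < n; i += 2) {
        __m128d vd = _mm_cvtepi32_pd(vi);          // cvtdq2pd: 2x int -> 2x double
        _mm_storeu_pd(out + i, _mm_add_pd(vbase, _mm_mul_pd(vd, vscale)));
        vi = _mm_add_epi32(vi, vinc);              // paddd: advance both indices by 2
    }
}
```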
> This is the function I'm trying to optimize:
You're simply filling an array with a linear function. You're re-multiplying every time inside the loop. That avoids a loop-carried dependency on anything except the integer loop counter, but you run into throughput bottlenecks from doing so much work inside the loop.
i.e. you're computing `a[i] = c + i*scale` separately for every step. But instead you can strength-reduce that to `a[i+n] = a[i] + (n*scale)`. So you only have one `addpd` instruction per vector of results.
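In scalar terms, the transformation looks like this (a sketch with hypothetical names, just to make it concrete):

```c
// Before: a multiply and an add per element.
void fill_naive(double *a, int n, double c, double scale) {
    for (int i = 0; i < n; i++)
        a[i] = c + i * scale;
}

// After strength reduction: one loop-carried add per element.
void fill_reduced(double *a, int n, double c, double scale) {
    double v = c;
    for (int i = 0; i < n; i++) {
        a[i] = v;
        v += scale;
    }
}
```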
This will introduce some rounding error that accumulates vs. redoing the computation from scratch, but `double` is probably overkill for what you're doing anyway.
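If that drift ever does matter, one common mitigation (my suggestion, not from the original answer) is to re-seed the running value from the exact formula every chunk, so error can only accumulate over a bounded run:

```c
enum { CHUNK = 4096 };  // hypothetical re-seed interval

void fill_resynced(double *a, int n, double c, double scale) {
    for (int base = 0; base < n; base += CHUNK) {
        double v = c + (double)base * scale;        // exact value, one mul per chunk
        int end = (base + CHUNK < n) ? base + CHUNK : n;
        for (int i = base; i < end; i++) {
            a[i] = v;
            v += scale;                             // strength-reduced inside the chunk
        }
    }
}
```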
It also comes at the cost of introducing a serial dependency on an FP add instead of integer. But you already have a loop-carried FP add dependency chain in your "optimized" version that uses `sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);` inside the loop: that's FP `+= 2.0` instead of re-converting from integer.
So you'll want to unroll with multiple vectors to hide that FP latency, and keep at least 3 or 4 FP additions in flight at once. (Haswell: 3 cycle latency, one per clock throughput. Skylake: 4 cycle latency, 2 per clock throughput.) See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators for a similar problem with loop-carried dependencies (a dot product).
```c
void Process(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double val0 = start + step * delta;
    double deltaValue = rate * delta;

    __m128d vdelta2 = _mm_set1_pd(2 * deltaValue);
    __m128d vdelta4 = _mm_add_pd(vdelta2, vdelta2);

    __m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
    __m128d v1 = _mm_add_pd(v0, vdelta2);
    __m128d v2 = _mm_add_pd(v0, vdelta4);
    __m128d v3 = _mm_add_pd(v1, vdelta4);

    __m128d vdelta8 = _mm_mul_pd(vdelta2, _mm_set1_pd(4.0));

    double *endp = pC + blockSize - 7;  // stop if there's only room for 7 or fewer doubles
        // or use -8 and have your cleanup handle lengths of 1..8
        // since the inner loop always calculates results for the next iteration
    for (; pC < endp; pC += 8) {
        _mm_store_pd(pC, v0);
        v0 = _mm_add_pd(v0, vdelta8);

        _mm_store_pd(pC + 2, v1);
        v1 = _mm_add_pd(v1, vdelta8);

        _mm_store_pd(pC + 4, v2);
        v2 = _mm_add_pd(v2, vdelta8);

        _mm_store_pd(pC + 6, v3);
        v3 = _mm_add_pd(v3, vdelta8);
    }
    // if (blockSize % 8 != 0) ... store final vectors
}
```
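The elided cleanup could look something like this. A minimal sketch under the assumptions above, not part of the original answer: the `-7` bound leaves 0..7 doubles to store, and `v0`..`v3` already hold the next 8 values in order.

```c
    // Hypothetical cleanup sketch: spill the next 8 ramp values to a scratch
    // buffer, then copy however many of them (0..7) still fit in the array.
    double tail[8];
    _mm_storeu_pd(tail,     v0);
    _mm_storeu_pd(tail + 2, v1);
    _mm_storeu_pd(tail + 4, v2);
    _mm_storeu_pd(tail + 6, v3);
    double *realEnd = c[voiceIndex] + blockSize;  // one past the last element
    for (int k = 0; pC + k < realEnd; k++)
        pC[k] = tail[k];
```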
The choice of whether to add or multiply when building up `vdelta4` / `vdelta8` is not very significant; I tried to avoid too long a dependency chain before the first stores can happen. Since `v0` through `v3` need to be calculated as well, it seemed to make sense to create a `vdelta4` instead of just making a chain of `v2 = v1 + vdelta2`. Maybe it would have been better to create `vdelta4` with a multiply from `4.0 * deltaValue`, and double that to get `vdelta8` (see the sketch below). This could be relevant for very small block sizes, especially if you cache-block your code by only generating small chunks of this array as needed, right before it will be read.
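That alternative would replace the setup lines in `Process` with something like this (a sketch of the variant just described, not what the compiled version below used):

```c
    // Variant: build vdelta4 with one scalar multiply, then double it.
    // This shortens the dependency chain in front of v2/v3 by one addpd.
    __m128d vdelta2 = _mm_set1_pd(2.0 * deltaValue);
    __m128d vdelta4 = _mm_set1_pd(4.0 * deltaValue);
    __m128d vdelta8 = _mm_add_pd(vdelta4, vdelta4);
    __m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
    __m128d v1 = _mm_add_pd(v0, vdelta2);
    __m128d v2 = _mm_add_pd(v0, vdelta4);
    __m128d v3 = _mm_add_pd(v1, vdelta4);
```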
Anyway, this compiles to a very efficient inner loop with gcc and MSVC (on the Godbolt compiler explorer).
```asm
;; MSVC -O2
$LL4@Process:                            ; do {
    movups  XMMWORD PTR [rax], xmm5
    movups  XMMWORD PTR [rax+16], xmm0
    movups  XMMWORD PTR [rax+32], xmm1
    movups  XMMWORD PTR [rax+48], xmm2
    add     rax, 64                      ; 00000040H
    addpd   xmm5, xmm3                   ; v0 += vdelta8
    addpd   xmm0, xmm3                   ; v1 += vdelta8
    addpd   xmm1, xmm3                   ; v2 += vdelta8
    addpd   xmm2, xmm3                   ; v3 += vdelta8
    cmp     rax, rcx
    jb      SHORT $LL4@Process           ; }while(pC < endp)
```
This has 4 separate dependency chains, through xmm0, 1, 2, and 5. So there's enough instruction-level parallelism to keep 4 `addpd` instructions in flight. This is more than enough for Haswell, but half of what Skylake can sustain.
Still, with a store throughput of 1 vector per clock, more than 1 `addpd` per clock isn't useful. In theory this can run at about 16 bytes per clock cycle, saturating store throughput, i.e. 1 vector / 2 `double`s per clock.
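To spell out the arithmetic behind that estimate: each iteration stores 4 vectors × 16 bytes = 64 bytes, and at 1 vector store per clock those 4 stores gate the loop at a minimum of 4 cycles per iteration, giving 64 B / 4 cycles = 16 bytes (2 `double`s) per clock.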
AVX with wider vectors (4 `double`s) could still go at 1 vector per clock on Haswell and later, i.e. 32 bytes per clock. (Assuming the output array is hot in L1d cache, or possibly even L2.)
Even better: don't store this data in memory at all; re-generate it on the fly, right where it's needed, if the code consuming it only reads it a few times and is also manually vectorized.
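A sketch of what that can look like: a hypothetical consumer (not from the original answer) that scales a buffer by the ramp. The point is that `v0` stays in a register and never round-trips through memory; it assumes an even `blockSize`, and would want the same multi-accumulator unrolling as above to hide the `addpd` latency.

```c
#include <emmintrin.h>

void ApplyRamp(double *samples, int blockSize, double val0, double deltaValue) {
    __m128d v0      = _mm_setr_pd(val0, val0 + deltaValue);
    __m128d vdelta2 = _mm_set1_pd(2.0 * deltaValue);
    for (int i = 0; i < blockSize; i += 2) {
        __m128d s = _mm_loadu_pd(samples + i);
        _mm_storeu_pd(samples + i, _mm_mul_pd(s, v0));  // consume the ramp value here
        v0 = _mm_add_pd(v0, vdelta2);                   // regenerate on the fly, no store
    }
}
```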