If there's a closed-form formula for n steps ahead, you can use it to sidestep the serial dependency. If the n-step form can be computed with the same operations as a single step, just with different coefficients, then a broadcast of those coefficients is all you need.
In this case, `z1 = c + z1 * b`, so applying that twice we get:
```
# I'm using Z0..Zn for the elements of the series your loop calculates
Z2 = c + (c + Z0*b)*b
   = c + c*b + Z0*b^2
```
`c + c*b` and `b^2` are both constants, if I'm understanding your code correctly that the C variables really are scalar constants, not pseudocode for array references. (So everything except your `z1` is loop-invariant.)
So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.
```c++
#include <immintrin.h>

// mSmoothedValues: assumed to be a member (or global) double array
void SmoothBlock(int blockSize, double b, double c, double z_init) {
    // z1 = inputA0 + z1 * b1;
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);
    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);
    for (int i = 0; i < blockSize-1; i += 2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
        // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle the final odd element, if any
    if (blockSize % 2 != 0)
        _mm_store_sd(mSmoothedValues + blockSize - 1, zv);
}
```
With `float` + AVX (8 elements per vector), you'd have
```c++
__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
                           c + c*b + z_init*b*b,
                           c + c*b + c*b*b + z_init*b*b*b, ...);
// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
//    = c + c*b + c*b^2 + Z0*b^3
```
and the add/mul factors would be for 8 steps.
Normally people use `float` for SIMD because you get twice as many elements per vector, and half the memory bandwidth / cache footprint, so you typically get a factor-of-2 speedup over `double`. (Same number of vectors / bytes per clock.)
The above loop, on a Haswell or Sandybridge CPU for example, will run at one vector per 8 cycles, bottlenecked on the latency of `mulpd` (5 cycles) + `addpd` (3 cycles). We generate 2 `double` results per vector, but that's still a huge bottleneck compared to the 1 mul and 1 add per clock those CPUs can sustain in throughput. We're missing out on a factor of 8 in throughput. (Or, if compiled with one FMA instead of mul->add, we have 5-cycle latency.)
Sidestepping the serial dependency is useful not just for SIMD: avoiding bottlenecks on FP add/mul (or FMA) latency gives further speedups, up to the ratio of FP add+mul latency to add+mul throughput.
Simply unroll more, and use multiple vectors, like `zv0`, `zv1`, `zv2`, `zv3`. This also increases the number of steps you make at once: for example, 16-byte vectors of `float`, with 4 vectors, would be 4x4 = 16 steps.