First, EOF is right: you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Better still, use a more efficient method to get a horizontal sum of your vector (one common idiom is sketched below).
(see the bottom for a suggestion on a different data arrangement for the matrix)
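For reference, here is one common way to write that horizontal sum. This is a minimal sketch; the name hsum_pd256 is just the placeholder used in the code further down:

#include <immintrin.h>

static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);     // low half: no instruction needed
    __m128d hi = _mm256_extractf128_pd(v, 1);   // high half
    lo = _mm_add_pd(lo, hi);                    // { v0+v2, v1+v3 }
    __m128d hi64 = _mm_unpackhi_pd(lo, lo);     // broadcast the upper double
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi64)); // (v0+v2) + (v1+v3)
}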
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores.) (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation.)
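As a hedged illustration of that transpose-and-sum option, here is one well-known shuffle sequence that reduces four accumulators to a single vector holding their four horizontal sums (hsum4_pd256 is a name invented for this sketch, not something from the question's code):

static inline __m256d hsum4_pd256(__m256d v0, __m256d v1, __m256d v2, __m256d v3) {
    __m256d t01 = _mm256_hadd_pd(v0, v1); // { v0[0]+v0[1], v1[0]+v1[1], v0[2]+v0[3], v1[2]+v1[3] }
    __m256d t23 = _mm256_hadd_pd(v2, v3); // same for v2 / v3
    // line up the matching in-lane partial sums, then add them
    __m256d swapped = _mm256_permute2f128_pd(t01, t23, 0x21); // { t01.hi, t23.lo }
    __m256d blended = _mm256_blend_pd(t01, t23, 0xC);         // { t01.lo, t23.hi }
    return _mm256_add_pd(swapped, blended); // { hsum(v0), hsum(v1), hsum(v2), hsum(v3) }
}

A single _mm256_storeu_pd of the result then writes four temp[i] values at once.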
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (Check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki.) Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
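For example, a minimal sketch of that last variant (hypothetical loop body: two accumulators both feeding the same temp[i], using the same array names as the code below):

__m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~7ULL); j += 8) {
    acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j),
                           _mm256_loadu_pd(&matrix_cap[i*npages + j]),     acc0);
    acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j + 4),
                           _mm256_loadu_pd(&matrix_cap[i*npages + j + 4]), acc1);
}
temp[i] = hsum_pd256(_mm256_add_pd(acc0, acc1));  // vertical add, then one hsum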
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC.) You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times in total.
So you should do something like
#define IDX(a, b) (((a) * npages) + (b)) // 2D matrix indexing

for (size_t i = 0; i < (npages & ~7ULL); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~3ULL); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P + j);   // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0, j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1, j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7, j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // see the horizontal-sum sketch earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple of scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
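A rough sketch of what that might look like, under the same assumptions as the block above (you'd still want to inspect the generated asm):

__m256d sums[8];
for (int k = 0; k < 8; k++)
    sums[k] = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~3ULL); j += 4) {
    __m256d Pj = _mm256_loadu_pd(P + j);
    for (int k = 0; k < 8; k++)   // must fully unroll and keep sums[] in YMM registers
        sums[k] = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+k, j)]), sums[k]);
}
for (int k = 0; k < 8; k++)
    temp[i+k] = hsum_pd256(sums[k]);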
"How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?"
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so there is no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB, where they also take a shuffle-port uop.
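A minimal sketch of that layout, assuming npages is a multiple of 4 and the matrix has been repacked into a hypothetical matrix_striped[] where element (i, j) lives at index (i/4)*npages*4 + j*4 + (i%4). For clarity it processes only one stripe of 4 i values per broadcast; in practice you'd keep several stripes live per P[j] broadcast to get the reuse described above:

for (size_t i = 0; i < npages; i += 4) {
    const double *stripe = &matrix_striped[i * npages]; // 4 interleaved rows
    __m256d acc = _mm256_setzero_pd();
    for (size_t j = 0; j < npages; j++) {
        __m256d Pj = _mm256_broadcast_sd(&P[j]);   // vbroadcastsd: just a load uop on Haswell+
        acc = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(stripe + j*4), acc);
    }
    _mm256_storeu_pd(&temp[i], acc);   // 4 results at once, no horizontal sums
}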