First, EOF is right: you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Better still, use a more efficient method to get a horizontal sum of your vector (one common idiom is sketched below).
(see the bottom for a suggestion on a different data arrangement for the matrix)
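For reference, here is one common way to write that horizontal sum. This is a minimal sketch; the name hsum_pd256 is just the placeholder used in the code further down:

#include <immintrin.h>

static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);     // low half: no instruction needed
    __m128d hi = _mm256_extractf128_pd(v, 1);   // high half
    lo = _mm_add_pd(lo, hi);                    // { v0+v2, v1+v3 }
    __m128d hi64 = _mm_unpackhi_pd(lo, lo);     // broadcast the upper double
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi64)); // (v0+v2) + (v1+v3)
}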
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores.) (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation.)
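As a hedged illustration of that transpose-and-sum option, here is one well-known shuffle sequence that reduces four accumulators to a single vector holding their four horizontal sums (hsum4_pd256 is a name invented for this sketch, not something from the question's code):

static inline __m256d hsum4_pd256(__m256d v0, __m256d v1, __m256d v2, __m256d v3) {
    __m256d t01 = _mm256_hadd_pd(v0, v1); // { v0[0]+v0[1], v1[0]+v1[1], v0[2]+v0[3], v1[2]+v1[3] }
    __m256d t23 = _mm256_hadd_pd(v2, v3); // same for v2 / v3
    // line up the matching in-lane partial sums, then add them
    __m256d swapped = _mm256_permute2f128_pd(t01, t23, 0x21); // { t01.hi, t23.lo }
    __m256d blended = _mm256_blend_pd(t01, t23, 0xC);         // { t01.lo, t23.hi }
    return _mm256_add_pd(swapped, blended); // { hsum(v0), hsum(v1), hsum(v2), hsum(v3) }
}

A single _mm256_storeu_pd of the result then writes four temp[i] values at once.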
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (Check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki.) Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
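For example, a minimal sketch of that last variant (hypothetical loop body: two accumulators both feeding the same temp[i], using the same array names as the code below):

__m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~7ULL); j += 8) {
    acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j),
                           _mm256_loadu_pd(&matrix_cap[i*npages + j]),     acc0);
    acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j + 4),
                           _mm256_loadu_pd(&matrix_cap[i*npages + j + 4]), acc1);
}
temp[i] = hsum_pd256(_mm256_add_pd(acc0, acc1));  // vertical add, then one hsum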
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC.) You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times in total.
So you should do something like
#define IDX(a, b) (((a) * npages) + (b)) // 2D matrix indexing

for (size_t i = 0; i < (npages & ~7ULL); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~3ULL); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P + j);   // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0, j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1, j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7, j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // see the horizontal-sum sketch earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple of scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
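A rough sketch of what that might look like, under the same assumptions as the block above (you'd still want to inspect the generated asm):

__m256d sums[8];
for (int k = 0; k < 8; k++)
    sums[k] = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~3ULL); j += 4) {
    __m256d Pj = _mm256_loadu_pd(P + j);
    for (int k = 0; k < 8; k++)   // must fully unroll and keep sums[] in YMM registers
        sums[k] = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+k, j)]), sums[k]);
}
for (int k = 0; k < 8; k++)
    temp[i+k] = hsum_pd256(sums[k]);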
"How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?"
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so there is no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB, where they also take a shuffle-port uop.
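A minimal sketch of that layout, assuming npages is a multiple of 4 and the matrix has been repacked into a hypothetical matrix_striped[] where element (i, j) lives at index (i/4)*npages*4 + j*4 + (i%4). For clarity it processes only one stripe of 4 i values per broadcast; in practice you'd keep several stripes live per P[j] broadcast to get the reuse described above:

for (size_t i = 0; i < npages; i += 4) {
    const double *stripe = &matrix_striped[i * npages]; // 4 interleaved rows
    __m256d acc = _mm256_setzero_pd();
    for (size_t j = 0; j < npages; j++) {
        __m256d Pj = _mm256_broadcast_sd(&P[j]);   // vbroadcastsd: just a load uop on Haswell+
        acc = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(stripe + j*4), acc);
    }
    _mm256_storeu_pd(&temp[i], acc);   // 4 results at once, no horizontal sums
}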