
I have just begun playing around with vectorising my code. My matrix-vector multiplication code is not being auto-vectorised by gcc, and I'd like to know why. This pastebin contains the output from `-fopt-info-vec-missed`.

I’m having trouble understanding what the output is telling me and seeing how it matches up to what I’ve written in code.

For instance, I see a number of lines saying `not enough data-refs in basic block`, but I can't find much detail about this with a Google search. I also see that there are issues relating to memory alignment, e.g. `Unknown misalignment`, `naturally aligned` and `vector alignment may not be reachable`. All of my memory allocation was for double types using malloc, which I believed was guaranteed to be aligned for that type.

Environment: compiling with gcc on WSL2

gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4000 // Matrix size will be N x N
#define T 1

//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt

void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for simd private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        // double tmp = 0;

        for (j = 0; j < matrixSize; j++)
        {
            results[i] += A[i * matrixSize + j] * V[j];
            // also tried tmp += A[i * matrixSize + j] * V[j];
        }
        // results[i] = tmp;
    }

}

void genRandVector(double *S, unsigned long size)
{
    srand(time(0));
    unsigned long i;

    for (i = 0; i < size; i++)
    {
        double n = rand() % 5;
        S[i] = n;
    }
}

void genRandMatrix(double *A, unsigned long size)
{
    srand(time(0));
    unsigned long i, j;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            double n = rand() % 5;
            A[i*size + j] = n;
        }
    }
}

int main(int argc, char *argv[])
{

    double *V = (double *)malloc(N * sizeof(double));     // v in our A*v = parV computation
    double *parV = (double *)malloc(N * sizeof(double));  // Parallel computed vector
    double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
    genRandVector(V, N);
    genRandMatrix(A, N); // fill the matrix so A isn't used uninitialized
    doParallelComputation(A, V, parV, N, T);

    free(parV);
    free(A);
    free(V);
    
    return 0;
}
daviegravee
  • Yeah, GCC's auto-vectorizer would be happier if it knew both matrix and vector were aligned by 32 or 64, e.g. from C11 `aligned_alloc` (rough sketch after these comments). After inlining, it can probably see that directly without further hints like `__builtin_assume_aligned` – Peter Cordes Oct 11 '20 at 04:52
  • Without `-ffast-math`, I think you need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. That might be why it's not auto-vectorizing the row dot-product that sums `A[i * matrixSize + j] * V[j];`. Hmm, no, seems that's not it either. With `-ffast-math` and with OpenMP commented out (https://godbolt.org/z/qxhGon), still no auto-vec of that. Ah, but adding `double *restrict results` did the trick, to promise non-overlapping input/output. https://godbolt.org/z/qaPh1v – Peter Cordes Oct 11 '20 at 05:05
  • A further question regarding the godbolt links. I posted below that I'm seeing speedup from the accepted answer, but when I look in godbolt [it says that it couldn't vectorise the loop](https://i.imgur.com/5VUIAqH.png). My ability to read and parse assembly isn't great (working on it as I learn parallel computing), but [I don't see any vector function calls in the generated assembly either](https://i.imgur.com/5VUIAqH.png). Where in the assembly does it jump to a block using vector calls? I see there's vector functions in `.L6` in one of the links, but I can't see how it gets there. – daviegravee Oct 11 '20 at 06:26
  • `doParallelComputation` calls `GOMP_parallel`, passing a pointer to `doParallelComputation._omp_fn.0`. That function contains `vfmadd231pd` (pd = packed double, vs. sd = scalar double). And it's using a YMM register, 32 bytes. There aren't any "vector function calls", only SIMD FP math *instructions* inside the inner loop. If you'd called `exp()` or something then GCC might have vectorized that by calling a SIMD library function, but FMA inlines directly. Re: GCC loop reports: the outer loop isn't vectorized, only parallelized. But GCC noisily talks about everything. – Peter Cordes Oct 11 '20 at 06:50
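
For reference, a minimal sketch of the `aligned_alloc` suggestion from the comment above. The 64-byte alignment and the exact call sites are assumptions on my part, not code from the question; note that C11's `aligned_alloc` wants the total size to be a multiple of the alignment, which holds here since N = 4000. Inside main, the allocations could look like:

// Hypothetical sketch, not the question's code: 64-byte aligned allocations via C11 aligned_alloc.
// 64 bytes is an assumption (one cache line); 32 would already cover 32-byte YMM loads.
// Requires <stdlib.h> and a C11 (or gnu11) compiler; still released with free().
double *V    = aligned_alloc(64, N * sizeof(double));
double *parV = aligned_alloc(64, N * sizeof(double));
double *A    = aligned_alloc(64, (size_t)N * N * sizeof(double));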

1 Answer


Adding `double *restrict results` to promise non-overlapping input/output helped, without OpenMP but with `-ffast-math`. https://godbolt.org/z/qaPh1v

You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity (`-ffast-math` doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)


With just `restrict` and no `-ffast-math` or `-fopenmp`, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x `vaddsd` into the scalar accumulator, not helping hide FP latency at all.

With `restrict` and `-fopenmp` (without fast-math), it just does scalar FMA.

With `restrict` and `-ffast-math` (without `-fopenmp`, or with the `#pragma` commented out) it auto-vectorizes nicely: `vfmadd231pd ymm` inside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize.) https://godbolt.org/z/f36oG3

With `restrict` and `-ffast-math` (with `-fopenmp`) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?


Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both inputs to the inner dot-product loop (a row of `A`, and all of `V`) are in contiguous memory, so it makes the most sense to read those, instead of trying to SIMD-shuffle data from 4 different columns into one vector to accumulate `results[i+0..3]` in one vector.

However, unrolling the outer loop by 4, so each SIMD vector of `V[j+0..3]` gets reused with data from 4 different rows of the matrix, would improve computational intensity (closer to 1 load per FMA, rather than 2).

(As long as V[] and a row of the matrix fit in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or, if you unroll the outer loop, V[] plus 4 rows of the matrix.)
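
A rough sketch of that outer-loop unroll, as an illustration rather than code from the question (the function name is made up, and it assumes matrixSize is a multiple of 4; otherwise a cleanup loop is needed):

// Hypothetical sketch: outer loop unrolled by 4, so each V[j] load feeds 4 FMAs.
void matVecUnroll4(const double *restrict A, const double *restrict V,
                   double *restrict results, unsigned long matrixSize)
{
    unsigned long i, j;
    for (i = 0; i < matrixSize; i += 4)
    {
        double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
        #pragma omp simd reduction(+:t0, t1, t2, t3)
        for (j = 0; j < matrixSize; j++)
        {
            double v = V[j];                       // one V load shared by 4 rows
            t0 += A[(i + 0) * matrixSize + j] * v;
            t1 += A[(i + 1) * matrixSize + j] * v;
            t2 += A[(i + 2) * matrixSize + j] * v;
            t3 += A[(i + 3) * matrixSize + j] * v;
        }
        results[i + 0] = t0;
        results[i + 1] = t1;
        results[i + 2] = t2;
        results[i + 3] = t3;
    }
}

With `-O3 -march=native` and the reduction pragma (or `-ffast-math`), each of the four accumulators should vectorize, and the single load of V[j+0..3] is shared across four FMAs instead of one.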


Also note that `double tmp = 0;` would be a good idea: your current version adds into `results[i]`, reading it before writing. That would require zero-init before you could use it as a pure output.

Auto-vec auto-par version:

I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.

void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        double tmp = 0;

         // TODO: unroll outer loop and cache-block it.
        #pragma omp simd reduction(+:tmp)
        for (j = 0; j < matrixSize; j++)
        {
            //results[i] += A[i * matrixSize + j] * V[j];
            tmp += A[i * matrixSize + j] * V[j];
        }
        results[i] = tmp;  // write-only to results, not adding to old value.
    }

}

Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function `doParallelComputation._omp_fn.0`:

# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
        add     rdx, 1               # loop counter; newer GCC just compares the end-pointer
        vmovupd ymm2, YMMWORD PTR [rcx+rax]            # 32-byte load
        vfmadd231pd     ymm0, ymm2, YMMWORD PTR [rsi+rax]  # 32-byte memory-source FMA
        add     rax, 32                                # pointer increment
        cmp     rdi, rdx
        ja      .L6

Then there's a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" `-ftree-vectorize` vectorizer, but that one requires `-ffast-math` to do anything here.
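
(For comparison, an efficient manual horizontal sum of a `__m256d` usually looks something like the following sketch; the helper name is invented, this is not what GCC emits here, and it needs AVX, e.g. `-march=native` on a recent CPU:)

#include <immintrin.h>

// Hypothetical helper: horizontal sum of the 4 doubles in a YMM register.
static inline double hsum256_pd(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);     // low 128-bit half
    __m128d hi   = _mm256_extractf128_pd(v, 1);   // high 128-bit half
    lo = _mm_add_pd(lo, hi);                      // add halves: 2 doubles left
    __m128d swap = _mm_unpackhi_pd(lo, lo);       // bring element 1 down to element 0
    return _mm_cvtsd_f64(_mm_add_sd(lo, swap));   // final scalar add
}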

Peter Cordes
  • Very informative answer. I am seeing some pretty respectable speedup when using this in conjunction with OpenMP as problem size grows. [See here](https://i.imgur.com/R8bOhmm.png). I plan on implementing loop tiling in the next couple of days. Regarding loop unrolling: I've tried unrolling it by adding 4 iterations in one pass, but there's a slight performance hit. [See here](https://i.imgur.com/L2DX4q5.png). What are the important considerations for loop unrolling? – daviegravee Oct 11 '20 at 06:11
  • @daviegravee: important for unrolling: cache-block it first, with that 4k x 4k size. 4k * 8 bytes per double = 32 KiB memory footprint, the size of L1d cache. You can try unrolling with smaller matrix dimensions, so e.g. the vector *and* 4 whole rows of the matrix fit in L1d cache (rough blocked sketch at the end of this thread). – Peter Cordes Oct 11 '20 at 06:46
  • Can I confirm that this is the general technique for cache-blocking? [See here](https://godbolt.org/z/6EjbYc). I'm going to revisit this tomorrow. I am getting better speedup, and it appears to be vectorised, but I'm going to have to think a little bit more carefully about how to unroll these loops appropriately. – daviegravee Oct 11 '20 at 11:26
  • @daviegravee: don't use `fmin` for integer min; that hides a lot of information from the compiler and is slower than a normal `x < y ? x : y`. – Peter Cordes Oct 11 '20 at 11:47
  • @daviegravee: related: [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) has a cache-blocked matrix multiply example (with manual vectorization). Auto-vectorization may be all you need, especially for matrix-vector, or at least good enough. – Peter Cordes Oct 11 '20 at 11:48
  • Sorry, the code example was meant to highlight the tiling, not the actual block size I would use. I'm taking a break on this for a few hours (read: sleep). Really appreciative of the help, I've learned a lot today! – daviegravee Oct 11 '20 at 11:59
  • @daviegravee: Yeah, I understood the main point, I was just pointing out the things I was able to notice without putting in a lot of effort to fully verify you had the logic right. An extra two levels of loop nesting is the right idea, though. – Peter Cordes Oct 11 '20 at 12:00
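
For anyone following along, here is a rough, hedged sketch of what cache-blocking the j (column) dimension could look like for this matrix-vector product. The function name and the JBLOCK value are assumptions of mine, not code from the thread:

// Hypothetical sketch: block the j loop so a chunk of V stays hot in L1d
// while every row of A streams past it once per block.
// JBLOCK is an assumed tuning parameter; 1024 doubles = 8 KiB.
#define JBLOCK 1024UL

void matVecBlocked(const double *restrict A, const double *restrict V,
                   double *restrict results, unsigned long matrixSize)
{
    unsigned long i, j, jj;

    for (i = 0; i < matrixSize; i++)
        results[i] = 0;                      // results accumulates across j-blocks

    for (jj = 0; jj < matrixSize; jj += JBLOCK)
    {
        unsigned long jend = jj + JBLOCK < matrixSize ? jj + JBLOCK : matrixSize;
        for (i = 0; i < matrixSize; i++)
        {
            double tmp = 0;
            #pragma omp simd reduction(+:tmp)
            for (j = jj; j < jend; j++)
                tmp += A[i * matrixSize + j] * V[j];
            results[i] += tmp;               // partial dot product for this block
        }
    }
}

The design idea is that each 8 KiB slice of V gets reused against all 4000 rows before moving on, so V no longer gets evicted between rows once the matrix is too big for cache; combining this with the 4-row outer unroll is left as the exercise discussed above.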