Automatic vectorization of matrix multiplication

Question

I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me.

So I looked at this post and wanted to do more or less the same thing, but with GCC 5.4.0 on 64-bit Linux, for a Kaby Lake processor.

I essentially have this function:

/* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++)
        {
            double tmp = 0.0;

            /* dot product of row i of m1 and column j of m2 */
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];

            m3[i][j] = tmp;
        }
}

I compile it with -O2 -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5, but I don't see any message saying that vectorization was done.

If anyone could help me out, that would be very much appreciated.

`-fopt-info-all-vec` (see doc for details about it) gives more information. One key element in the output of `-fopt-info-missed-vec` is "reduction: unsafe fp math optimization: tmp_40 = _16 + tmp_49;" which essentially means you need -ffast-math (or something slightly weaker) to vectorize. — Marc Glisse, Apr 06 '17 at 07:31
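
The "unsafe fp math" diagnostic reflects the fact that floating-point addition is not associative, and vectorizing the `tmp +=` reduction means adding the terms in a different order. A minimal sketch of that effect follows. The helper names `sum_in_order` and `sum_two_lanes` are made up for illustration, and the second one only imitates a two-lane SIMD reduction by hand; it is not what GCC literally emits.

```c
#include <stddef.h>

/* Sums left to right, the order the C source prescribes. */
double sum_in_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical stand-in for a 2-lane vector reduction: even and odd
 * elements are accumulated separately and combined at the end. */
double sum_two_lanes(const double *a, size_t n)
{
    double lane0 = 0.0, lane1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += a[i];
        lane1 += a[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        lane0 += a[i];
    return lane0 + lane1;
}
```

For the input {1e20, 1.0, -1e20, 1.0} the in-order sum is 1.0 and the two-lane sum is 2.0. Without -ffast-math (or a weaker flag that permits reassociation) the compiler has to preserve the in-order result, which is why it refuses to vectorize the reduction.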

Amiri · Accepted Answer · 2017-04-06T22:38:20.397

Your command does not contain an option that reports vectorization. You can use -fopt-info-vec to turn the vectorization report on. But do not rely on it alone: the compiler sometimes lies (it vectorizes a loop and reports it, but the vectorized version is not used), so check the actual improvement.

To do that, measure the speedup. First disable vectorization and measure the time t1. Then enable it and measure the time t2. The speedup is t1/t2: greater than 1 means the compiler improved the code, equal to 1 means no improvement, and less than 1 means the auto-vectorizer made things worse for you.

Alternatively, add -S to your command and inspect the generated assembly in the separate .s file.
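
The t1/t2 measurement could be sketched roughly as follows. The helpers `time_mmul` and `speedup` are hypothetical names, `clock()` is just one possible timer, and the kernel is the question's function with its obvious bugs fixed. The idea is to build the same file twice, for example with `-O3 -march=native` and with `-O3 -march=native -fno-tree-vectorize`, and compare the reported times.

```c
#include <time.h>

/* Kernel from the question: m3 (N x P) = m1 (N x M) * m2 (M x P). */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++) {
            double tmp = 0.0;
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp;
        }
}

/* Hypothetical helper: CPU seconds spent in `reps` calls of mmul.
 * Use enough repetitions that the total is well above timer resolution. */
double time_mmul(double **m1, double **m2, double **m3,
                 int N, int M, int P, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mmul(m1, m2, m3, N, M, P);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* t_scalar comes from the build without vectorization, t_vector from
 * the build with it; a result above 1 means vectorization helped. */
double speedup(double t_scalar, double t_vector)
{
    return t_scalar / t_vector;
}
```

Since the two times come from two separate binaries, it is worth running each several times and also confirming in the -S output that the timed loop was not simplified away.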

NOTE: if you want to see the full power of auto-vectorization, add -march=native and remove -msse2.

UPDATE: When you use a variable such as N, M, etc. as the loop bound, you might not see vectorization, so try constants instead. In my experience, matrix-matrix multiplication is vectorizable with gcc 4.8, 5.4 and 6.2. Other compilers such as clang-LLVM, ICC and MSVC vectorize it as well. As mentioned in the comments, if you use double or float data types you might need -ffast-math, which is enabled by the -Ofast optimization level, to tell the compiler that you don't need a strictly accurate result (that's OK most of the time). This is because compilers are more careful about floating-point operations: reordering them, as vectorizing a reduction does, can change the rounding of the result.
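
As a sketch of the "constants instead of variables" suggestion, the kernel could be written with compile-time dimensions and real 2-D arrays. The sizes and the names `MM_N`, `MM_M`, `MM_P` and `mmul_fixed` are illustrative assumptions, not taken from the question, and whether a given GCC version vectorizes this still depends on the flags, so check with -fopt-info-vec.

```c
/* Illustrative compile-time sizes (assumed, not from the question). */
enum { MM_N = 64, MM_M = 64, MM_P = 64 };

/* m3 (MM_N x MM_P) = m1 (MM_N x MM_M) * m2 (MM_M x MM_P).
 * All trip counts are constants, and each matrix is one contiguous
 * block instead of an array of row pointers. */
void mmul_fixed(double m1[MM_N][MM_M], double m2[MM_M][MM_P],
                double m3[MM_N][MM_P])
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_P; j++) {
            double tmp = 0.0;

            /* Floating-point reduction: reordering these additions
             * is only permitted under -ffast-math or similar. */
            for (int k = 0; k < MM_M; k++)
                tmp += m1[i][k] * m2[k][j];

            m3[i][j] = tmp;
        }
}
```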

@WorkofArtiz and Martin: `gcc -O3` enables `-ftree-vectorize` and some other things. Note that `-Ofast` is `-O3 -ffast-math`, not `-O2`. Anyway, I'd recommend comparing `-O3 -ffast-math -march=native` vs. `-O3 -ffast-math -march=native -fno-tree-vectorize`. — Peter Cordes, Aug 08 '17 at 00:45
Good point about constant vs. variable loop counts. gcc and clang usually can vectorize as long as the loop iteration count is known before the first iteration. (But they can never vectorize search-loops like `while(a[i++] != 2){}`). You're right that they usually do better with compile-time constant loop counts, though. — Peter Cordes, Aug 08 '17 at 00:48
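
The distinction drawn in the comment above can be illustrated with two small loops. The function names `scale_all` and `find_first` are hypothetical; the first loop has an iteration count that is known before it starts, while the second exits on a data-dependent condition, which is the search-loop shape the comment describes as not vectorizable by gcc and clang.

```c
#include <stddef.h>

/* Countable loop: the trip count n is known before the first
 * iteration, even though it is not a compile-time constant. */
void scale_all(double *a, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}

/* Search loop: the exit depends on the data itself, so the number of
 * iterations is unknown up front. The caller must guarantee that
 * `value` occurs in the array, otherwise this reads out of bounds. */
size_t find_first(const int *a, int value)
{
    size_t i = 0;
    while (a[i] != value)
        i++;
    return i;
}
```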
