
I have auto-vectorization enabled. When I compile the code, I receive the following warning:

info C5002: loop not vectorized due to reason '1203'

MSDN specifies that reason 1203 means:

Loop body includes non-contiguous accesses into an array.

I've looked into these links, 1, 2, for help, but have had no luck.

Here is my source code:

for (int row = 0; row < size; ++row) {
    for (int col = 0; col < size; ++col) {
        float tmp = 0;
        for (int i = 0; i < size; ++i) { // This loop generates the warning above
            tmp += matrixA[row][i] * matrixB[i][col];
        }
        matrixResult[row][col] = tmp;
    }
}

Any help is welcome.

  • C++ 2D arrays are arranged in memory as a 1D array: row1, row2, etc. This expression `matrixB[i][col]` causes the indexing to jump around in the array. This expression `matrixA[row][i]` does not. – Richard Critten Apr 20 '17 at 07:29
  • How are your matrices defined/allocated? If they're `double **`, then you'll also get poor performance due to cache-locality issues. – NoseKnowsAll Apr 20 '17 at 07:32
  • Transpose B first (and swap the indices), so that you get contiguous access. – Paul R Apr 20 '17 at 07:50
  • Just to clarify: there were some answers (and the MSFT compiler message kinda implies the same) that it is IMPOSSIBLE to vectorize (with a compiler) the given loop in case of a different access order and due to the non-contiguous (non-unit) stride. This is basically WRONG. It is possible to vectorize the code as is, however (a) on many platforms it could be non-profitable (a slow-down is possible), in particular without the introduction of effective gather instructions, (b) some vectorizers in some compilers are possibly not capable of vectorizing such code, but gcc/icc should be able to explicitly vectorize it with omp4.x – zam Apr 20 '17 at 16:35
  • But my comment above is naturally just for the sake of completeness and doesn't change the key message, which is: it's inefficient to keep memory access in this code as is. – zam Apr 20 '17 at 16:37

4 Answers


2D arrays are stored as a single contiguous block of memory, so a 3x2-element 2D array is actually 6 elements laid out end to end.

The [] indexing operators simply calculate which element to access.

So what's happening here is that matrixA is being accessed from element 1 through to element 6 sequentially (i.e. A1, A2, A3, B1, B2, B3).

matrixB, however, is being accessed 'randomly': A1, B1, A2, B2, etc., which maps onto the actual storage as accessing elements 1, then 4, then 2, then 5.

You can't change the order in which you access the elements of matrixB, but you could transpose it so that the elements are in the right order to be accessed sequentially. Obviously, if you only do this multiplication once, it might not be worth the effort to re-order matrixB's storage, but if you perform this calculation repeatedly, then the effort will be very much worth it.
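A minimal sketch of the transpose approach described above, assuming square std::vector matrices (the names Matrix, transpose, and multiplyTransposed are illustrative, not from the answer). After transposing B once, the innermost loop walks both operands along contiguous rows:

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Build B^T once so that column accesses become row accesses.
Matrix transpose(const Matrix& m) {
    int size = static_cast<int>(m.size());
    Matrix t(size, std::vector<float>(size));
    for (int r = 0; r < size; ++r)
        for (int c = 0; c < size; ++c)
            t[c][r] = m[r][c];
    return t;
}

// Multiply A by B, where bT is B already transposed: the inner loop
// now reads both a[row] and bT[col] sequentially in memory.
Matrix multiplyTransposed(const Matrix& a, const Matrix& bT) {
    int size = static_cast<int>(a.size());
    Matrix result(size, std::vector<float>(size));
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            float tmp = 0;
            for (int i = 0; i < size; ++i)
                tmp += a[row][i] * bT[col][i]; // both operands contiguous
            result[row][col] = tmp;
        }
    }
    return result;
}
```

The O(n^2) transpose is amortized over the O(n^3) multiply, so it pays off even for a single multiplication of reasonably large matrices.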

gbjbaanb
  • matmul is O(n^3) work over O(n^2) data, so it's not obvious that it isn't worth transposing, which costs O(n^2) time. But actually you don't need to; one trick is to rearrange the looping so you don't do all the work for each row x column dot product, e.g. loop over contiguous elements in the output while you loop over contiguous elements of the input. See the appendix in https://www.akkadia.org/drepper/cpumemory.pdf for an SSE2 vectorized version. (Also [How much of ‘What Every Programmer Should Know About Memory’ is still valid?](//stackoverflow.com/a/47714514)). But it's not ideal either – Peter Cordes Apr 02 '23 at 05:22

If matrices A and B have the same storage order (e.g. row-major), then the compiler cannot vectorize this loop as written. So that makes the warning plausible.

Just some advice here: if you want serious high-performance computing, then you should give up on 2D arrays. The gain from better caching is far bigger than the vectorization speed-up.
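A sketch of what giving up on 2D arrays can look like: one contiguous allocation indexed as row * size + col (the FlatMatrix name and helpers are illustrative, not part of the answer):

```cpp
#include <cassert>
#include <vector>

// Illustrative flat storage: size*size floats in a single contiguous block,
// row-major, indexed manually. One allocation, good cache locality.
struct FlatMatrix {
    int size;
    std::vector<float> data;
    explicit FlatMatrix(int n) : size(n), data(n * n, 0.0f) {}
    float& at(int row, int col) { return data[row * size + col]; }
    float at(int row, int col) const { return data[row * size + col]; }
};

// Multiply with the cache-friendly loop order (row, i, col): the inner loop
// walks both b and result sequentially through memory.
FlatMatrix multiply(const FlatMatrix& a, const FlatMatrix& b) {
    FlatMatrix result(a.size); // starts zeroed; we accumulate into it
    for (int row = 0; row < a.size; ++row)
        for (int i = 0; i < a.size; ++i) {
            float a_row_i = a.at(row, i);
            for (int col = 0; col < a.size; ++col)
                result.at(row, col) += a_row_i * b.at(i, col);
        }
    return result;
}
```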

Armen Avetisyan
  • A compiler won't *auto*-vectorize for you, but you can manually vectorize. See for example the appendix of https://www.akkadia.org/drepper/cpumemory.pdf (and [How much of ‘What Every Programmer Should Know About Memory’ is still valid?](https://stackoverflow.com/a/47714514)) – Peter Cordes Apr 02 '23 at 03:56

One way to achieve contiguous access is to swap the two inner loops: instead of for row, for col, for i you use for row, for i, for col. See the resulting code below. Now both matrixResult and matrixB are accessed along col, so the access is contiguous. Note that matrixResult must be zero-initialized before the loops, since it now accumulates across iterations of i.

for (int row = 0; row < size; ++row) {
    for (int i = 0; i < size; ++i) {
        float a_row_i = matrixA[row][i]; // float, not int, to match the matrices
        for (int col = 0; col < size; ++col) {
            matrixResult[row][col] += a_row_i * matrixB[i][col];
        }
    }
}
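For completeness, a self-contained sketch of this loop order, assuming square std::vector matrices (the names Matrix and multiplySwapped are illustrative) and showing the required zero-initialization of the result:

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Swapped loop order (row, i, col): the inner loop accesses both matrixB
// and matrixResult contiguously along col.
Matrix multiplySwapped(const Matrix& matrixA, const Matrix& matrixB) {
    int size = static_cast<int>(matrixA.size());
    // Must start at zero: each iteration of i adds a partial product.
    Matrix matrixResult(size, std::vector<float>(size, 0.0f));
    for (int row = 0; row < size; ++row) {
        for (int i = 0; i < size; ++i) {
            float a_row_i = matrixA[row][i];
            for (int col = 0; col < size; ++col) {
                matrixResult[row][col] += a_row_i * matrixB[i][col];
            }
        }
    }
    return matrixResult;
}
```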
Marius D

This worked for me:

#define N 1000

void example(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            int res = 0;
            #pragma clang loop vectorize(enable) vectorize_width(8)
            for (int k=0; k<N; k++) {
                res += A[i][k] * B[j][k];
            }
            C[i][j] = res;
        }
    }
}

See https://godbolt.org/z/K4foqd37T

UPDATE (thanks to @Peter Cordes): this works if one input matrix is already transposed, making the inner-most loop a dot product over two contiguous rows.

mostafa.elhoushi
  • That works if one input matrix is already transposed, making the inner-most loop just a dot product over two contiguous vectors. You might want to cache-block some, but that's a reasonable start. – Peter Cordes Apr 02 '23 at 02:57