I'm benchmarking several forms of matrix multiplication at different optimization levels (for teaching purposes) and I ran into strange behavior in gcc's autovectorization: it fails to vectorize when the arrays are function parameters (see mxmp) but does vectorize when the arrays are global variables (see mxmg).
gcc version: 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1), but the behavior was the same with older gcc versions.
Compile options: gcc -O3 -mavx2 -mfma
    #define N 1024

    float A[N][N], B[N][N], C[N][N];

    /* Arrays passed as parameters (these names shadow the globals). */
    void mxmp(float A[N][N], float B[N][N], float C[N][N]) {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }

    /* Same loop nest, operating directly on the global arrays. */
    void mxmg(void) {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }

    int main(void) {
        mxmg();
        mxmp(A, B, C);
        return 0;
    }
I expected the compiler to generate equivalent code for both functions; however, mxmp takes about 10 times as long to run as mxmg. Looking at the assembly, gcc autovectorizes mxmg (arrays as global variables) but fails to vectorize mxmp (arrays as parameters).
I tried the same with the kij loop order, and there gcc vectorizes both functions.
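For reference, this is roughly what I mean by the kij form of the parameter version. It is a sketch rather than my exact benchmark code: the name `mxmp_kij` and the `#ifndef` guard around `N` (so the size can be overridden with `-DN=...`) are added here for illustration.

```c
#ifndef N
#define N 1024   /* same size as above; the guard is an illustrative addition */
#endif

/* kij loop order: the innermost loop runs over j, so C[i][j] and B[k][j]
   are both accessed with unit stride and A[i][k] does not change inside it. */
void mxmp_kij(float A[N][N], float B[N][N], float C[N][N]) {
    int i, j, k;
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```

It computes the same product as the ijk form; only the order in which the partial sums are accumulated into C differs.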
I need help understanding why gcc behaves this way, and how to help gcc (pragmas, compile options, attributes, ...) vectorize the mxmp function properly. Thanks.
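One candidate explanation is pointer aliasing: inside mxmp the compiler cannot assume that C is distinct from A and B, whereas for the globals it knows they are separate objects. The following is a minimal sketch under that assumption, using the C99 `restrict` qualifier. The name `mxmp_restrict` is hypothetical, and I have not confirmed that this alone is enough for gcc 7.4 to vectorize the ijk form.

```c
#ifndef N
#define N 1024   /* same size as above; the guard is an illustrative addition */
#endif

/* Same ijk loop nest as mxmp, but each parameter is declared restrict:
   a promise to the compiler that the three arrays do not overlap. */
void mxmp_restrict(float (*restrict A)[N],
                   float (*restrict B)[N],
                   float (*restrict C)[N]) {
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```

It is called exactly like mxmp, e.g. `mxmp_restrict(A, B, C)`. Adding `-fopt-info-vec` or `-fopt-info-vec-missed` to the compile options should show whether gcc vectorizes the loop or why it gives up. Note that `restrict` is a promise made by the caller, so passing overlapping arrays would be undefined behavior.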