I decided to play a little bit with AVX. For this reason I wrote a simple matrix multiplication "benchmark code" and started applying some optimizations to it - just to see how fast I can make it. Below is my naive implementation, followed by the simplest AVX one I could think of:
void mmult_naive()
{
    int i, j, k = 0;

    // Traverse through each row of matrix A
    for (i = 0; i < SIZE; i++) {
        // Traverse through each column of matrix B
        for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++) {
                matrix_C[i][j] += matrix_A[i][k] * matrix_B[k][j];
            }
        }
    }
}
AVX:
void mmult_avx_transposed()
{
    __m256 row_vector_A;
    __m256 row_vector_B;
    int i, j, k = 0;
    __m256 int_prod;

    // Transpose matrix B
    transposeMatrix();

    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            int_prod = _mm256_setzero_ps();
            for (k = 0; k < (SIZE / 8); k++) {
                row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
                row_vector_B = _mm256_load_ps(&T_matrix[j][k * 8]);
                int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
            }
            matrix_C[i][j] = hsum_single_avx(int_prod);
        }
    }
}
I chose to transpose the second matrix to make it easier to load the values from memory into the vector registers. This part works fine, gives all the nice expected speed-up and makes me happy.
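transposeMatrix() and hsum_single_avx() are omitted above for brevity. A minimal sketch of such helpers (not necessarily my exact code; it assumes float matrices, the same globals as above, and a SIZE that is a multiple of 8) would be:

#include <immintrin.h>

// Sketch only: plausible implementations of the two helpers referenced above.
extern float matrix_B[SIZE][SIZE];
extern float T_matrix[SIZE][SIZE];

// Naive out-of-place transpose of matrix_B into T_matrix.
void transposeMatrix()
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            T_matrix[i][j] = matrix_B[j][i];
}

// Horizontal sum of the eight floats in an AVX register.
float hsum_single_avx(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);      // lower 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);    // upper 4 floats
    lo = _mm_add_ps(lo, hi);                    // 4 pairwise sums
    lo = _mm_hadd_ps(lo, lo);                   // 2 partial sums
    lo = _mm_hadd_ps(lo, lo);                   // total in element 0
    return _mm_cvtss_f32(lo);
}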
While measuring the execution time for some larger matrix sizes (NxN matrices, N > 1024), I started to think the transpose might not be necessary if I found a "smarter" way to access the elements. The transpose itself was roughly 4-5% of the execution time, so it looked like low-hanging fruit.
I replaced the second _mm256_load_ps with the following line and got rid of the transposeMatrix() call:
row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j], matrix_B[(k * 8) + 1][j], matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j], matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);
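For reference, the full indexed variant (the one labelled MMULT_AVX_INDEXED below) is just the transposed version with that one line swapped in and the transpose step dropped:

void mmult_avx_indexed()
{
    __m256 row_vector_A;
    __m256 row_vector_B;
    int i, j, k = 0;
    __m256 int_prod;

    // No transpose any more: matrix_B is read column-wise directly.
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            int_prod = _mm256_setzero_ps();
            for (k = 0; k < (SIZE / 8); k++) {
                row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
                row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j],       matrix_B[(k * 8) + 1][j],
                                              matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
                                              matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j],
                                              matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);
                int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
            }
            matrix_C[i][j] = hsum_single_avx(int_prod);
        }
    }
}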
But now the code runs even worse! The results I got are the following:
MMULT_NAIVE execution time: 195,499 us
MMULT_AVX_TRANSPOSED execution time: 127,802 us
MMULT_AVX_INDEXED execution time: 1,482,524 us
I wanted to see if I had better luck with clang but it only made things worse:
MMULT_NAIVE execution time: 2,027,125 us
MMULT_AVX_TRANSPOSED execution time: 125,781 us
MMULT_AVX_INDEXED execution time: 1,798,410 us
My questions are really two: why does the indexed version run slower, and what is going on with clang? Apparently even the "slow" naive version is much slower there.
Everything was compiled with -O3, -mavx2 and -march=native on an i7-8700 running Arch Linux, with g++ 12.1.0 and clang 14.0.6.
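The invocations were essentially of this form (mmult.cpp is a placeholder for the actual source file name):

g++     -O3 -mavx2 -march=native mmult.cpp -o mmult_gcc
clang++ -O3 -mavx2 -march=native mmult.cpp -o mmult_clang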