
I decided to play a little bit with AVX. For this reason I wrote a simple matrix multiplication "benchmark code" and started applying some optimizations to it - just to see how fast I can make it. Below is my naive implementation, followed by the simplest AVX one I could think of:

void mmult_naive()
{
    int i, j, k = 0;
    // Traverse through each row element of Matrix A
    for (i = 0; i < SIZE; i++) {
        // Traverse through each column element of Matrix B
        for (j = 0; j < SIZE; j++) {

            for (k = 0; k < SIZE; k++) {
                matrix_C[i][j] += matrix_A[i][k] * matrix_B[k][j];
            }
        }
    }
}

AVX:

void mmult_avx_transposed()
{
    __m256 row_vector_A;
    __m256 row_vector_B;
    int i, j, k = 0;

    __m256 int_prod;

    // Transpose matrix B
    transposeMatrix();

    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            int_prod = _mm256_setzero_ps();
            for (k = 0; k < (SIZE / 8); k++) {
                row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
                row_vector_B = _mm256_load_ps(&T_matrix[j][k * 8]);
                int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
            }
            matrix_C[i][j] = hsum_single_avx(int_prod);
        }
    }
}
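
hsum_single_avx is not listed above; roughly, it just horizontally sums the 8 floats of a __m256 into a scalar, along the lines of this sketch (the exact helper used in the benchmark may differ):

// Sketch only (needs <immintrin.h>, as the code above already does):
// reduce the 8 floats of a __m256 to a single scalar.
static inline float hsum_single_avx(__m256 v)
{
    __m128 lo   = _mm256_castps256_ps128(v);    // lower 4 floats
    __m128 hi   = _mm256_extractf128_ps(v, 1);  // upper 4 floats
    lo          = _mm_add_ps(lo, hi);           // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(lo);          // duplicate odd-index elements
    __m128 sums = _mm_add_ps(lo, shuf);         // 2 partial sums
    shuf        = _mm_movehl_ps(shuf, sums);    // move the high pair down
    sums        = _mm_add_ss(sums, shuf);       // final sum in element 0
    return _mm_cvtss_f32(sums);
}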

I chose to transpose the second matrix to make it easier to load the values from memory into the vector registers. This part works fine, gives all the expected speed-up and makes me happy.
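
transposeMatrix() is likewise omitted above; assuming T_matrix is another global SIZE x SIZE float array like the others, it presumably does something like:

// Sketch: copy matrix_B transposed into the global T_matrix.
void transposeMatrix(void)
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            T_matrix[i][j] = matrix_B[j][i];
}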

While measuring the execution time for some larger matrix sizes (NxN matrices, N > 1024) I thought the transpose might not be necessary if I found a "smarter" way to access the elements. The transpose itself was roughly 4-5% of the execution time, so it looked like low-hanging fruit.

I replaced the second _mm256_load_ps with the following line and got rid of the transposeMatrix() call:

row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j], matrix_B[(k * 8) + 1][j], matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
                    matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j], matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);

But now the code runs even slower than the naive version! These are the results I got with g++:

MMULT_NAIVE execution time: 195,499 us
MMULT_AVX_TRANSPOSED execution time: 127,802 us
MMULT_AVX_INDEXED execution time: 1,482,524 us

I wanted to see if I had better luck with clang but it only made things worse:

MMULT_NAIVE execution time: 2,027,125 us
MMULT_AVX_TRANSPOSED execution time: 125,781 us
MMULT_AVX_INDEXED execution time: 1,798,410 us

My questions are in fact two: why does the indexed version run slower, and what is going on with clang? Apparently even the "slow" naive version is much slower there.

Everything was compiled with -O3, -mavx2 and -march=native on an i7-8700 running Arch Linux, with g++ 12.1.0 and clang 14.0.6.

  • Without the transpose, you're going back to a disastrous cache-access pattern. Doing a manual gather can't avoid that. `_mm256_setr_ps` is not fast, either; it has to load and shuffle together all 8 floats. BTW, see [How much of ‘What Every Programmer Should Know About Memory’ is still valid?](https://stackoverflow.com/q/8126311) for cache-blocked matmul that avoids transpose by accumulating in a different order. – Peter Cordes Jul 11 '22 at 00:52
  • (It also avoids the long `_mm256_fmadd_ps` dep chain you're creating through one `int_prod`, since you didn't unroll with multiple accumulators. See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) - your TRANSPOSED version could be working on 2 to 8 `j` values at once, reusing one `i` load for that many `j` vectors gaining *computational intensity* (more than 1 FMA per 2 loads; 2:2 ratio is ideal), as well as overlapping the FMA latencies to keep the pipeline full. – Peter Cordes Jul 11 '22 at 00:57
  • IDK why there'd be such a big difference in the naive versions between GCC and clang. You'd have to look at the asm to see if clang did something dumb, or GCC did something great. And be suspicious about it optimizing away some of the work if it was in a repeat loop. – Peter Cordes Jul 11 '22 at 00:59
  • I know it probably doesn't answer your question. But I think you should read [Optimized matrix multiplication in C](https://stackoverflow.com/q/1907557/13419694) and [How does BLAS get such extreme performance?](https://stackoverflow.com/q/1303182/13419694) – Mateo Jul 11 '22 at 02:19
  • Note that for large matrixes there are more efficient algorithms using divide&conquer methods bringing the runtime down from `O(n^3)` to about `O(n^2.3)`. At `N>1024` that gains you far more than any AVX optimizing can gain you. Splitting the problem into smaller matrixes improves caches too. – Goswin von Brederlow Jul 11 '22 at 02:19
  • clang might not see that the matrixes don't alias and do a load from memory and store to memory on every loop. You should pull `matrix_C[i][j]` out of the loop like you did with AVX. – Goswin von Brederlow Jul 11 '22 at 02:22
  • @PeterCordes Thank you for your time and your suggestions. I was suspecting some cache access issues but now I think I'll have to dive more into it. I'm looking currently at the assembly produced for the naive versions - in case of having any updates I'll let you know. – Arkoudinos Jul 11 '22 at 18:32
  • @Mateo I think you're right :) – Arkoudinos Jul 11 '22 at 18:38
  • @GoswinvonBrederlow I tried your suggestion really quickly on the naive implementation and recompiled with clang. Still terrible, but a little better than before: `MMULT_NAIVE execution time: 1467764us` – Arkoudinos Jul 11 '22 at 18:41
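
A rough sketch of the direction Peter Cordes's first comment points at: accumulate in a different loop order (i-k-j), broadcasting one element of matrix_A and streaming across a row of matrix_B and matrix_C, so no transpose or gather is needed. This is only a sketch under the same globals as above, not the cache-blocked code from the linked answer; it assumes matrix_C is zeroed beforehand, SIZE is a multiple of 8, and the arrays are 32-byte aligned (as the aligned loads above already imply).

// Sketch of the i-k-j ("broadcast") formulation; matrix_C must be zeroed first.
void mmult_avx_ikj()
{
    for (int i = 0; i < SIZE; i++) {
        for (int k = 0; k < SIZE; k++) {
            // Broadcast one element of A, then sweep a whole row of B and C.
            __m256 a = _mm256_set1_ps(matrix_A[i][k]);
            for (int j = 0; j < SIZE; j += 8) {
                __m256 c = _mm256_load_ps(&matrix_C[i][j]);
                __m256 b = _mm256_load_ps(&matrix_B[k][j]);
                _mm256_store_ps(&matrix_C[i][j], _mm256_fmadd_ps(a, b, c));
            }
        }
    }
}

Each j iteration writes a different part of matrix_C, so there is no loop-carried FMA dependency chain, and both matrix_B and matrix_C are walked row-wise; cache blocking and multiple accumulators, as described in the links above, would go on top of this.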
