
I want to speed up matrix multiplication in C. I have tried several methods to optimize it, such as OpenMP, SIMD, and cache-friendly loop ordering, and the speedup has reached 65x.

The reason the speedup is still relatively low is that I use `_mm256_storeu_pd()` in the innermost loop, and memory writes are expensive.

Can anyone give me some ideas on how to avoid these expensive memory writes in the innermost loop so that I can optimize the code further?

#include <immintrin.h>

typedef struct matrix{
    int rows;
    int cols;
    double *data;
} matrix;

// result is assumed to be zero-initialized; the kernel accumulates into it
void mul_matrix(matrix *result, matrix *mat1, matrix *mat2){
    int I = mat1->rows;
    int J = mat2->cols;
    int K = mat2->rows;
    #pragma omp parallel for
    for(int i = 0; i < I; i++){
        for(int k = 0; k < K; k++){
            __m256d vA = _mm256_set1_pd(mat1->data[i * K + k]); // broadcast A[i][k]
            for(int j = 0; j < J / 4 * 4; j += 4){
                __m256d sum = _mm256_loadu_pd(result->data + i * J + j);
                __m256d vB = _mm256_loadu_pd(mat2->data + k * J + j);
                __m256d intermediate = _mm256_mul_pd(vA, vB);
                sum = _mm256_add_pd(sum, intermediate);
                _mm256_storeu_pd(result->data + i * J + j, sum); // the store in question
            }
            for(int x = J / 4 * 4; x < J; x++){ // scalar tail for the last J % 4 columns
                result->data[i * J + x] += mat1->data[i * K + k] * mat2->data[k * J + x];
            }
        }
    }
}
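
One direction I can think of is to swap the two inner loops and keep each 4-wide tile of the result in a register for the whole k loop, so the store happens once per output tile instead of once per k step. A minimal, untested sketch (the function name is just for illustration; note the mat2 loads become strided across rows, which is exactly the cache-miss concern raised in the comments below):

// Hypothetical variant: each 4-wide tile of result lives in a register
// across the whole k loop, so there is one store per tile rather than one
// per k step. Unlike the version above, this writes the finished dot
// product, so result need not be pre-zeroed. The price: vB now strides
// through mat2 a row at a time (J doubles per step).
void mul_matrix_regacc(matrix *result, matrix *mat1, matrix *mat2){
    int I = mat1->rows;
    int J = mat2->cols;
    int K = mat2->rows;
    #pragma omp parallel for
    for(int i = 0; i < I; i++){
        for(int j = 0; j < J / 4 * 4; j += 4){
            __m256d sum = _mm256_setzero_pd(); // accumulator stays in a register
            for(int k = 0; k < K; k++){
                __m256d vA = _mm256_set1_pd(mat1->data[i * K + k]);
                __m256d vB = _mm256_loadu_pd(mat2->data + k * J + j); // strided access
                sum = _mm256_add_pd(sum, _mm256_mul_pd(vA, vB));
            }
            _mm256_storeu_pd(result->data + i * J + j, sum); // single store per tile
        }
        for(int x = J / 4 * 4; x < J; x++){ // scalar tail
            double s = 0.0;
            for(int k = 0; k < K; k++)
                s += mat1->data[i * K + k] * mat2->data[k * J + x];
            result->data[i * J + x] = s;
        }
    }
}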
  • Maybe swapping the inner two for loops? – yao99 Aug 07 '20 at 01:23
  • What size of matrix are you optimizing for? It doesn't look like this is cache-blocked, so yes it's expensive for a large matrix *if* you don't soon reuse that data while it's still hot in L1d cache. See [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/47714514) for an example of a cache-blocked SIMD matmul (but using the same add-to-destination strategy that leads to 2 loads and 1 store per FMA, or per mul+add if you don't have FMA). – Peter Cordes Aug 07 '20 at 01:23
  • @DanielWalker: the SIMD tag is not "extraneous" here! Needing to load and store contiguous blocks of data is a key part of what constrains the optimization choices when cache-blocking. – Peter Cordes Aug 07 '20 at 01:26
  • Ah! My mistake! – Daniel Walker Aug 07 '20 at 01:26
  • @Charmaine: 65x vs. what? A naive scalar version that does one row x column dot product at a time, bottlenecking on cache-misses and/or FP add latency? Both versions compiled with `gcc -O3 -march=native`, I hope? – Peter Cordes Aug 07 '20 at 01:32
  • @yao99: swapping the two inner for loops will definitely slow down the performance. In the ijk implementation, less of the recently touched data in the double array gets reused, so there will be more cache misses – errorcodemonkey Aug 07 '20 at 01:33
  • @Charmaine: 65x vs. naive matrix multiplication – errorcodemonkey Aug 07 '20 at 01:35
  • @hck007 This problem can be solved by transforming `mat2`. – yao99 Aug 07 '20 at 01:44
  • @yao99: Do you mean take the transpose of mat2? – errorcodemonkey Aug 07 '20 at 01:50
  • @hck007 Yes `:)`. Sorry for my poor English. – yao99 Aug 07 '20 at 01:55
  • If anyone is interested in simple step by step matrix multiplication optimization there is a good video from MIT: https://www.youtube.com/watch?v=o7h_sYMk_oc – Jakub Biały Aug 07 '20 at 08:31
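
To make the cache blocking Peter Cordes mentions concrete, here is a minimal sketch that tiles all three loops so each block of result is reused while it is still hot in L1d, keeping the same add-to-destination strategy as the original kernel. The tile sizes and function name are placeholder assumptions to tune per machine, result is assumed zero-initialized as before, and tail handling for a J that is not a multiple of 4 is omitted for brevity:

#define BI 32 // tile sizes are guesses; tune so the working set fits in L1d/L2
#define BJ 64
#define BK 32

void mul_matrix_blocked(matrix *result, matrix *mat1, matrix *mat2){
    int I = mat1->rows, J = mat2->cols, K = mat2->rows;
    #pragma omp parallel for
    for(int ii = 0; ii < I; ii += BI)
        for(int kk = 0; kk < K; kk += BK)
            for(int jj = 0; jj < J; jj += BJ)
                // same i-k-j micro-kernel as the question, but confined to a
                // tile: the result and mat2 lines stay hot across BK k-steps
                for(int i = ii; i < ii + BI && i < I; i++)
                    for(int k = kk; k < kk + BK && k < K; k++){
                        __m256d vA = _mm256_set1_pd(mat1->data[i * K + k]);
                        for(int j = jj; j < jj + BJ && j + 4 <= J; j += 4){
                            __m256d sum = _mm256_loadu_pd(result->data + i * J + j);
                            __m256d vB = _mm256_loadu_pd(mat2->data + k * J + j);
                            _mm256_storeu_pd(result->data + i * J + j,
                                             _mm256_add_pd(sum, _mm256_mul_pd(vA, vB)));
                        }
                    }
}

The store is still in the innermost loop, but it now hits a line that was just loaded and stays in L1d, which is the point of the linked answer.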
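
And a sketch of yao99's transpose suggestion: with mat2 transposed once into a scratch buffer (mat2T, assumed here to be caller-allocated with J * K doubles), the i-j-k order reads both operands contiguously and each output element is accumulated in a register, with one scalar store at the end:

// Hypothetical i-j-k version over a transposed copy of mat2. The one-time
// O(K*J) transpose is amortized over the O(I*J*K) multiply.
void mul_matrix_transposed(matrix *result, matrix *mat1, matrix *mat2,
                           double *mat2T){
    int I = mat1->rows, J = mat2->cols, K = mat2->rows;
    for(int k = 0; k < K; k++)          // one-time transpose: mat2T is J x K
        for(int j = 0; j < J; j++)
            mat2T[j * K + k] = mat2->data[k * J + j];
    #pragma omp parallel for
    for(int i = 0; i < I; i++)
        for(int j = 0; j < J; j++){
            __m256d acc = _mm256_setzero_pd();
            int k = 0;
            for(; k + 4 <= K; k += 4){  // both loads are contiguous
                __m256d vA = _mm256_loadu_pd(mat1->data + i * K + k);
                __m256d vB = _mm256_loadu_pd(mat2T + j * K + k);
                acc = _mm256_add_pd(acc, _mm256_mul_pd(vA, vB));
            }
            // horizontal sum of the four lanes
            __m128d lo = _mm256_castpd256_pd128(acc);
            __m128d hi = _mm256_extractf128_pd(acc, 1);
            lo = _mm_add_pd(lo, hi);
            double s = _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
            for(; k < K; k++)           // scalar tail
                s += mat1->data[i * K + k] * mat2T[j * K + k];
            result->data[i * J + j] = s; // one scalar store per element
        }
}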
