
I'm trying to optimize 2D matrix addition in C using SIMD instructions (_mm256_add_pd, store, load, etc.). However, I'm not seeing a large speedup at all. Using some timing code, I'm seeing speedup in the range of 0.8x-1.5x relative to the naive solution. Is this typical? I was thinking it could potentially be a memory bottleneck, as there is very little computation per element in this case. I believed this should give me around a 4x boost in speed, since I'm doing 4 additions at once, so I'm not totally sure what the bottleneck actually is.

I made some code to demonstrate what I'm doing (testing parallel + SIMD vs just SIMD):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <string.h>

#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#include <x86intrin.h>
#endif

//SIMD-only version ("naive" here meaning no OpenMP threading)
void add_matrix_naive(double **result, double **mat1, double **mat2, int rows, int cols) {
    int simdCols = cols / 4 * 4;
    if(simdCols > 0){
        for(int i = 0; i < rows; i++){
            for(int j = 0; j < simdCols; j += 4){
                _mm256_storeu_pd(result[i] + j, _mm256_add_pd(
                    _mm256_loadu_pd(mat1[i] + j),
                    _mm256_loadu_pd(mat2[i] + j)));
            }
        }
    }

    //Handle extra columns
    if(simdCols < cols){
        for(int i = 0; i < rows; i++){
            for(int j = simdCols; j < cols; j++){
                result[i][j] = mat1[i][j] + mat2[i][j];
            }
        }
    }
}

//Parallel + SIMD version
void add_matrix(double **result, double **mat1, double **mat2, int rows, int cols) {
    int simdCols = cols / 4 * 4;
    #pragma omp parallel if (rows*cols >= 2000)
    {
        if(simdCols > 0){
            #pragma omp for collapse(2)
            for(int i = 0; i < rows; i++){
                for(int j = 0; j < simdCols; j += 4){
                    _mm256_storeu_pd(result[i] + j, _mm256_add_pd(
                        _mm256_loadu_pd(mat1[i] + j),
                        _mm256_loadu_pd(mat2[i] + j)));
                }
            }
        }

        //Handle extra columns
        if(simdCols < cols){
            #pragma omp for collapse(2)
            for(int i = 0; i < rows; i++){
                for(int j = simdCols; j < cols; j++){
                    result[i][j] = mat1[i][j] + mat2[i][j];
                }
            }
        }
    }
}

int main() 
{ 
    omp_set_num_threads(8);
    //Allocate Matrices
    int rows = 200;
    int cols = 200;

    double **matrix_a = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    //Row pointers first, then the contiguous data block right after them.
    //Note: (matrix_a + rows) advances past the row pointers *before* the cast;
    //casting first and then adding only works when sizeof(double *) == sizeof(double).
    double *dataStart = (double *)(matrix_a + rows);
    for(int i = 0; i < rows; i++){
        matrix_a[i] = dataStart + i * cols;
        memset(matrix_a[i], 0, sizeof(double) * cols);
    }

    double **matrix_b = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    dataStart = (double *)(matrix_b + rows); //Offset row pointers
    for(int i = 0; i < rows; i++){
        matrix_b[i] = dataStart + i * cols;
        memset(matrix_b[i], 0, sizeof(double) * cols);
    }

    double **result = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    dataStart = (double *)(result + rows); //Offset row pointers
    for(int i = 0; i < rows; i++){
        result[i] = dataStart + i * cols;
        memset(result[i], 0, sizeof(double) * cols);
    }

    //Assign random values to matrices.
    for(int i = 0; i < rows; i++){
        for(int j = 0; j < cols; j++){
            matrix_a[i][j] = rand();
            matrix_b[i][j] = rand();
        }
    }

    
    int LOOP_COUNT = 4;

    double prevTime = omp_get_wtime();
    for(int i = 0; i < LOOP_COUNT; i++){
        add_matrix(result, matrix_a, matrix_b, rows, cols);
    }
    double endTime = omp_get_wtime();
    double firstTime = (endTime - prevTime)/LOOP_COUNT;
    printf("Took %f Seconds\n", firstTime);

    //Assign random values to matrices.
    for(int i = 0; i < rows; i++){
        for(int j = 0; j < cols; j++){
            matrix_a[i][j] = rand();
            matrix_b[i][j] = rand();
        }
    }

    prevTime = omp_get_wtime();
    for(int i = 0; i < LOOP_COUNT; i++){
        add_matrix_naive(result, matrix_a, matrix_b, rows, cols);
    }
    endTime = omp_get_wtime();
    double secondTime = (endTime - prevTime)/LOOP_COUNT;
    printf("Took %f Seconds\n", secondTime);
    printf("Naive Time: %f Faster\n", firstTime/secondTime);
}

Something I've noticed is that the result seems quite dependent on LOOP_COUNT. With a high loop count the parallel/SIMD version does quite well, but with lower loop counts the naive solution tends to do better.

AS425
  • It depends on your SIMD code, your original C code that you're comparing against, and your benchmarking code. Please _edit_ your question and post your code. Note that a good optimizer may already be using SIMD when it optimizes your base/C code. – Craig Estey Nov 21 '20 at 22:55
  • Unfortunately, this is for a school assignment and I'm not allowed to post my code online. I was just wondering if from a theoretical perspective, memory would be a bottleneck in this case or if my code could be further improved. I'm using omp_get_wtime() to benchmark, and adding two 500x500 matrices 200 times. I do this for each function and look at the difference. However, benchmarking this seems difficult as even when I do this with the same function, the difference is not always 1x, it seems to be pretty variable. – AS425 Nov 21 '20 at 23:03
  • If you compile your naïve C code with `-O3` on gcc/clang they will likely be able to vectorize that as well (take a look at the generated assembly code). – chtz Nov 21 '20 at 23:49
  • "I'm not allowed to post my code online" translates to "I have this problem with this thing" which means we probably can't help. We need more specifics. We need code that *we can use to reproduce the problem*. – tadman Nov 22 '20 at 01:44
  • 2 loads and a store per ALU operation is very low computational intensity, thus yes any decently-optimized array addition should bottleneck on memory bandwidth, or the front-end cost of doing those load and store operations. Even without auto-vectorization, multiple threads should allow you to saturate memory bandwidth, especially with `double` where each scalar element is 8 bytes wide. If the data is hot in L3 cache, that might be enough to *not* bottleneck on memory bandwidth without SIMD instructions; it depends a lot on your CPU. But obviously depends strongly on your compiler / options. – Peter Cordes Nov 22 '20 at 01:45
  • But without code or any description of details to talk about, this isn't a useful question to answer for the benefit of future readers. – Peter Cordes Nov 22 '20 at 01:46
  • @PeterCordes I went through and made some example code to show how I am testing (I'm not sure if I'm benchmarking correctly). Thank you for the help already as well! – AS425 Nov 22 '20 at 02:59
  • @tadman That makes sense, I added code to the post. – AS425 Nov 22 '20 at 03:00
  • Ugh, why are you using an array of pointers to arrays, instead of a single efficient 2D array? [A different way to malloc a 2D array?](https://stackoverflow.com/q/17389009). That's going to make it harder for compilers to prove or check that there's no aliasing (i.e. that no output rows point to the same storage as some input rows). – Peter Cordes Nov 22 '20 at 03:02
  • @PeterCordes I have to be able to make subslices of these matrices from the array. (This is for a python library). So for example matrix_a[4:5]. I believe what I'm doing still would have a contiguous block of memory for the data, similar to the post you sent? – AS425 Nov 22 '20 at 03:05
  • It's not particularly more complicated. A C 2D array is an array of arrays; you can offset into that to take a subset of the rows. Or just use flat 1D arrays with manual 2D indexing, so you can SIMD without caring about the ends of rows. e.g. a 10k x 3 matrix can do 1.333 rows per vector of 4 elements. You can do 2D slicing with this: you can keep track of the row stride (distance between consecutive rows) separately from actual width and height to process. (In image processing, this lets you easily refer to a rectangle of pixels somewhere in a larger image without having to copy them out.) – Peter Cordes Nov 22 '20 at 03:10
  • To do the same thing with arrays of pointers, you need to pass a start *and* end column to use for each row, as well as a starting row and number of rows. So that's 4 integer offsets instead of just 3 (with manual 1D indexing) to take a 2D sub-rectangle somewhere inside the image. If you want to do slicing, I'd seriously recommend doing your own `array[IDX2D(x,y, stride)]` where IDX2D is something like `((x) + (y)*(stride))` as a macro (see the sketch after this comment thread). – Peter Cordes Nov 22 '20 at 03:13
  • Anyway, you now have code, but you haven't included specific results, or the CPU / RAM you tested on, or the OS / compiler / version / options you used. Clearly optimization options can make a big difference, like auto-vectorization with `-O3 -mavx` or `-march=native`, since you have to enable AVX anyway to use the intrinsics with compilers like GCC and clang. Especially if it's GCC, where the default tuning (without `-march=native`) might split 256-bit loads/stores that aren't known to be aligned ([Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?](//stackoverflow.com/q/52626726)). – Peter Cordes Nov 22 '20 at 03:18
  • Also note, if your results are sensitive to repeat-count, you might be seeing warm-up effects. It looks like you touch all your memory first (with memset, if that doesn't get optimized into `calloc`), so you're probably not paying page-fault costs in the timed region. But see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for various pitfalls. – Peter Cordes Nov 22 '20 at 03:19
  • Wouldn't that prevent me from using _mm256_loadu_pd, since I'd need a macro for each individual element? (Or I'd need to somehow check if 4 elements are contiguous.) I don't fully understand the distinction between what I'm doing and the post you linked. Does defining arrays like that only allocate the data, with the compiler just remembering the offsets so that you can index with arr[x][y]? I'm currently using GCC with "-fopenmp -mavx -mfma". I did look for benchmarking libraries but I couldn't find anything simple out there for C (I'm still a beginner so I don't understand a ton). – AS425 Nov 22 '20 at 03:25
  • If you're still looking for some [simple] test and benchmarking suite, see my recent answer here: https://stackoverflow.com/questions/64917205/how-to-optimize-find-all-possible-triangles-without-duplicates-code/64922786#64922786 – Craig Estey Nov 22 '20 at 19:56
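
For reference, a minimal sketch of the flat-indexing scheme Peter Cordes describes above; the `IDX2D` macro is taken from his comment, while `add_block` and its parameters are illustrative, not from the question:

//Flat 1D storage with manual 2D indexing; `stride` is the row width of the
//underlying allocation, so a sub-rectangle needs no copying.
#define IDX2D(x, y, stride) ((x) + (y)*(stride))

void add_block(double *result, const double *mat1, const double *mat2,
               int rows, int cols, int stride) {
    for(int y = 0; y < rows; y++)
        for(int x = 0; x < cols; x++)
            result[IDX2D(x, y, stride)] =
                mat1[IDX2D(x, y, stride)] + mat2[IDX2D(x, y, stride)];
}

Passing mat + IDX2D(startX, startY, stride) as a base pointer with a smaller rows/cols then selects a 2D slice of the parent matrix.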

1 Answer


CPUs can compute things way faster than they can access memory.

Adding doubles is very fast. Your core is very likely bottlenecked on memory.
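
To put rough numbers on it: each 4-wide vector add reads two 32-byte vectors and writes one, i.e. 24 bytes of memory traffic per double added, for a single arithmetic operation. Assuming on the order of 30 GB/s of practical DRAM bandwidth (a typical desktop figure, not measured on your machine), that caps you at roughly 30e9 / 24 ≈ 1.25 billion additions per second from RAM, while a single core can often issue two such vector adds (8 additions) per cycle. Also note that at 200×200, each matrix is only 320 kB, so all three (~1 MB) likely stay in L2/L3 cache, which is part of why your timings are so sensitive to LOOP_COUNT.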

There's more.

omp_set_num_threads

Rarely a good idea. The default is usually good.

double **result, double **mat1, double **mat2

Double pointers mean double the RAM latency: every element access loads the row pointer first, then the data it points to. Use a single pointer. If you want to use slices of the matrix, just pass another parameter with the offset (stride) between rows of the matrix.
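
For illustration, a minimal sketch of that single-pointer layout; the name `add_matrix_flat` and the explicit `stride` parameter are mine, not from the question:

#include <stddef.h>
#include <immintrin.h>

//One flat allocation; `stride` is the distance in doubles between row starts,
//so a caller can pass a sub-rectangle of a larger matrix without copying.
void add_matrix_flat(double *result, const double *mat1, const double *mat2,
                     int rows, int cols, int stride) {
    int simdCols = cols / 4 * 4;
    for(int i = 0; i < rows; i++){
        const double *a = mat1 + (size_t)i * stride;
        const double *b = mat2 + (size_t)i * stride;
        double *r = result + (size_t)i * stride;
        for(int j = 0; j < simdCols; j += 4)
            _mm256_storeu_pd(r + j, _mm256_add_pd(_mm256_loadu_pd(a + j),
                                                  _mm256_loadu_pd(b + j)));
        for(int j = simdCols; j < cols; j++) //leftover columns
            r[j] = a[j] + b[j];
    }
}

With stride == cols this walks one contiguous block; for a slice, pass mat + startRow * stride + startCol together with the slice's own rows/cols while keeping the parent's stride.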

But the most important thing here: your benchmarking is completely wrong.

  1. To compare different implementations of an algorithm, you must compile 2 different programs. When you run both implementations in a single program, many interesting things happen: cache hierarchy kicks in, CPUs have another dedicated cache for decoded instructions, modern CPUs are extremely fast at thermal throttling, and more.

  2. Compilers are not stupid. When they realize you compute something without using the result, they can drop the code. When they realize you're computing something multiple times without changing the input data, they can eliminate the redundant code and only compute it once.
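
A minimal guard against point 2, as a sketch using the variables from the question's `main` (not the only way to do it): perturb the input and consume the output every iteration, so neither dead-code elimination nor redundancy elimination can apply:

double checksum = 0;
for(int i = 0; i < LOOP_COUNT; i++){
    matrix_a[i % rows][i % cols] += 1.0; //change the input so iterations aren't identical
    add_matrix(result, matrix_a, matrix_b, rows, cols);
    checksum += result[i % rows][i % cols]; //use the output so it can't be dropped
}
printf("checksum: %f\n", checksum);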

Soonts