4

I have a standard matrix multiplication algorithm.

// matrix.h

typedef struct Matrix {
    int rows;
    int columns;
    double *elements;
} Matrix;

// Constructors/destructors used in main.c (full definitions in the linked gist)
Matrix createMatrix(int rows, int columns);
void deleteMatrix(Matrix *matrix);

double *addressAt(Matrix *matrix, int row, int column);
double *rowAt(Matrix *matrix, int row);
double dotProductColumnArray(Matrix *matrix, const int column, double *array);
void mmProduct(Matrix *first, Matrix *second, Matrix *result);

// matrix.c

// Helper function to find address of elements at row and column.
inline double *addressAt(Matrix *matrix, int row, int column) {
    return &(matrix->elements[row * matrix->columns + column]);
}

// Helper function to find row array
inline double *rowAt(Matrix *matrix, int row) {
    return &(matrix->elements[row * matrix->columns]);
}

// Finds the dot product of a column of a matrix and an array.
double dotProductColumnArray(Matrix *matrix, const int column, double *array) {
    double sum = 0.0;
    const int rows = matrix->rows, columns = matrix->columns;
    int j = column;
    const double *elements = matrix->elements;
    for (int i = 0; i < rows; i++) {
        sum += array[i] * elements[j];
        j += columns;
    }
    return sum;
}

void mmProduct(Matrix *first, Matrix *second, Matrix *result) {
    const int rows = result->rows;
    const int columns = result->columns;
    for (int i = 0; i < rows; i++) {
        double *row = rowAt(first, i);
        for (int j = 0; j < columns; j++) {
            *addressAt(result, i, j) = dotProductColumnArray(second, j, row);
        }
    }
}

// Snippet of main.c

#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#include "matrix.h"

// Fills the matrix with random numbers between -1 and 1
void randomFill(Matrix *matrix);

int main() {
    struct timeval timestamp;
    long start, now;
    Matrix first, second, result;
    
    // 2^10
    first = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    second = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    randomFill(&first);
    randomFill(&second);
    result = createMatrix((int)pow(2, 10), (int)pow(2, 10));

    gettimeofday(&timestamp, NULL);
    start = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    mmProduct(&first, &second, &result);
    gettimeofday(&timestamp, NULL);
    now = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    printf("Time taken: %ldms\n", now - start);
    
    deleteMatrix(&first);
    deleteMatrix(&second);
    deleteMatrix(&result);
    
    // Same code as above but for 2^11, 2^12, replacing pow() when necessary.
    // ...
    // ...
}

Full code here to run it yourself: https://gist.github.com/kevinMEH/c37bda728bf5ee53c32a405ef611e5a7


When I run the algorithm with two 2^10 x 2^10 matrices, it takes about 2 seconds to multiply.

When I run the algorithm with two 2^11 x 2^11 matrices, I expect it to take about 8 * 2 seconds ~ 16 seconds. However, it takes 36 seconds instead.

When I run the algorithm with two 2^12 x 2^12 matrices, I expect it to take about 64 * 2000 milliseconds ~ 128 seconds. Worst case, extrapolating from the measured 36 seconds, I expect 8 * 36 = 288 seconds. However, it takes 391 seconds.

(Compiled with Apple Clang 14.0.0 with -O1; with -Ofast and with no optimizations it scales similarly, still not by a factor of 8 but greater. Run on an Apple M1 chip.)

I'm perplexed as to why this is. The number of addition and multiplication operations performed is exactly (2^n)^3, and when I increase the width of the matrices by a factor of 2, I expect the subsequent multiplication to take only 2^3 = 8 times longer each time.
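
To spell the expectation out: a 2^10 x 2^10 product performs (2^10)^3 = 2^30 ≈ 1.07 billion multiply-add pairs, a 2^11 x 2^11 product performs 2^33, exactly 8 times as many, and a 2^12 x 2^12 product performs 2^36, 8 times as many again.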

Suspects:

The addressAt and rowAt helper functions are contributing to the slowdown: This does not make sense; the number of additions and multiplications scales as n^3, but calls to addressAt only scale as n^2 and calls to rowAt as n, so relatively speaking the helpers should matter less as n grows. (8 * (n^3 + n^2 + n) > 8n^3 + 4n^2 + 2n.) I've also manually inlined the functions and found no noticeable speed increase, so it should not be function call overhead.

Subnormal numbers: My random number generator gets a random u64 and divides it by 2 * 0x8..0u, yielding a number between 0 and 1, so there should be no subnormal numbers. Nonetheless, in the randomFill function I multiplied everything by 100000, and the result was the same.
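
For reference, a sketch of that kind of generator; the actual randomFill is in the gist, and the u64 source below is purely illustrative. Dividing a raw 64-bit value by a power of two yields an exactly scaled result in [-1, 1), nowhere near the subnormal range:

#include <stdint.h>
#include <stdlib.h>

// Illustrative u64 source only; the real code uses its own generator.
static uint64_t randomU64(void) {
    return ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 11) ^ (uint64_t)rand();
}

void randomFill(Matrix *matrix) {
    const int count = matrix->rows * matrix->columns;
    for (int i = 0; i < count; i++) {
        // Reinterpret as signed, then divide by 2^63: the result is in [-1, 1).
        matrix->elements[i] = (double)(int64_t)randomU64() / 9223372036854775808.0;
    }
}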

dotProductColumnArray(): I suspect this is the culprit, but I cannot find a reason why. My best guess is that the slowdown comes from the CPU prefetching the wrong elements: when j is incremented by columns, the CPU does not prefetch the array elements at matrix->elements[j + n * columns] because it did not expect j to be incremented by columns (it prefetches around j, j + 1, and so on, but not j + columns). But that does not make sense to me either, as columns is constant, so the CPU should know better.

Side note:

I also wrote mmStrassen(), which implements the Strassen matrix multiplication algorithm, and when I tested it, it took, as expected, about 7 times longer each time (Strassen performs O(n^(log2 7)) operations, and 2^(log2 7) = 7). It was tested in the same run, right after the regular mmProduct, so the timings are directly comparable.


Any optimization recommendations are also welcome, thank you.

Thanks and apologies in advance for the long question.

chqrlie
kmeh
  • Post compiler used and optimization level enabled. – chux - Reinstate Monica Mar 22 '23 at 04:23
    "That does not make sense to me as columns is constant, so the CPU should know better." The CPU stride prefetcher [can prefetch forward or backward and can detect strides of up to 2K bytes](https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Hardware-prefetch-and-shared-multi-core-resources-on-Xeon/m-p/1074003#:~:text=This%20prefetcher%20can%20prefetch%20forward%20or%20backward%20and%20can%20detect%20strides%20of%20up%20to%202K%20bytes.). Your stride is bigger than 2KB. – Raymond Chen Mar 22 '23 at 04:25
    "[The impact of cache locality on performance in C through matrix multiplication](https://levelup.gitconnected.com/c-programming-hacks-4-matrix-multiplication-are-we-doing-it-right-21a9f1cbf53)" - shows alternate algorithm that has better locality. – Raymond Chen Mar 22 '23 at 04:31
  • @kmeh, BITD, simplistic compilers benefitted with walking the array from `n-1` to 0 as checking `i >= 0` is simpler/faster than `i < n`. Yet I have doubt that micro-optimization holds today. – chux - Reinstate Monica Mar 22 '23 at 04:31
    You can get a considerable speedup simply by transposing one of the matrices and then computing the row-product. And that even includes the additional overhead of allocating a new temporary matrix and computing the transpose. – paddy Mar 22 '23 at 04:59
  • Three ways to figure out what's going on: profile, profile, profile. – n. m. could be an AI Mar 22 '23 at 06:10
    [Relevant](https://stackoverflow.com/questions/14532169/why-is-a-na%c3%afve-c-matrix-multiplication-100-times-slower-than-blas/14532446?r=SearchResults&s=6%7C15.8865#14532446). – Eric Postpischil Mar 22 '23 at 10:15
  • @n.m.: Profiling can help reveal which parts of code are taking time. It does not help when the code has a single core loop that is taking very different amounts of time due to cache effects, as it just reports what is already known: Most of the time is spent in the single core loop. – Eric Postpischil Mar 22 '23 at 10:18
  • @EricPostpischil Known to whom? OP suspects `addressAt` and `rowAt`, profiling will help rule those out. Profiling also can help pinpoint the specific places within the loop, although it's harder to interpret the results since the loop can be unrolled. Still, it should be possible to see that the killer is the column traversal. – n. m. could be an AI Mar 22 '23 at 10:33
  • @n.m.: Re “profiling will help rule those out”: Really? Or is profiling what led OP to falsely suspect them in the first place? – Eric Postpischil Mar 22 '23 at 10:40
  • @EricPostpischil Maybe, but then they are using a broken profiler. Mine shows that 99.98% of time is spent in the line `sum += array[i] * elements[j];`. – n. m. could be an AI Mar 22 '23 at 15:10
  • @n.m.: And, so? As I said, “it just reports what is already known: Most of the time is spent in the single core loop.” OP already suspected that and is pursuing a reason why. Profiling does not help with that. – Eric Postpischil Mar 22 '23 at 18:21
  • Thanks everyone and @RaymondChen, that article and the information about the prefetcher was very helpful. I have posted an analysis with modified functions based on the article. – kmeh Mar 23 '23 at 02:43

3 Answers

4

Candidate improvement: use restrict, const.

restrict allows the compiler to assume that changes to matrix[] do not affect array[] (i.e., they do not overlap) and to optimize accordingly. Without restrict, the compiler must assume that changing matrix[] may change array[], necessitating doing everything one element at a time instead of allowing some parallelism.

const might have some marginal benefit too.

// double dotProductColumnArray(
//     Matrix* matrix, const int column, double* array) {
double dotProductColumnArray(
    Matrix* restrict matrix, const int column, const double* restrict array) {

Likewise

// void mmProduct(Matrix* first, 
//     Matrix* second, Matrix* result) {
void mmProduct(const Matrix* restrict first, 
    const Matrix* restrict second, Matrix* restrict result) {
chux - Reinstate Monica
  • I'm afraid this is not enough unless the `Matrix` `elements` pointer is also `restrict` qualified, which is incompatible with using `rowAt` and `columnAt` functions and local pointers. – chqrlie Mar 22 '23 at 06:09
2

The matrix dimensions are exact powers of 2, which causes catastrophic cache behavior as reading the column values causes cache collisions almost every time.
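
To put numbers on "almost every time" (the exact cache geometry is an assumption here, e.g. a 128 KiB, 8-way L1d with 64-byte lines, i.e. 256 sets): consecutive elements of a column of a 2^11 x 2^11 matrix of doubles are 2^11 * 8 = 2^14 bytes apart. The set index is taken from address bits 6 through 13, and a 2^14-byte stride leaves those bits unchanged, so every element of a column maps to the same cache set, and at most 8 of its lines can be resident at once.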

You should achieve much faster execution and the expected complexity with a simple alternative: transpose the second matrix and implement transposed matrix multiplication where each coefficient of the destination matrix is the dot product of 2 matrix rows.

Here is a simplistic implementation:

void mmProduct2(Matrix *first, Matrix *second, Matrix *result) {
    Matrix t = createMatrix(second->columns, second->rows);
    transpose(second, &t);
    int resultIndex = 0;
    for(int i = 0; i < result->rows; i++) {
        double *row1 = rowAt(first, i);
        for(int j = 0; j < result->columns; j++) {
            double *row2 = rowAt(&t, j);
            double sum = 0.0;
            for (int k = 0; k < first->columns; k++) {
                sum += row1[k] * row2[k];
            }
            result->elements[resultIndex++] = sum;
        }
    }
    deleteMatrix(&t);
}
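
The transpose helper is not shown here; a minimal sketch consistent with the Matrix layout above might look like this (dest is assumed to be already allocated with swapped dimensions):

// Copies src into dest with rows and columns exchanged.
// Requires dest->rows == src->columns and dest->columns == src->rows.
void transpose(Matrix *src, Matrix *dest) {
    for (int i = 0; i < src->rows; i++) {
        for (int j = 0; j < src->columns; j++) {
            dest->elements[j * dest->columns + i] = src->elements[i * src->columns + j];
        }
    }
}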

Running your tests on my M2 MacBook, I first got 0 ms timings for all 3 multiplications: with -O3, clang detects that none of the computations' results are used, hence everything can be optimized out.

Adding a signature function and displaying its result forced the code generation:

double hashMatrix(const Matrix *matrix) {
    double sum = 0;
    int index = 0;
    for(int i = 0; i < matrix->rows; i++) {
        double sumrow = 0.0;
        for(int j = 0; j < matrix->columns; j++) {
            sumrow += matrix->elements[index++] * (j + 1);
        }
        sum += sumrow * (i + 1);
    }
    return sum;
}

The timings were consistent with yours:

Time taken: 1031ms, hash=-2.18033e+08
Time taken: 28893ms, hash=-5.78916e+09
Time taken: 245019ms, hash=5.04211e+11

Changing from mmProduct to mmProduct2 produces timings consistent with cubic time complexity (using the same random data):

Time taken: 859ms, hash=-2.18033e+08
Time taken: 7080ms, hash=-5.78916e+09
Time taken: 57914ms, hash=5.04211e+11
chqrlie
1

I tried all the suggestions made (transposing the second matrix and then computing a regular dot product, adding const and __restrict keywords), but it turns out that the reason my algorithm was so slow was poor cache locality, as explained in the article @RaymondChen posted: https://levelup.gitconnected.com/c-programming-hacks-4-matrix-multiplication-are-we-doing-it-right-21a9f1cbf53

Problem

The problem arises from the fact that in my original implementation, consecutive column elements are very far apart in memory, so the CPU has to fetch the next element of the column anew at nearly every step of the computation.

Not only that: as @RaymondChen also pointed out, the stride length (the distance in memory between consecutive column elements of the second matrix) may be bigger than my CPU's prefetcher can detect, so instead of guessing the location of the next column element from the previous strides, the CPU has to wait for each load.
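
Concretely: for a 2^11 x 2^11 matrix of doubles, consecutive column elements are 2^11 * 8 bytes = 16 KiB apart, far beyond the roughly 2 KB stride that the prefetcher described in the comments can track.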

Swapping the k and j loops:

When we swap the k and j loops, the problem disappears: the column elements are accessed one after the other. Not only that, it also creates opportunities for further optimizations in the code!

Swapping the k and j loops means that our target result[i][j] changes with every iteration, as j is now the innermost index. This is actually very good: subsequent result elements are spaced one after the other, so we can use SIMD on these result elements to compute multiple iterations of the loop at once, as opposed to the original algorithm, where we have to continuously add to a single result element thousands of times.

Also see that row[k] is now constant throughout the j loop. This further helps with SIMD, and we no longer have to load a different row[x] n^3 times; we only need to do it n^2 times now.

// Original:
for(int i = 0; i < rows; i++) {
    double* row = &(first->elements[i * firstColumns]);
    for(int j = 0; j < columns; j++) {
        for(int k = 0; k < firstColumns; k++) {
            /* The result[i][j] memory address remains constant,
               which is not desirable because no SIMD, all
               operations performed sequentially on a single
               result[i][j] memory address

               The elements in the row array used one after the
               other, which is good.
               
               The elements in the column array are spaced out by
               the value of columns, which is bad.
            */
            result->elements[i * columns + j] += row[k] * second->elements[k * columns + j];
        }
    }
}

// Loop swapped:
for(int i = 0; i < rows; i++) {
    double* row = &(first->elements[i * firstColumns]);
    for(int k = 0; k < firstColumns; k++) {
        for(int j = 0; j < columns; j++) {
            /* The result[i][j] memory address is used once, and
               then subsequent result[i][j + n] addresses are used
               which enables SIMD

               The row element is constant, which is very good!
               
               The elements in the column array are used one after
               the other, which is good.
            */
            result->elements[i * columns + j] += row[k] * second->elements[k * columns + j];
        }
    }
}
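
Note that the accumulating form requires the result elements to be zeroed before entering the loops; the rewritten mmProduct below does this explicitly.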

When I rewrote my mmProduct algorithm so that the second and third loops are swapped, the speedup was absolutely tremendous: from 36 seconds on 2^11 x 2^11 matrices down to just 5.4 seconds.

By comparison, I also implemented the transposed-second-matrix version, and while it gave a considerable speedup, from 36 seconds to 12.9, the loop-swapped version still blows it out of the water, with the added benefit that no extra memory needs to be allocated. I attribute the difference to the fact that the transposed version cannot take advantage of SIMD as easily, since it still repeatedly adds to a single sum variable.

By another comparison, my Strassen multiplication algorithm, run on the same matrices with a recursion depth of 7, also computed the product in 5.4 seconds.

When run on 2^12 x 2^12 matrices, the times were as follows:

  • Original mmProduct: 391 seconds
  • Loop-swapped mmProduct: 47 seconds (scaled by about a factor of 8)
  • Strassen, depth 8: 40 seconds (scaled by about a factor of 7)
  • mmProduct with second matrix transposed: 106 seconds (scaled by about a factor of 8)

Unfortunately, in all cases, the transposed version of mmProduct is overshadowed by the loop-swapped mmProduct.


Rewritten functions:

void mmProduct(Matrix*__restrict first, Matrix*__restrict second, Matrix*__restrict result) {
    const int rows = result->rows;
    const int columns = result->columns;
    const int size = rows * columns;
    const int firstColumns = first->columns;
    for(int i = 0; i < size; i++) {
        result->elements[i] = 0.0;
    }
    for(int i = 0; i < rows; i++) {
        double* row = &(first->elements[i * firstColumns]);
        for(int k = 0; k < firstColumns; k++) {
            for(int j = 0; j < columns; j++) {
                result->elements[i * columns + j] += row[k] * second->elements[k * columns + j];
            }
        }
    }
}

void mmProductWT(Matrix*__restrict first, Matrix*__restrict second, Matrix*__restrict result) {
    const int rows = result->rows;
    const int columns = result->columns;
    const int operandColumns = first->columns;
    double* secondTransposeElements = (double*) calloc(columns * second->rows, sizeof(double));
    Matrix secondTranspose = { columns, second->rows, secondTransposeElements };
    transpose(second, &secondTranspose);
    for(int i = 0; i < rows; i++) {
        for(int j = 0; j < columns; j++) {
            double sum = 0.0;
            for(int k = 0; k < operandColumns; k++) {
                sum += first->elements[i * operandColumns + k] * secondTranspose.elements[j * operandColumns + k];
            }
            result->elements[i * columns + j] = sum;
        }
    }
    free(secondTransposeElements);
}

Here is the code for main.c so you can test it yourself (remember to replace mmProduct with the loop-swapped version):

https://gist.github.com/kevinMEH/98807c5929df84dda793988f7f561763

Thanks to everyone who responded and gave suggestions, special thanks to @RaymondChen.

kmeh
    Excellent! Yet there is no reason to `restrict` the `Matrix` arguments: restricting the `Matrix` structures does not help the compiler as the `elements` pointers could still point to the same array. The destination `Matrix` must be different from both operands, and this must be documented, but `first` and `second` could legitimately point to the same `Matrix`, for example to square it. `const` qualifying `first` and `second` is useful as these structures are not modified. `const` qualifying the local variables themselves is not, as the compiler can easily determine that they are not modified – chqrlie Mar 24 '23 at 06:46