I have a standard matrix multiplication algorithm.
// matrix.h
typedef struct Matrix {
    int rows;
    int columns;
    double *elements;
} Matrix;

double *addressAt(Matrix *matrix, int row, int column);
double *rowAt(Matrix *matrix, int row);
double dotProductColumnArray(Matrix *matrix, const int column, double *array);
void mmProduct(Matrix *first, Matrix *second, Matrix *result);
// matrix.c
#include "matrix.h"

// Helper function to find the address of the element at (row, column).
inline double *addressAt(Matrix *matrix, int row, int column) {
    return &(matrix->elements[row * matrix->columns + column]);
}

// Helper function to find the start of a row.
inline double *rowAt(Matrix *matrix, int row) {
    return &(matrix->elements[row * matrix->columns]);
}

// Finds the dot product of a column of a matrix and an array.
double dotProductColumnArray(Matrix *matrix, const int column, double *array) {
    double sum = 0.0;
    const int rows = matrix->rows, columns = matrix->columns;
    int j = column;
    const double *elements = matrix->elements;
    for (int i = 0; i < rows; i++) {
        sum += array[i] * elements[j];
        j += columns; // step down the column
    }
    return sum;
}

void mmProduct(Matrix *first, Matrix *second, Matrix *result) {
    const int rows = result->rows;
    const int columns = result->columns;
    for (int i = 0; i < rows; i++) {
        double *row = rowAt(first, i);
        for (int j = 0; j < columns; j++) {
            *addressAt(result, i, j) = dotProductColumnArray(second, j, row);
        }
    }
}
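For reference, mmProduct with the helpers expanded is equivalent to the triple loop below. This is just a rewrite to make the memory access pattern explicit; mmProductFlattened is not part of the actual code.

// Equivalent flattened form of mmProduct (not in the gist), shown only to
// make the access pattern explicit: first->elements is read sequentially
// along a row, while second->elements is read with a stride of `columns`
// doubles between consecutive reads.
void mmProductFlattened(Matrix *first, Matrix *second, Matrix *result) {
    for (int i = 0; i < result->rows; i++) {
        for (int j = 0; j < result->columns; j++) {
            double sum = 0.0;
            for (int k = 0; k < second->rows; k++) {
                sum += first->elements[i * first->columns + k]
                     * second->elements[k * second->columns + j];
            }
            result->elements[i * result->columns + j] = sum;
        }
    }
}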
// Snippet of main.c
#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#include "matrix.h"

// Fills the matrix with random numbers between -1 and 1.
void randomFill(Matrix *matrix);

int main() {
    struct timeval timestamp;
    long start, now;
    Matrix first, second, result;

    // 2^10 = 1024
    first = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    second = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    randomFill(&first);
    randomFill(&second);
    result = createMatrix((int)pow(2, 10), (int)pow(2, 10));

    gettimeofday(&timestamp, NULL);
    start = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    mmProduct(&first, &second, &result);
    gettimeofday(&timestamp, NULL);
    now = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    printf("Time taken: %ldms\n", now - start);

    deleteMatrix(&first);
    deleteMatrix(&second);
    deleteMatrix(&result);

    // Same code as above but for 2^11 and 2^12, replacing pow() where necessary.
    // ...
    // ...
}
Full code here to run it yourself: https://gist.github.com/kevinMEH/c37bda728bf5ee53c32a405ef611e5a7
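The snippet also uses createMatrix and deleteMatrix from the gist; they are essentially a malloc/free pair around the elements buffer. A simplified sketch (the exact versions are in the gist):

#include <stdlib.h>

// Simplified versions of the allocation helpers used in main.c above.
Matrix createMatrix(int rows, int columns) {
    Matrix matrix;
    matrix.rows = rows;
    matrix.columns = columns;
    matrix.elements = malloc((size_t)rows * (size_t)columns * sizeof(double));
    return matrix;
}

void deleteMatrix(Matrix *matrix) {
    free(matrix->elements);
    matrix->elements = NULL;
}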
When I run the algorithm with two 2^10 x 2^10 matrices, it takes about 2 seconds to multiply them.
When I run it with two 2^11 x 2^11 matrices, I expect it to take about 8 * 2 seconds ~ 16 seconds. Instead, it takes 36 seconds.
When I run it with two 2^12 x 2^12 matrices, I expect about 64 * 2000 milliseconds ~ 128 seconds, or in the worst case 8 * 36 = 288 seconds. Instead, it takes 391 seconds.
(Compiled with Apple Clang 14.0.0 with -O1; with -Ofast and with no optimizations it scales similarly, still by more than a factor of 8. Ran on an Apple M1 chip.)
I'm perplexed as to why this is. The number of addition and multiplication operations performed scales as n^3 (about 2n^3 in total), so when I double the width of the matrices I expect the multiplication to take only 2^3 = 8 times longer.
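Concretely, that is roughly 2 * 1024^3 ~ 2.1 * 10^9 operations at 2^10, 2 * 2048^3 ~ 1.7 * 10^10 at 2^11 (8x as many), and 2 * 4096^3 ~ 1.4 * 10^11 at 2^12 (64x as many), so I expected the wall-clock times to scale by the same factors.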
Suspects:
The addressAt and rowAt helper functions are contributing to the slowdown: This does not make sense. The number of additions and multiplications scales as n^3, while calls to addressAt only scale as n^2 and calls to rowAt as n, so the helper overhead should become relatively smaller as n grows (doubling n: 8 * (n^3 + n^2 + n) > 8n^3 + 4n^2 + 2n).
I've also manually inlined the functions and found no noticeable speed increase, so it should not be the function call overhead.
Subnormal numbers: My random number generator takes a random u64 and divides it by 2 * 0x8..0u, yielding a number between 0 and 1, so there should be no subnormal numbers. Nonetheless, I tried multiplying everything in randomFill by 100000 and got the same result.
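In other words, something along these lines (simplified; the real code is in the gist, and the divisor here is taken to be 2 * 0x8000000000000000 = 2^64):

#include <stdint.h>

// Simplified illustration of the generator described above, not the gist code.
double randomUnit(uint64_t bits) {
    // bits / 2^64 lies in [0, 1]; the smallest nonzero result is
    // 2^-64 ~ 5.4e-20, far above the subnormal threshold (~2.2e-308).
    return (double)bits / (2.0 * (double)0x8000000000000000u);
}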
dotProductColumnArray(): I suspect that this is the culprit, but I cannot find a reason why.
My best guess is that the slowdown comes from the CPU prefetching the wrong elements of the column. Because j is incremented by the number of columns, the CPU does not prefetch the elements at matrix->elements[j + n * columns]; it expects j to advance in small steps (so it prefetches j, j + 1, and so on, but not j + columns). But that does not quite make sense to me either, since columns is constant, so the prefetcher should be able to pick up the stride.
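For concreteness, each step of that inner loop advances the read address by columns * sizeof(double) bytes: 8 KiB at 2^10 columns, 16 KiB at 2^11, and 32 KiB at 2^12, while the row array passed in is read sequentially.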
Side note:
I also wrote mmStrassen(), which implements the Strassen matrix multiplication algorithm, and when I tested it, it took about 7 times longer each time the width doubled, as expected (Strassen is O(n^log2(7)), so doubling n multiplies the work by 7). It was tested in the same run, right after the regular mmProduct, so this is not an issue with the overall running time of the program.
Also, any optimization recommendations are welcome, thank you.
Thanks and apologies in advance for the long question.