I am trying to speed up matrix multiplication; my goal is to make it 150 times faster than the naive implementation. Here is what I have tried in my implementation so far:
- Allocating matrix elements in contiguous blocks for better cache efficiency.
- Transposing the second matrix, which is accessed column-wise, so that its columns lie in consecutive memory blocks.
- Using SIMD instructions.
- Using OpenMP to parallelize the loops.
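For context, the first two steps (contiguous storage plus transposing the second matrix) can be sketched as below. The `Matrix` struct and the function names are illustrative, not my actual code:

```c
#include <stdlib.h>

// Hypothetical contiguous matrix: one malloc'd block, row-major,
// so element (i, j) lives at data[i * cols + j].
typedef struct {
    int rows, cols;
    double *data;
} Matrix;

Matrix mat_alloc(int rows, int cols) {
    Matrix m = { rows, cols,
                 malloc((size_t)rows * (size_t)cols * sizeof(double)) };
    return m;
}

// Transpose B so its columns become contiguous rows; afterwards every
// dot product in the multiply streams through memory sequentially.
Matrix mat_transpose(const Matrix *b) {
    Matrix t = mat_alloc(b->cols, b->rows);
    for (int i = 0; i < b->rows; i++)
        for (int j = 0; j < b->cols; j++)
            t.data[(size_t)j * t.cols + i] = b->data[(size_t)i * b->cols + j];
    return t;
}
```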
After steps 1 and 2, my matmul became 4 times faster than the naive implementation. Adding SIMD brought it to 17 times faster, and adding OpenMP to 56 times faster. I was expecting a much larger gain from OpenMP, at least another 6 to 8 times. What could I be missing? My code roughly looks like this:
#pragma omp parallel for collapse(2)
for (int i = 0; i < result.rows; i += 1) {
    for (int j = 0; j < result.cols; j += 1) {
        double product = 0.0;
        for (int k = 0; k < matrix1.cols / 4 * 4; k += 4) {
            // Use _mm256_mul_pd and _mm256_add_pd to process 4 elements at a time.
        }
        // _mm256_store_pd the vector accumulator and sum its lanes into product.
        for (int k = matrix1.cols / 4 * 4; k < matrix1.cols; k += 1) {
            // Tail case: accumulate the leftover elements into product.
        }
        result.mat[i][j] = product;
    }
}
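For anyone who wants to reproduce the structure, here is a self-contained scalar stand-in for that kernel: the struct layout is assumed (row-major, second matrix pre-transposed), and the 4-wide main loop plus tail mirrors the SIMD split, with the intrinsics replaced by plain arithmetic:

```c
#include <stddef.h>

// Minimal stand-in type; assumes a row-major contiguous layout.
typedef struct { int rows, cols; double *mat; } Mat;  // (i, j) at mat[i*cols + j]

// Scalar version of the posted kernel; bt is B already transposed,
// so every dot product reads two contiguous rows.
void matmul_t(const Mat *a, const Mat *bt, Mat *out) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < out->rows; i++) {
        for (int j = 0; j < out->cols; j++) {
            const double *ra = &a->mat[(size_t)i * a->cols];
            const double *rb = &bt->mat[(size_t)j * bt->cols];
            double product = 0.0;
            int k = 0;
            for (; k < a->cols / 4 * 4; k += 4)   // slot for the SIMD main loop
                product += ra[k] * rb[k] + ra[k + 1] * rb[k + 1]
                         + ra[k + 2] * rb[k + 2] + ra[k + 3] * rb[k + 3];
            for (; k < a->cols; k++)              // tail case
                product += ra[k] * rb[k];
            out->mat[(size_t)i * out->cols + j] = product;
        }
    }
}
```

Without -fopenmp the pragma is simply ignored, so the function also serves as a single-threaded reference for checking the vectorized version's output.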
I want to get to at least 100 times faster than naive, i.e. roughly 2 times faster than my current benchmark. How else should I optimize my code?