I'm running the following code for matrix multiplication the performance of which I'm supposed to measure:
for (int j = 0; j < COLUMNS; j++)
#pragma omp for schedule(dynamic, 10)
for (int k = 0; k < COLUMNS; k++)
for (int i = 0; i < ROWS; i++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
Yes, I know it's really slow, but that's not the point - it's purely for performance measuring purposes. I'm running 3 versions of the code depending on where I put the #pragma omp
directive, and therefore depending on where the parallelization happens. The code is run in Microsoft Visual Studio 2012 in release mode and profiled in CodeXL.
One thing I've noticed from the measurements is that the option in the code snippet (with parallelization before the k loop) is the slowest, then the version with the directive before the j loop, then the one with it before the i loop. The presented version is also the one which calculates a wrong result because of race conditions - multiple threads accessing the same cell of the result matrix at the same time. I understand why the i loop version is the fastest - all the particular threads process only part of the range of the i variable, increasing the temporal locality. However, I don't understand what causes the k loop version to be the slowest - does it have something to do with the fact that it produces the wrong result?