I'm running this OpenMP code for matrix multiplication and I measured its performance:
#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
There are different versions of this code depending on where I put the #pragma omp directive - before the j loop, the k loop, or the i loop. For each of those variants I also ran versions with default static scheduling, static scheduling with chunk sizes 1 and 10, and dynamic scheduling with the same chunk sizes. In CodeXL I measured the number of DC (data cache) accesses, DC misses, CPU clocks, retired instructions, and other performance counters. Here are the results for a 1000x1000 matrix on an AMD Phenom II X4 945:
[Results of performance measurements]
Here multiply_matrices_1_dynamic_1 is the variant with #pragma omp before the first loop and dynamic scheduling with chunk size 1, and so on.
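For concreteness, here is a sketch of how I understand the 3rd variant with dynamic scheduling and chunk size 1 is put together (the other variants only move the directive or change the schedule clause; the exact shape of the enclosing parallel region is my assumption, since the snippet above only shows the work-shared loops):

#define ROWS    1000
#define COLUMNS 1000

double matrix_a[ROWS][COLUMNS];
double matrix_b[COLUMNS][COLUMNS];
double matrix_r[ROWS][COLUMNS];

/* Sketch of the 3rd variant: the innermost i loop is work-shared with
 * dynamic scheduling, chunk size 1. All threads run the j and k loops
 * redundantly and split the i iterations at each omp for (which ends
 * with an implicit barrier, so the result stays correct). */
void multiply_matrices_3_dynamic_1(void)
{
    #pragma omp parallel
    for (int j = 0; j < COLUMNS; j++)
        for (int k = 0; k < COLUMNS; k++)
        {
            #pragma omp for schedule(dynamic, 1)
            for (int i = 0; i < ROWS; i++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
}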
Here are some things I don't quite understand about the results and would appreciate help with:
- The default static version with parallelization before the inner loop runs in 2.512 s, while the sequential version takes 31.683 s - a speedup of about 12.6x (31.683 / 2.512), even though it ran on a 4-core machine, so I expected a maximum speedup of 4x. Is a superlinear speedup like this logically possible, or is it some kind of error? How can I explain it?
- CodeXL says that the 3rd version with static scheduling has a much lower number of DC accesses (and misses) than the other versions. Why is that? Is it because all the parallel threads operate on the same cell of matrix_b at the same time?
- I know the 2nd version is incorrect, because threads may operate on the same cell of matrix_r, which can cause race conditions (see the sketch after this list). But why does it also have lower performance? Does the incorrectness somehow cause it?
- I realize that dynamic scheduling adds overhead, which is why the code slows down when using it. With the 3rd version (pragma before the i loop), scheduling also happens far more often: the work-sharing construct is reached once per (j, k) pair, i.e. 10^6 times for a 1000x1000 matrix, and with chunk size 1 that means roughly 10^9 chunk assignments, so the algorithm is much less performant. However, this scheduling doesn't seem to cause a slowdown in the 2nd version (pragma before the k loop). Why is that?
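For the 2nd point above, here is how I picture the race, together with one way to make that variant correct (an atomic update; a reduction into private copies would be another option). It reuses the globals from the earlier sketch:

/* 2nd variant: the k loop is work-shared. For a fixed j, every k
 * iteration accumulates into the same cells matrix_r[0..ROWS-1][j],
 * so two threads can interleave the read-modify-write of a single
 * cell - a data race. Making the update atomic restores correctness,
 * at a noticeable cost. */
void multiply_matrices_2_static_atomic(void)
{
    #pragma omp parallel
    for (int j = 0; j < COLUMNS; j++)
    {
        #pragma omp for schedule(static)
        for (int k = 0; k < COLUMNS; k++)
            for (int i = 0; i < ROWS; i++)
            {
                #pragma omp atomic
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
    }
}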
Also, I'm confused about the relationship between TLB misses and cache misses. When exactly is the DTLB used? My professor's document says that every DC access is also a DTLB request, but I don't understand how that works - the number of DTLB misses is often bigger than the number of DC accesses. How do I compute the TLB miss ratio? My professor says it's TLB misses / DC accesses. He also says I can estimate temporal locality from the cache hit ratio and spatial locality from the TLB hit ratio. How does that work exactly?
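To make that concrete, here is how I'm currently computing the ratios from the CodeXL counters, following my professor's formulas (the function and parameter names are mine, purely for illustration):

/* My professor's formulas as I understand them; the inputs are the
 * raw counter values measured in CodeXL. */
double cache_hit_ratio(double dc_accesses, double dc_misses)
{
    return 1.0 - dc_misses / dc_accesses;   /* "temporal locality" */
}

double tlb_miss_ratio(double dc_accesses, double dtlb_misses)
{
    /* The confusing part: in my measurements dtlb_misses can be
     * larger than dc_accesses, so this "ratio" exceeds 1. */
    return dtlb_misses / dc_accesses;
}

double tlb_hit_ratio(double dc_accesses, double dtlb_misses)
{
    return 1.0 - tlb_miss_ratio(dc_accesses, dtlb_misses);   /* "spatial locality" */
}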