
I'm running this OpenMP code for matrix multiplication and I measured its performance:

#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

There are different versions of the code based on where I put the #pragma omp directive: before the j loop, the k loop, or the i loop. For each of those variants I also ran different versions with default static scheduling, static scheduling with chunk sizes 1 and 10, and dynamic scheduling with the same chunk sizes. I also measured the number of DC accesses, DC misses, CPU clocks, retired instructions, and other performance indicators in CodeXL. Here are the results for a 1000x1000 matrix on an AMD Phenom II X4 945:

Results of performance measurements

In these results, multiply_matrices_1_dynamic_1 is the function with #pragma omp before the first loop and dynamic scheduling with chunk 1, and so on. Here are some things I don't quite understand about the results and would appreciate help with:

  • The default static version with parallelization before the inner loop runs in 2.512 s, while the sequential version runs in 31.683 s. That's a speedup of roughly 12.6x, even though it ran on a 4-core machine, so I imagined that the biggest possible speedup would be 4x. Is it logically possible for this to occur, or is it some error? How can I explain it?
  • CodeXL says that the 3rd version with static scheduling has a much lower number of DC accesses (and misses) than the other versions. Why is that? Isn't it largely because all the parallel threads operate on the same cell of the b matrix? Is that it?
  • I know that the 2nd version is incorrect, because the threads might operate on the same cell of the R matrix, which can cause race conditions. Why does it have lower performance, though? Does the incorrectness somehow cause that?
  • I realize that dynamic scheduling causes overhead, which is why the code slows down when using it. Also, with the 3rd version (with the pragma before the i loop) the scheduling is performed many more times (with chunk 1 and a 1000x1000 matrix, roughly 1000 x 1000 x 1000 = 10^9 chunk assignments), so the algorithm is much less performant. However, this scheduling doesn't seem to cause a slowdown in the 2nd version (with the pragma before the 2nd loop). Why is that?

Also, I'm confused about the relation of TLB misses to cache misses. When exactly is the DTLB used? My professor's document says that every DC access is a DTLB request, but I don't understand how that works: the number of TLB misses is often bigger than the number of DC accesses. How do I compute the TLB miss ratio? My professor says it's TLB misses / DC accesses. He also says I can measure temporal locality by the cache hit ratio and spatial locality by the TLB hit ratio. How does that work exactly?

M. Zdunek
  • The order of your loops looks like what I think would be efficient with Fortran i.e. you want to iterate over the rows of a column with Fortran. But with C you want to iterate over the columns of a row. So with C make your outermost loop over `i` and your innermost loop over `j` and that should be much more cache friendly. See my answer. – Z boson Jan 19 '16 at 10:15
  • How are you allocating the matrices? What are your compile options? Can you provide a more complete example? – Z boson Jan 19 '16 at 10:34

2 Answers


Gilles has the right idea that your code is cache unfriendly, but his solution still has a similar problem because it does the reduction over k on matrix_b[k][j], which is a strided access.

One solution is to compute the transpose of matrix_b; then the reduction over k runs over matrix_bT[j][k], which is cache friendly. The transpose is O(n^2) and the matrix multiplication is O(n^3), so the relative cost of the transpose goes as 1/n, i.e. for large n it becomes negligible.
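Something like this, inside the same parallel region as the other snippets here (a sketch, not tested; it assumes double elements and a separate COLUMNS x COLUMNS buffer matrix_bT, which is not in the original code):

#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)           // O(n^2) transpose of matrix_b
    for (int k = 0; k < COLUMNS; k++)
        matrix_bT[j][k] = matrix_b[k][j];   // rows of matrix_bT written contiguously
// implicit barrier here, so the transpose is finished before the multiplication begins
#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLUMNS; j++) {
        double sum = 0.0;                   // assumes double elements
        for (int k = 0; k < COLUMNS; k++)
            sum += matrix_a[i][k] * matrix_bT[j][k];   // both rows read with unit stride
        matrix_r[i][j] += sum;
    }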

But there is an even easier solution than using a transpose. Do the reduction over j like this:

#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++) {
    for (int k = 0; k < COLUMNS; k++) {
        // the innermost loop walks matrix_r[i][j] and matrix_b[k][j] with unit stride
        for (int j = 0; j < COLUMNS; j++) {
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
}

Gilles' method requires two reads from memory per inner iteration, whereas this solution requires two reads and a write to memory per iteration, but it's much more cache friendly, which more than makes up for the extra write.

Z boson

I'm not sure what your figures show, but what I am sure of is that your code, the way it is written at the moment, is about as inefficient as it can be. So discussing the details of this or that counter will make very little sense until you have made the code reasonably efficient.

The reason I claim your code is inefficient is that the order in which you organised your loops is probably the worst possible one: none of the accesses to your data are linear, which leads to very inefficient use of the cache. By simply swapping the loops around, you should dramatically improve your performance, and from there you can start looking at what more can be done to improve it further.

For example, this version should already be much better (not tested):

#pragma omp for schedule( static )
for ( int i = 0; i < ROWS; i++ ) {
    for ( int j = 0; j < COLUMNS; j++ ) {
        auto res = matrix_r[i][j]; // IDK the type here
        #pragma omp simd reduction( + : res )
        for ( int k = 0; k < COLUMNS; k++ ) {
           res += matrix_a[i][k] * matrix_b[k][j];
        }
        matrix_r[i][j] = res;
    }
}

(NB: I added the simd directive just because it looked appropriate, but it was by no means the point here.)

From there, experimenting with loop collapsing, thread scheduling and/or loop tiling will start to make sense.
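For example, collapsing the two outer loops is just a matter of adding a collapse clause to the directive (again a sketch, not tested):

// collapse( 2 ) is legal because the i and j loops are perfectly nested
#pragma omp for collapse( 2 ) schedule( static )
for ( int i = 0; i < ROWS; i++ ) {
    for ( int j = 0; j < COLUMNS; j++ ) {
        auto res = matrix_r[i][j]; // IDK the type here
        #pragma omp simd reduction( + : res )
        for ( int k = 0; k < COLUMNS; k++ ) {
           res += matrix_a[i][k] * matrix_b[k][j];
        }
        matrix_r[i][j] = res;
    }
}

Whether it actually pays off is another matter: with ROWS = 1000 and only 4 threads, the outer loop alone already provides plenty of parallelism.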

Gilles