
I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.

#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}

The algorithm accesses the elements of both input sub-matrices row-wise to exploit spatial locality and improve cache usage.

What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?

You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.

Caglayan DOKME
  • Is it for a practical purpose or to learn? In the first case there are a lot of high-performance linear algebra libraries which can do all sorts of optimizations for your specific CPU. – Unlikus Jan 12 '23 at 15:07
  • "How to obtain performance enhancement while multiplying two sub-matrices?" - Step one would be to turn *on* your compilers optimizer when building the code (debug builds (usually the compiler default) can be *really* slow). – Jesper Juhl Jan 12 '23 at 15:08
  • @Unlikus It's for learning. Thanks for the suggestion. – Caglayan DOKME Jan 12 '23 at 15:11
  • @JesperJuhl My purpose is to enhance this single code piece. Anyway, thanks for the optimization suggestion. – Caglayan DOKME Jan 12 '23 at 15:12
  • Have you profiled yet to see where your bottlenecks are? It could well be that OpenMP isn't the best way to optimize your problem; you might instead rely on vectorization of one matrix multiplication and use threads to do the multiplications in parallel (keeping matrix data confined to one processor/cache). (And yes, I've seen people try to optimize image processing by running multiple FFTs in parallel, with each FFT being optimized to run on multiple cores... and that will not work.) – Pepijn Kramer Jan 12 '23 at 15:21
  • @PepijnKramer What's the best way for profiling such an algorithm? – Caglayan DOKME Jan 12 '23 at 15:45
  • While looking up how standard-library parallelization compares to OpenMP, I found this: https://stackoverflow.com/questions/64326234/c17-parallel-algorithm-vs-tbb-parallel-vs-openmp-performance. Again, measurement is important. And if you can't get your hands on a profiler, maybe this helps: https://github.com/google/benchmark – Pepijn Kramer Jan 12 '23 at 16:24

2 Answers


For the matrix-matrix product, any permutation of the i,j,k indices computes the right result sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you cannot just collapse the outer two loops. Do a k,j interchange and then it is allowed.
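
For example, a sketch of that interchange, using the identifiers from the question (accumulating into a local scalar is an extra assumption on my part, not something the interchange requires):

// Sketch only: the loop order is i,j,k after the k,j interchange.
// Each (i, j) pair now writes a unique element of resultMatrix, so
// collapsing the outer two loops introduces no race.
#pragma omp parallel for collapse(2)
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
        auto sum = resultMatrix(i, j);   // local accumulator avoids repeated stores
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            sum += matrixA(i, k) * matrixB(k, j);
        }
        resultMatrix(i, j) = sum;
    }
}

Note that this trades away the row-wise access of matrixB that your i,k,j order had; blocking (below) is the more principled fix.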

Of course, OpenMP only gets you from 5 percent efficiency on one core to 5 percent efficiency on all cores. You really want to block the loops, but that is a lot harder. See the paper by Goto and van de Geijn.
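
To give the flavor, here is a minimal sketch of a single level of blocking. The tile size is an assumed, tunable parameter; the Goto/van de Geijn approach additionally packs tiles into contiguous buffers and blocks for several cache levels:

// One level of cache blocking; needs <algorithm> for std::min.
constexpr size_t tileSize = 64;   // assumed value, tune for your cache
#pragma omp parallel for collapse(2)
for(size_t ii = 0; ii < matrixA.m_edgeSize; ii += tileSize) {
    for(size_t jj = 0; jj < matrixA.m_edgeSize; jj += tileSize) {
        // one thread owns the (ii, jj) result tile for all kk: no races
        for(size_t kk = 0; kk < matrixA.m_edgeSize; kk += tileSize) {
            const size_t iEnd = std::min(ii + tileSize, matrixA.m_edgeSize);
            const size_t jEnd = std::min(jj + tileSize, matrixA.m_edgeSize);
            const size_t kEnd = std::min(kk + tileSize, matrixA.m_edgeSize);
            for(size_t i = ii; i < iEnd; i++)
                for(size_t k = kk; k < kEnd; k++)
                    for(size_t j = jj; j < jEnd; j++)
                        resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}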

Victor Eijkhout
  • This is the true solution. He may have the += operator implemented thread-safe, but that would be too slow. Maybe he meant to use a temporary variable for accumulation and to add it outside the loop. We need minimal, fully working code. – huseyin tugrul buyukisik Jan 12 '23 at 17:20
  • @huseyintugrulbuyukisik It hadn't occurred to me that the overloaded operator could be an atomic update or so, but as you say, that would be slow. – Victor Eijkhout Jan 12 '23 at 17:30

I'm adding something related to the main (container) matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used between different iterations and is likely to benefit from the CPU cache. For example, if a matrix is made of 4 sub-matrices, then each sub-matrix is used twice to compute a value of the result matrix.

To benefit most from the cache, the re-used data should be kept in the cache of the same thread (core). To do this, it may be better to move the work distribution up to the level where you select the two sub-matrices.

So, something like this:

select sub-matrix A                  // outer selection stays sequential
    #pragma omp parallel for         // distribute the sub-matrix B selections over threads
    for each sub-matrix B {
        for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
            for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
                for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                    resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
                }
            }
        }
    }

could work faster since the whole working set always stays with the same thread (core).
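
Concretely, it could look like the sketch below. The container names bigA, bigB, bigResult and the subMatrix(row, col) view helper are hypothetical, invented only to show where the pragma moves; your actual selection code will differ:

// Hypothetical sketch: bigA, bigB, bigResult and subMatrix() are
// illustrative names, not part of the original code.
const size_t blocks = 2;   // e.g. each container split into 2x2 sub-matrices
#pragma omp parallel for collapse(2)
for(size_t bi = 0; bi < blocks; bi++) {
    for(size_t bj = 0; bj < blocks; bj++) {
        for(size_t bk = 0; bk < blocks; bk++) {
            auto matrixA = bigA.subMatrix(bi, bk);          // views into the containers
            auto matrixB = bigB.subMatrix(bk, bj);
            auto resultMatrix = bigResult.subMatrix(bi, bj);
            // the original i,k,j kernel runs sequentially inside one thread,
            // so the re-used blocks stay in that core's cache
            for(size_t i = 0; i < matrixA.m_edgeSize; i++)
                for(size_t k = 0; k < matrixA.m_edgeSize; k++)
                    for(size_t j = 0; j < matrixA.m_edgeSize; j++)
                        resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}

Each thread gets a unique (bi, bj) pair, so it exclusively owns one result block across all bk iterations.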

huseyin tugrul buyukisik
  • Thanks for the suggestion. I'm not using it to multiply bigger matrices. It's just code for a made-up image-processing problem. The problem consists of two sub-images placed diagonally symmetric to each other. My eventual purpose is to find any correlation between the size of the overlapped areas of the multiplicand matrices and the overall performance of the multiplication algorithm. – Caglayan DOKME Jan 12 '23 at 16:53
  • So you are benchmarking matrix size vs. timing. Caching can affect performance dramatically depending on the data/cache size ratio. – huseyin tugrul buyukisik Jan 12 '23 at 17:36