So I have a task to compare the sequential and parallel versions of the code.

```c
// sequential
for (int k = 0; k < ROWS; k++) {
    for (int i = 0; i < COLUMNS; i++) {
        for (int j = 0; j < COLUMNS; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

// parallel
omp_set_num_threads(8);
int r = 128;
#pragma omp parallel for
for (int i = 0; i < ROWS; i += r) {
    for (int j = 0; j < COLUMNS; j += r) {
        for (int k = 0; k < COLUMNS; k += r) {
            for (int ii = i; ii < i + r; ii++) {
                for (int jj = j; jj < j + r; jj++) {
                    for (int kk = k; kk < k + r; kk++) {
                        c[ii][jj] += a[ii][kk] * b[kk][jj];
                    }
                }
            }
        }
    }
}
```

I have to use this loop nesting! The problem I have is that the parallel version of the code is actually slower than the sequential one. Am I making a mistake somewhere, or is it like this because of the loop nesting? E.g. for a matrix size of 1024: seq = 0.61 s, par = 1.16 s.

Edit:

```c
#pragma omp parallel for reduction(+:sum)
for (int k = 0; k < ROWS; k++) {
    for (int i = 0; i < COLUMNS; i++) {
        sum = 0.0;
        for (int j = 0; j < COLUMNS; j++) {
            sum = a[i][k] * b[k][j];
            c[i][j] += sum;
        }
    }
}
```
  • Ok, so you are trying out tiling to optimize your matrix product (you may want to mention these kinds of keywords in your question to make it easier to parse). What is your edit trying to tell us? You still need the `k, i, j` loop ordering? – paleonix Aug 15 '22 at 11:52
  • I need to use the k,i,j loop order for the 3-loop version of the code. I tried to get rid of the race condition with `reduction` (it looks like it's working), but now I have the problem of making the parallel code faster than its sequential version. – Siarczansodu99 Aug 15 '22 at 12:02
  • Does this answer your question? [OpenMP for matrix multiplication](https://stackoverflow.com/questions/24399525/openmp-for-matrix-multiplication) – paleonix Aug 15 '22 at 12:05

1 Answer

The edited code is bogus: `reduction(+:sum)` is useless since `sum` is overwritten rather than incremented. In practice, there is a race condition in it because multiple threads can access the same `c[i][j]` value.

The naive sequential code is not so bad, as it makes contiguous accesses along `j` (`a[i][k]` is constant for the `j` loop and should be hoisted by the compiler). The cache is still not used efficiently, though, because the full `c` matrix is read/written for each `k` value.
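As an illustration, this is the sequential kernel with the invariant hoisted explicitly (a minimal sketch, assuming square `double` matrices and the same `ROWS`/`COLUMNS` globals as in the question):

```c
/* Sketch of the hoisting described above: a[i][k] is loaded once per
 * (k, i) pair, and the j loop then streams contiguously through
 * b[k][...] and c[i][...], which the compiler can vectorize. */
for (int k = 0; k < ROWS; k++) {
    for (int i = 0; i < COLUMNS; i++) {
        const double aik = a[i][k];   /* loop-invariant load */
        for (int j = 0; j < COLUMNS; j++) {
            c[i][j] += aik * b[k][j];
        }
    }
}
```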

For the (first) parallel code, the main problem is certainly cache thrashing. Using a blocking strategy is a good idea, but the innermost loop `for (int kk = k; kk < k + r; kk++)` makes the computation very cache-unfriendly compared to the serial code: it cannot be vectorized efficiently by the compiler, and every fetch touches a new cache line (typically aligned to the same power-of-two address, which is inefficient due to N-way cache associativity). Consider moving this loop up so that `jj` runs innermost; the blocked code will then be faster.
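For illustration, the suggested reordering could look like this (a sketch keeping the original tiling, assuming `ROWS` and `COLUMNS` are multiples of `r`; the hoisted `aik` is an optional extra):

```c
/* kk is moved out of the innermost position so that jj runs innermost:
 * both c[ii][...] and b[kk][...] are then accessed contiguously and the
 * inner loop vectorizes. Parallelizing over i blocks gives each thread
 * distinct rows of c, so there is no race. */
#pragma omp parallel for
for (int i = 0; i < ROWS; i += r) {
    for (int j = 0; j < COLUMNS; j += r) {
        for (int k = 0; k < COLUMNS; k += r) {
            for (int ii = i; ii < i + r; ii++) {
                for (int kk = k; kk < k + r; kk++) {
                    const double aik = a[ii][kk];   /* optional hoist */
                    for (int jj = j; jj < j + r; jj++) {
                        c[ii][jj] += aik * b[kk][jj];
                    }
                }
            }
        }
    }
}
```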

Besides, note that there are many more optimizations that make this code much faster independently of using multiple threads. These include using SIMD instructions manually, tuning compiler options (e.g. `-march=native -O3`), loop unrolling, register blocking, using the Strassen algorithm, etc. Optimized BLAS libraries like OpenBLAS, BLIS, and the MKL already include such optimizations.
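For comparison, here is what the BLAS route looks like (a sketch assuming the CBLAS interface of e.g. OpenBLAS, linked with `-lopenblas`, and the row-major `double` matrices from the question):

```c
#include <cblas.h>

/* The whole triple loop collapses to one dgemm call:
 * c = 1.0 * a * b + 1.0 * c, row-major, no transposition. */
void matmul_blas(int n, const double *a, const double *b, double *c) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 1.0, c, n);
}
```

Built with e.g. `-O3 -march=native`, such a call typically outperforms hand-written loops by a wide margin.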

Jérôme Richard
  • Ok. So what is the most efficient way to eliminate the race condition? Putting `#pragma omp atomic` just before writing to the `c` matrix? – Siarczansodu99 Aug 15 '22 at 12:33
  • No, atomics will make the code far slower because the compiler cannot use SIMD instructions and the processor executes such operations much more slowly. The second code can be fixed by doing a reduction on the whole matrix, but this is inefficient because each thread must have its own copy of the matrix (not to mention it requires far more memory on many-core systems). Thus, it is not a good idea. The first parallel code is a good start, but you need to change the loop ordering and then apply the optimizations mentioned at the end of the answer. – Jérôme Richard Aug 15 '22 at 12:41
  • Well, the problem is that I got an assignment that makes me work with only the KIJ version of the code. Can you show me how to reduce the matrices then? I tried something like `#pragma omp parallel for reduction(+:c[:1024][:1024])`, but it doesn't work. – Siarczansodu99 Aug 15 '22 at 12:44
  • I do not understand the logic, since your first parallel code was an IJK loop. For the reduction, I would have done something like that, or used a manual version (i.e. allocating N matrices for the N threads and performing a final reduction on them). Note that array reduction requires a quite recent compiler. An alternative solution is to use a `#pragma omp parallel` section + a `#pragma omp for` for the inner loops (see the sketch after these comments). – Jérôme Richard Aug 15 '22 at 16:43
  • This is basically the same question as you asked yesterday, no? – Victor Eijkhout Aug 15 '22 at 20:30
  • @VictorEijkhout Indeed, it looks very similar... – Jérôme Richard Aug 15 '22 at 22:10
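
A minimal sketch of that last alternative, keeping the mandated k,i,j order (the hoisted `aik` is optional):

```c
/* One parallel region: every thread runs the k loop, and the
 * worksharing "omp for" splits the i loop among threads. For a given k
 * each thread writes distinct rows of c, so there is no race, and the
 * implicit barrier at the end of "omp for" orders the k iterations. */
#pragma omp parallel
for (int k = 0; k < ROWS; k++) {
    #pragma omp for
    for (int i = 0; i < COLUMNS; i++) {
        const double aik = a[i][k];   /* optional hoist */
        for (int j = 0; j < COLUMNS; j++) {
            c[i][j] += aik * b[k][j];
        }
    }
}
```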