
I am new to OpenMP and I have to complete a project where I need to solve a 2D heat-conduction problem on a matrix using the Jacobi iteration method, parallelised with OpenMP.

Essentially, it is a plate whose four side walls are held at fixed temperatures, and I need to work out the unknown temperature values in the middle.

The code has been given to us and what I am expected to do is three simple things:

  1. Time the serial code
  2. Parallelise the serial code and compare
  3. Further optimise the parallel code if possible

I have run the serial code and parallelised it to make a comparison. Before any optimisation, the serial code is consistently doing better, and I can't help but think I am doing something wrong.

I will try compiler optimisations for both, but I expected the parallel code to be faster. I have chosen fairly large problem sizes for the matrix, including 100 x 100 and 300 x 300 arrays, and almost every single time the serial code does better.

Funny thing is, the more threads I add, the slower it gets.

I understand that for a small problem size the overhead would be relatively large, but I thought these were large enough problem sizes?

This is before any significant optimisation; am I doing something obviously wrong that makes it behave like this?

Here is the code:

Serial code:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{

    int m; 
    int n;
    double tol;// = 0.0001;

    int i, j, iter;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);

    /**
     * @var t, tnew, diff, difmax
     * t is the old temperature array, tnew is the new array
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, difmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    printf("%d %d %lf\n",m,n, tol);

    start = omp_get_wtime();

    // initialise temperature array
    for (i=0; i <= m+1; i++) {
        for (j=0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }

    // fix boundary conditions
    for (i=1; i <= m; i++) {
        t[i][0] = 33.0;
        t[i][n+1] = 42.0;
    }
    for (j=1; j <= n; j++) {
        t[0][j] = 20.0;
        t[m+1][j] = 25.0;
    }

    // main loop
    iter = 0;
    difmax = 1000000.0;
    while (difmax > tol) {

        iter++;

        // update temperature for next iteration
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        difmax = 0.0;
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                diff = fabs(tnew[i][j]-t[i][j]);
                if (diff > difmax) {
                    difmax = diff;
                }
                // copy new to old temperatures
                t[i][j] = tnew[i][j];
            }
        }

    }

    end = omp_get_wtime();

    // print results
    // (the loop printing the temperatures is commented out to save time)
    printf("iter = %d  difmax = %9.11lf\n", iter, difmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

Here is the parallel code:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>



int main(int argc, char *argv[])
{
    int m; 
    int n;
    double tol;// = 0.0001;

    /**
     * @brief Integer variables
     * @var i outer loop (row) counter,
     * @var j inner loop (column) counter,
     * @var iter number of iterations,
     * @var numThreads number of threads
     */
    int i, j, iter, numThreads;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);
    numThreads = atoi(argv[4]);

    /**
     * @brief Double variables
     * @var t, tnew the old and new temperature arrays (t holds the grid),
     * @var diff difference between the old and new temperature at one point,
     * @var diffmax maximum difference over the whole grid,
     * @var privDiffmax per-thread maximum difference
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, diffmax, privDiffmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    /**
     * @brief Print the problem size, the tolerance and the number of threads
     * (commented out below; it is not part of any parallel region)
     */
    //printf("%d %d %lf %d\n",m,n, tol, numThreads);
    omp_set_num_threads(numThreads);

    /**
     * @brief Initialise the timer
     * 
     */
    start = omp_get_wtime();

    /**
     * @brief Creating the parallel region:
     * Here both loop counters are private:
     */
    #pragma omp parallel private(i, j)
    {
        /**
         * @brief initialise temperature array
         * This can be in a parallel region by itself
         */
        #pragma omp for collapse(2) schedule(static)
        for (i=0; i <= m+1; i++) {
            for (j=0; j <= n+1; j++) {
                t[i][j] = 30.0;
            }
        }

        // fix boundary conditions
        #pragma omp for schedule(static)
        for (i=1; i <= m; i++) {
            t[i][0] = 33.0;
            t[i][n+1] = 42.0;
        }

        #pragma omp for schedule(static)
        for (j=1; j <= n; j++) {
            t[0][j] = 20.0;
            t[m+1][j] = 25.0;
        }

    }   

    // main loop
    iter = 0;
    diffmax = 1000000.0;

    while (diffmax > tol) {

        iter = iter + 1;

        /**
         * @brief update temperature for next iteration
         * Here we have created a parallel for directive, this is the second parallel region
         */
        #pragma omp parallel for private(i, j) collapse(2) schedule(static)
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        diffmax = 0.0;
        
        /**
         * @brief Third parallel region that compares the difference
         */
        #pragma omp parallel private(i, j, privDiffmax, diff)
        {
            privDiffmax = 0.0;
            #pragma omp for collapse(2) schedule(static)
            for (i=1; i <= m; i++) {
                for (j=1; j <= n; j++) {
                    diff = fabs(tnew[i][j]-t[i][j]);
                    if (diff > privDiffmax) {
                        privDiffmax = diff;
                    }
                    // copy new to old temperatures
                    t[i][j] = tnew[i][j];
                }
            }
            #pragma omp critical
            if (privDiffmax > diffmax)
            {
                diffmax = privDiffmax;
            }
        }
        

    }

    //Add timer for the end
    end = omp_get_wtime();

    // print results
    // (the loop printing the temperatures is commented out to save time)
    printf("iter = %d  diffmax = %9.11lf\n", iter, diffmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

Here are some of the benchmarks (timing screenshots): serial code, and parallel code with 2, 4, and 6 threads.

I have run the code and tested it; I have commented out the print statements as I only need them for testing. The code runs fine, but somehow it is slower than the serial code.

I have an 8-core Apple Mac M1.

I am new to OpenMP and can't help but think I am missing something. Any advice would be appreciated.

  • One thing that you can do better in both versions is not to copy over the new values to the old values. Instead use pointers to access the arrays and swap the pointers between iterations. – paleonix Jul 14 '22 at 11:32
  • The second thing I see is the `critical` section. You should avoid these if at all possible. In this case you should use a `reduction` clause instead. See page 233 of the [official examples](https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf). – paleonix Jul 14 '22 at 11:34
  • Until you've used the compiler to optimise the code for execution speed don't get too worked up about execution speed. Crank it to the max and report back. Oh, if possible, try it on another machine too. – High Performance Mark Jul 14 '22 at 11:45
  • Thanks, I will try this now. I just felt that even without any optimisations I still expected a multi-threaded application to be faster than serial, but I'm new to this so I don't fully get it. I will try, and I will check the official examples. –  Jul 14 '22 at 11:49
  • @paleonix The critical section is only over the threads, it's outside the iteration loop. I wouldn't think it has much effect, though, yes, it should be a reduction. So I'm a little surprised that the code doesn't speed up. It looks clean to me. – Victor Eijkhout Jul 14 '22 at 11:54
  • Is your machine truly 8 cores, or 4 cores plus hyperthreading? In the latter case, try not using that. Also make sure to `OMP_PROC_BIND=true` to prevent thread migration. – Victor Eijkhout Jul 14 '22 at 11:55
  • @VictorEijkhout You are right, this critical section is relatively harmless unless you are scaling to very big core counts (and don't have enough work). But for learning to write good OMP code, this is still an important remark. – paleonix Jul 14 '22 at 11:58
  • Ok, so I compiled and ran your code (compiled `-O2`) on my old Macbook Air (Intel i3? i5?) and I definitely get a factor of 2 speedup going from one core to 2. – Victor Eijkhout Jul 14 '22 at 12:00
  • Oh so if I get this correct, after optimisation the parallel code ran 2x faster? I will test after optimisation as well and update the question with results –  Jul 14 '22 at 12:04
  • I also find it strange that without optimisation, the more threads I add, the slower it gets –  Jul 14 '22 at 12:05
  • Does this answer your question? [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment-with-compiler-optimization-disabl) – paleonix Jul 14 '22 at 12:09
  • I will have a look at it. They expect me to show timing without any aggressive optimisation for the first task. So I will try optimise it and then do this for the next task –  Jul 14 '22 at 13:03
  • That doesn't have any OpenMP so I can't rely on that answer, I need to know why the parallel code is slower than the serial code. –  Jul 14 '22 at 13:18
  • Define "aggressive optimization", where does it start? I would say `-O2` is default release build optimization, while `-O3` or `-Ofast` might be considered "aggressive". If benchmarking unoptimized builds is meaningless without OpenMP, why would that change when adding threads? For all I know the threads could be executed sequentially w/o optimization. Unoptimized builds should only be used for debugging purposes. – paleonix Jul 14 '22 at 13:42
  • Okay, I am trying to optimise it now. Have you seen the answer below? It suggests I take away the collapse clause, but I am concerned the results may be inconsistent as I have a nested loop. If I take away the collapse clause, where do I put the parallel for? –  Jul 14 '22 at 13:57
  • As long as the number of iterations for the outermost loop (i.e. the parallel one) is much bigger than the number of cores you have, you don't need to worry. Leaving away the `collapse` should be fine in your case. – paleonix Jul 14 '22 at 14:05
  • Another general recommendation: Only define variables where you really need them, so as late as possible: i.e. use `for (int i = 0; ...)` and don't pre-define all your variables at the start of a function. This makes code more readable and also avoids accidentally sharing variables between threads. It might also result in faster binaries (See register pressure), but I wouldn't bet on that due to how good optimization is today. – paleonix Jul 14 '22 at 14:47
  • @VictorEijkhout I have a Mac book air M1, I checked and it has 8 cores –  Jul 14 '22 at 15:51

1 Answer


The problem comes from the overhead of collapse(2) on Clang. I can reproduce the problem with Clang 13.0.1 at both -O0 and -O2 on an x86-64 i5-9600KF processor, but not with GCC 11.2.0. Clang generates inefficient code when collapse(2) is used: it uses an expensive div/idiv instruction in the hot loop to be able to compute the i and j indices. Indeed, here is the assembly code of the hot loop of the sequential version (in -O1 to make the code more compact):

.LBB0_27:                               #   Parent Loop BB0_15 Depth=1
        movsd   xmm3, qword ptr [rbx + 8*rsi]   # xmm3 = mem[0],zero
        movapd  xmm5, xmm3
        subsd   xmm5, qword ptr [rdi + 8*rsi]
        andpd   xmm5, xmm0
        maxsd   xmm5, xmm2
        movsd   qword ptr [rdi + 8*rsi], xmm3
        add     rsi, 1
        movapd  xmm2, xmm5
        cmp     r12, rsi
        jne     .LBB0_27

Here is the parallel counterpart (still in -O1):

.LBB3_4:
        mov     rax, rcx
        cqo
        idiv    r12                     # <-------------------
        shl     rax, 32
        add     rax, rdi
        sar     rax, 32
        mov     rbp, rax
        imul    rbp, r13                # <-------------------
        shl     rdx, 32
        add     rdx, rdi
        sar     rdx, 32
        add     rbp, rdx
        movsd   xmm2, qword ptr [r9 + 8*rbp]    # xmm2 = mem[0],zero
        imul    rax, r8                # <-------------------
        add     rax, rdx
        movapd  xmm3, xmm2
        subsd   xmm3, qword ptr [rsi + 8*rax]
        andpd   xmm3, xmm0
        maxsd   xmm3, xmm1
        movsd   qword ptr [rsi + 8*rax], xmm2
        add     rcx, 1
        movapd  xmm1, xmm3
        cmp     rbx, rcx
        jne     .LBB3_4

There are many more instructions to execute because the loop spends most of its time computing indices. You can fix this by not using the collapse clause. In theory it should be better to expose more parallelism to compilers and runtimes and let them make the best decisions, but in practice they are not optimal and often need to be assisted/guided. Note that GCC uses a more efficient approach that consists in computing the division only once per block, so compilers can do this optimization.
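
For reference, here is a minimal sketch of what the update loop looks like with the collapse clause removed: the parallel for simply stays on the outer loop and nothing else needs to change.

// Sketch: same update loop as in the question, only the collapse(2) clause removed
#pragma omp parallel for private(i, j) schedule(static)
for (i=1; i <= m; i++) {
    for (j=1; j <= n; j++) {
        tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
    }
}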


Results

With `collapse(2)`:
- Sequential:  0.221358 seconds
- Parallel:    0.274861 seconds

Without:
- Sequential:  0.222201 seconds
- Parallel:    0.055710 seconds

Additional notes on performance

For better performance, consider using -O2 or even -O3. Also consider using -march=native. -ffast-math can also help if you do not use special floating-point (FP) values like NaN and you do not care about FP associativity. Not copying the array every time and using a double-buffering method instead also helps a lot (memory-bound codes do not scale well). Then consider reading research papers for better performance (trapezoidal tiling can be used to boost the performance even more, but this is quite complex to do). Also note that not using collapse(2) reduces the amount of parallelism, which might be a problem on a processor with a lot of cores, but in practice having a lot of cores operating on a small array tends to be slow anyway (because of false sharing and communication).
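
As a rough sketch of the double-buffering idea (combined with the reduction(max:...) clause and the loop fusion suggested in the comments), the main loop could be restructured along the following lines. It assumes both grids are heap-allocated with size (m+2) x (n+2) and that the fixed boundary values have been written into both of them; told, tnew and tmp are illustrative names rather than identifiers from the original code, and reduction(max:...) needs OpenMP 3.1 or newer.

// Sketch only: double buffering with a pointer swap instead of copying,
// plus a max reduction instead of the critical section.
double (*told)[n+2] = malloc((m+2) * sizeof *told);   // "old" grid
double (*tnew)[n+2] = malloc((m+2) * sizeof *tnew);   // "new" grid
// ... initialise told, then copy the fixed boundary rows/columns into tnew ...

iter = 0;
diffmax = 1000000.0;
while (diffmax > tol) {
    iter++;
    diffmax = 0.0;

    // one fused loop: update and measure the difference, no element-wise copy
    #pragma omp parallel for reduction(max:diffmax) schedule(static)
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            tnew[i][j] = (told[i-1][j] + told[i+1][j]
                        + told[i][j-1] + told[i][j+1]) / 4.0;
            double diff = fabs(tnew[i][j] - told[i][j]);
            if (diff > diffmax) diffmax = diff;
        }
    }

    // swap the buffers: the new grid becomes the old one for the next iteration
    double (*tmp)[n+2] = told;
    told = tnew;
    tnew = tmp;
}

Swapping the pointers makes the element-wise copy (and the second pass over the whole array) unnecessary, which matters for a memory-bound code like this one.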

Special note for M1 processors

M1 processors are based on a big/little architecture. Such an architecture is good for making sequential code faster thanks to the few "big" cores that run fast (but also consume a lot of space and energy). However, running parallel code efficiently is harder because the "little" cores (which are energy-efficient and small) are much slower than the big ones, introducing load-imbalance issues if both kinds of cores run simultaneously (IDK if this is the case on M1 by default). One solution is to control the execution so as to only use the same kind of core. Another solution is to use dynamic scheduling so as to balance the work automatically at runtime (eg. using the clause schedule(guided) or even schedule(dynamic)). The second solution tends to add significant overhead and is known to cause other tricky issues on (NUMA-based) computing servers (or even recent AMD PC processors). It is also important to note that the scaling will not be linear with the number of threads because of the performance difference between the big and little cores. Such architectures are currently poorly supported by many applications because of the above issues.
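
For instance, the dynamic-scheduling option mentioned above is only a change to the schedule clause of the update loop; the chunk size of 8 rows below is an arbitrary guess that would need tuning, and setting OMP_PROC_BIND=true (as suggested in the comments) additionally keeps threads from migrating between the two kinds of cores:

// Sketch: same update loop, but rows are handed out dynamically so that faster
// (big) cores automatically take more work than slower (little) ones.
#pragma omp parallel for private(i, j) schedule(dynamic, 8)
for (i=1; i <= m; i++) {
    for (j=1; j <= n; j++) {
        tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
    }
}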

Jérôme Richard
  • Thank you for this, I will try to take away collapse, but because I have a nested loop, without collapse, where do I put the parallel for? On the inside or outside? Please can you show an example of how to parallelise a nested loop without collapse? –  Jul 14 '22 at 13:54
  • @theMyth You put it outside. Putting it inside gives worse cache locality. I.e. you just need to get rid of the `collapse` clause, nothing else. – paleonix Jul 14 '22 at 13:56
  • I don't know why it didn't make any difference, I am updating the question. –  Jul 14 '22 at 14:08
  • Do you think that one possibility is that because the code is entering and exiting the parallel region twice in every iteration of the while loop, that must be adding to the overhead? –  Jul 14 '22 at 14:29
  • @theMyth This is possible but unlikely to be the case in `-O0` (though AFAIK M1 processors have a higher latency for parallel code communications). Note the Clang runtime should not recreate the threads in this case. You can increase the domain size to check if results are better (ie. add more work). If this is not better, then this is not due to the overhead of entering/exiting the parallel loop. Note that M1 processors are a bit special so I will add a note on this in the answer. – Jérôme Richard Jul 14 '22 at 14:37
  • Thanks, I am updating my question too. By the way, if it helps, I am using the gcc-11 compiler that came with Homebrew. I apologise, I am still new to C, so a lot of what you are saying I am not familiar with :) –  Jul 14 '22 at 14:39
  • @theMyth As it would be very easy to merge the two parallel regions by having the `diffmax = 0.0;` at the beginning of the `while` region, I would just benchmark both versions to be sure. – paleonix Jul 14 '22 at 14:51
  • Yes, I will test this and update the question. Can I not put the whole while loop inside one parallel region and keep the diffmax variable inside a single clause? –  Jul 14 '22 at 15:01
  • @theMyth It's possible, but that might be premature optimization. If you actually see a speedup by merging the two regions it might be worth it, but I would try that out first. Another experiment worth benchmarking would be to actually do loop-fusion. If you got rid of the copy between new and old as I recommended earlier (and also recommended in this answer as "double-buffering"), there is nothing holding you from combining the loops. This can both increase and decrease performance due to fewer loads, but higher register pressure. – paleonix Jul 14 '22 at 15:04
  • I figured out what the problem is: it's this Mac M1. It has 4 performance cores and 4 cores for "efficiency", and that is actually making it slower. Strangely, the more I optimise the code, the worse it gets. I can't even be sure the results will be consistent. I guess I need to get like an 8-core VM with Linux and try this. Yes, removing the collapse clause helps but that benefit is very marginal and it's still lacking. Which is why with 6 cores it's worse than with 8 cores. –  Jul 15 '22 at 11:56
  • @theMyth I indicated such a thing in the "*Special note for M1 processors*" section. Specifically, "*One solution is to control the execution so as to only use the same kind of core*" (the big cores if possible) and the "*load-imbalance*". For the former, please consider binding threads to cores (eg. using taskset if available on Mac, and using hwloc so as to check the mapping). [This post](https://stackoverflow.com/questions/64409563#64415109) can help a bit for that. – Jérôme Richard Jul 15 '22 at 12:29
  • Thank you, yes I am sorry I saw that yesterday, but I was so stressed for deadlines, I did not go through it properly. I have accepted your answer :) –  Jul 15 '22 at 12:43