
The serial version takes less time than the parallel one.

/*Serial Version*/
double start = omp_get_wtime();

for (i = 0; i < 1100; i++) {
    for (j = i; j < i + 4; j++) {
        fprintf(new_file, "%f  ", S[j]);
    }
    fprintf(new_file, "\n");
    m = compute_m(S + i, 4);
    find_min_max(S + i, 4, &min, &max);

    S_i = inf(m, min, b); 
    S_s = sup(m, max, b); 

    if (S[i + 2] < S_i)
        Res[i] = S_i;
    else if (S[i + 2] > S_s)
        Res[i] = S_s;
    else
        Res[i] = ECG[i + 2];
    fprintf(output_f, "%f\n", Res[i]);
}

    

double end = omp_get_wtime();
printf("\n ------------- TIMING :: Serial Version -------------- ");
printf("\nStart = %.16g\nend = %.16g\nDiff_time = %.16g\n", start, end, end - start);


/*Parallel Version*/
double start = omp_get_wtime();

#pragma omp parallel for
for (i = 0; i < 1100; i++) {
#pragma omp parallel for
    for (j = i; j < i + 4; j++) {
        /* same code as in the serial version */
    }
    /* same code as in the serial version */
}

double end = omp_get_wtime();
printf("\n ------------- TIMING :: Parallel Version -------------- ");
printf("\nStart = %.16g\nend = %.16g\nDiff_time = %.16g\n", start, end, end - start);


I have tried multiple times, and the serial execution is always faster. Why is the serial version faster here? Am I calculating the execution time in the right way?

toto01
    Depends on what "serial code" is. If it's something reasonably fast to execute, there's no obvious reason why multi-threading would be more effective. A serial version might lead to better cache use etc. It's impossible to speculate about why without real code. – Lundin Aug 11 '20 at 11:25
  • 2
    Though you can try to comment out the inner loop #pragma omp and see if that makes any difference. – Lundin Aug 11 '20 at 11:26
  • Please provide a [mcve]. – Zulan Aug 11 '20 at 12:01
  • @Lundin the serial code includes some operations like reading and writing text files and computing some parameters; I wanted to parallelize the for loop to compare the serial version with the parallel one. – toto01 Aug 11 '20 at 12:01
  • 5
    *serial code include some operations like reading, writing in text files* I/O is one sure way to tie a large deadweight to a parallel program. Unless, that is, you have a parallel file system - do you ? – High Performance Mark Aug 11 '20 at 12:06
  • thanks all for your time. @HighPerformanceMark I have just a normal text file, not a parallel one. I edited the code to show what kind of operations I'm working on; it's still the same problem. – toto01 Aug 11 '20 at 14:31
  • First of all, like mentioned above, you don't want to have a second omp parallel pragma in front of that small loop. It could make omp spawn way more threads than you have cores in your hardware. Printing to a file in parallel can be done, but not without a lot more diligence, as you get race conditions between the threads trying to write to the same file. If you are "lucky" the write process just makes the program very slow, as there is contention with threads waiting for the file. If you are unlucky, your file contents will be totally mixed up. – paleonix Aug 11 '20 at 14:48
  • So you probably want to cut your big loop into three independent loops. The first for writing out S serially, the second one to calculate Res in parallel and the third one to write out res serially again. – paleonix Aug 11 '20 at 14:53
  • As you seem to be a beginner regarding omp, I would suggest adding a default(none) clause to the parallel pragma and then shared/private/firstprivate/lastprivate clauses to make sure everything is in the right place. – paleonix Aug 11 '20 at 16:17
  • First of all, @Paul thanks for your reply, but when I put just one omp pragma my file contents get totally mixed up, and the way I wrote it works fine; the only problem is the time. Yes, you're right, I'm a beginner regarding omp. If you can, please clarify that last suggestion and how I can add those clauses to my code; it would be very appreciated. – toto01 Aug 11 '20 at 17:49
  • I have no idea why it would work with the nested parallel region, but I wouldn't trust it, because it should introduce even more race conditions/threads trying to get the same resource (the file). Just keep file IO out of your parallel sections. – paleonix Aug 11 '20 at 18:10
  • That last suggestion probably doesn't directly affect your problem, but it is always safer to specify which data is shared between the threads (S and Res for sure) and which is private to each thread. With the private clause the instances inside the parallel region will NOT be initialized. So if you want every thread to have a private copy of some data you use the firstprivate clause, and if you need the content of a private variable after the parallel region, you specify the lastprivate clause. – paleonix Aug 11 '20 at 18:11
  • b could be shared or firstprivate (if it is a (big) array it should be shared, if it's just a variable it can be firstprivate). m, min, max, S_i and S_s seem to only be filled inside the loop, so they should be private. E.g. #pragma omp parallel for default(none) shared(S, Res, b) private(m, min, max, S_i, S_s) – paleonix Aug 11 '20 at 18:13
  • @Paul. `#pragma omp parallel { int ID = omp_get_thread_num(); if (ID == 0) { serial code } if (ID == 1) {the same code } etc ... }` Can you please explain what these lines mean? I understand that whichever of the threads (0, 1, ...) is free will be in charge of executing the code. Is that right? – toto01 Aug 12 '20 at 11:04
  • Pretty much yes, but normally you would rather use #pragma omp parallel sections { #pragma omp section { serial code } #pragma omp section { serial code }} This way it works even when you have only one thread. – paleonix Aug 12 '20 at 19:33
  • You could try a collapse clause instead of two parallel for statements. Check [this answer](https://stackoverflow.com/a/13357158/11365539) – Warpstar22 Aug 13 '20 at 01:11
  • @Warpstar22 Not really. I think it won't even work, because for collapse all the work has to be inside both loops. And again: file IO in parallel is not trivial and should be avoided if you don't know exactly what you are doing. If it wasn't file IO in the inner loop, it still would not make sense, as long as the outer loop has significantly more iterations than you have cores to execute your threads on. For today's hardware, 1100 is more than enough parallelism. – paleonix Aug 13 '20 at 11:38
  • @Paul, yes, it does not work; I already tried collapse before I posted the question here, and it makes the file mixed up. – toto01 Aug 13 '20 at 12:21

1 Answer


Assuming that compute_m does not write to S and that find_min_max only writes min and max without reading their previous values, this should work.

/*Parallel Version A*/
double start = omp_get_wtime();

const int nThreads = omp_get_max_threads();

#pragma omp parallel sections num_threads(2) default(none) shared(S, Res, ECG, b, new_file, output_f, nThreads) private(i, j, min, max, m, S_i, S_s)
{
#pragma omp section
    for (i = 0; i < 1100; i++) {
        for (j = i; j < i + 4; j++) {
            fprintf(new_file, "%f  ", S[j]);
        }
        fprintf(new_file, "\n");
    }
#pragma omp section
    {
#pragma omp parallel for num_threads(nThreads - 1) default(none) shared(S, Res, ECG, b) private(min, max, m, S_i, S_s)
        for (i = 0; i < 1100; i++) {
            m = compute_m(S + i, 4);
            find_min_max(S + i, 4, &min, &max);

            S_i = inf(m, min, b); 
            S_s = sup(m, max, b); 

            if (S[i + 2] < S_i)
                Res[i] = S_i;
            else if (S[i + 2] > S_s)
                Res[i] = S_s;
            else
                Res[i] = ECG[i + 2];
        }
        for (i = 0; i < 1100; i++) {
            fprintf(output_f, "%f\n", Res[i]);
        }
    }
}

double end = omp_get_wtime();
printf("\n ------------- TIMING :: Parallel Version A -------------- ");
printf("\nStart = %.16g\nend = %.16g\nDiff_time = %.16g\n", start, end, end - start);

A slightly less complicated solution would be this one:

/*Parallel Version B*/
double start = omp_get_wtime();

#pragma omp parallel default(none) shared(S, Res, ECG, b, new_file, output_f) private(i, j, min, max, m, S_i, S_s)
{
#pragma omp for 
    for (i = 0; i < 1100; i++) {
        m = compute_m(S + i, 4);
        find_min_max(S + i, 4, &min, &max);

        S_i = inf(m, min, b); 
        S_s = sup(m, max, b); 

        if (S[i + 2] < S_i)
            Res[i] = S_i;
        else if (S[i + 2] > S_s)
            Res[i] = S_s;
        else
            Res[i] = ECG[i + 2];
    }

#pragma omp sections
    {
#pragma omp section
        for (i = 0; i < 1100; i++) {
            for (j = i; j < i + 4; j++) {
                fprintf(new_file, "%f  ", S[j]);
            }
            fprintf(new_file, "\n");
        }
#pragma omp section
        for (i = 0; i < 1100; i++) {
            fprintf(output_f, "%f\n", Res[i]);
        }
    }
}

double end = omp_get_wtime();
printf("\n ------------- TIMING :: Parallel Version B -------------- ");
printf("\nStart = %.16g\nend = %.16g\nDiff_time = %.16g\n", start, end, end - start);

In the first version the calculation happens in parallel with writing out S; in the second version the calculations happen first, and then S and Res are written to their files in parallel with each other. I wouldn't bet on which one is faster, so just try both on your hardware.

These can still be slower than the serial version, because spawning threads always has some overhead.

paleonix