
This should be fairly simple, but I'm running into an issue trying to parallelize a basic nested for loop with OpenMP:

for(z=start;z<=end;z++){
    offset=sizeof(int)*(z*r*c);
    fseek(fpIn,offset,SEEK_SET);
    fread(tempbuffer,sizeof(int),r*c,fpIn);

    /* z must stay shared: privatizing it left each thread with an
       uninitialized copy, breaking the bounds check below */
    #pragma omp parallel for collapse(2) private(x,y) schedule(static)
    for(y=0;y<c;y++){
        for(x=0;x<r;x++){
            if(z>=z0 && z<z1 && y>=y0 && y<y1 && x>=x0 && x<x1){
                volbuffer[y*r+x] = proc(tempbuffer[y*r+x]); /* stride is r, since x runs over [0,r) */
            }
        }
    }
    fseek(fpOut,offset,SEEK_SET);
    fwrite(volbuffer,sizeof(int),r*c,fpOut);
}

where proc() is a function that does some very basic arithmetic on the input value. However, it turns out to be super slow when run. The #pragma only affects a simple 1D array. volbuffer and tempbuffer are the same size; I just read into the temporary buffer to reduce the possibility of false sharing, yet this still scales very poorly. What am I doing wrong here?

Here r and c are the sides of a matrix, and what I'm trying to do is edit each value of the matrix. The proc(val) function only consists of return val + 5, so it shouldn't take long.
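For context, here is a self-contained version of the per-slice loop. The `process_slice` wrapper, the `proc` body, and the `if` clause threshold are illustrative additions, not my actual code; the `if` clause is one way to make OpenMP skip parallelization when the trip count is too small to pay for thread-team startup:

```c
/* Stand-in matching the proc() described above. */
static int proc(int val) { return val + 5; }

/* One z-slice of the loop. The if clause is a hypothetical mitigation:
 * below the (illustrative) 10000-element threshold the loop runs
 * serially, avoiding parallel-region setup cost that dwarfs 81 adds. */
void process_slice(int *volbuffer, const int *tempbuffer, int r, int c) {
    int x, y;
    #pragma omp parallel for collapse(2) private(x, y) schedule(static) \
            if(r * c > 10000)
    for (y = 0; y < c; y++)
        for (x = 0; x < r; x++)
            volbuffer[y * r + x] = proc(tempbuffer[y * r + x]);
}
```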

Running it on a 9x9 matrix (r = c = 9), I get the following benchmarks. Without OpenMP:

0.290381 seconds
0.287123 seconds
0.293081 seconds
0.298092 seconds

With OpenMP:

0.516495 seconds
0.511104 seconds
0.508267 seconds
0.521731 seconds

I'm using an i7-8550U, if that is of significance.
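One thing worth ruling out with numbers like these: `clock()` sums CPU time across all threads, while `omp_get_wtime()` measures elapsed wall time, which is what actually matters for speedup. A sketch of measuring both (the `cpu_seconds`/`wall_seconds` helper names are illustrative, not from my code):

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* clock() reports CPU time summed over every thread, so with OpenMP
 * this number grows with the thread count even when the program
 * finishes sooner. Returns CPU seconds consumed by fn(). */
double cpu_seconds(void (*fn)(void)) {
    clock_t c0 = clock();
    fn();
    return (double)(clock() - c0) / CLOCKS_PER_SEC;
}

/* omp_get_wtime() measures wall-clock time; without OpenMP we fall
 * back to the CPU clock, which is equivalent for a single thread. */
double wall_seconds(void (*fn)(void)) {
#ifdef _OPENMP
    double w0 = omp_get_wtime();
    fn();
    return omp_get_wtime() - w0;
#else
    return cpu_seconds(fn);
#endif
}
```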

Ilknur Mustafa
    When you claim poor scaling, you should provide the numbers. What are the values of `c` and `r`? How long does a single call to `proc()` take? What scaling do you get? What do you expect? Give numbers and a hardware description, and if necessary show the source code of `proc`. I doubt anyone here has a crystal ball that can tell them those details. – Hristo Iliev May 24 '20 at 19:21
  • I've added some extra info which I hope will be useful, anything else I should clear up? – Ilknur Mustafa May 24 '20 at 19:38
    So you have 81 calls to a function that adds 5? Just setting up the OpenMP parallel region takes orders of magnitude longer time. Besides, `collapse(2)` results in `x` and `y` being recreated from a single linear index, which introduces integer division and modulo operations - not the fastest operations around. – Hristo Iliev May 24 '20 at 20:13
    Just want to check that you are timing your program properly. If you measure CPU clocks, then OpenMP code will show the combined CPU clocks on ALL cores. Check this out to make sure you are using the right timing: https://stackoverflow.com/questions/10874214/measure-execution-time-in-c-openmp-code – Warpstar22 May 25 '20 at 21:00
  • With such timings on such a small matrix, I guess that `end-start` must be big. Thus, almost all the time should be spent in the IO-based operations, so it is probably better to focus on that part... – Jérôme Richard May 26 '20 at 10:57
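Hristo Iliev's point about `collapse(2)` can be sketched as follows: conceptually, the runtime fuses the two loops into one linear index and recovers `y` and `x` with an integer division and a modulo on every iteration. The `process_collapsed` wrapper and `proc` body below are illustrative stand-ins, not code from the question:

```c
/* Stand-in for the question's proc(). */
static int proc(int val) { return val + 5; }

/* Roughly what collapse(2) does under the hood: one linear index,
 * with (y, x) recovered via division and modulo each iteration,
 * which is measurable overhead when the loop body is a single add. */
void process_collapsed(int *volbuffer, const int *tempbuffer, int r, int c) {
    for (long idx = 0; idx < (long)r * c; idx++) {
        int y = (int)(idx / r);  /* recovered outer index */
        int x = (int)(idx % r);  /* recovered inner index */
        volbuffer[y * r + x] = proc(tempbuffer[y * r + x]);
    }
}
```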

0 Answers