
I am attempting to speed up this for loop with OpenMP parallelization. I was under the impression that this should split up the work across a number of threads. However, perhaps the overhead is too large for this to give me any speedup.

I should mention that this loop occurs many many many times, and each instance of the loop should be parallelized. The number of loop iterations, newNx, can be as small as 3 or as large as 256. However, if I conditionally have it parallelized only for newNx > 100 (only the largest loops), it still slows down significantly.

Is there anything in here which would cause this to be slower than anticipated? I should also mention that the vectors A, v, and b are VERY large, but access is O(1), I believe.

    #pragma omp parallel for private(j, k) shared(A, v, b)
    for (i = 1; i <= newNx; i += 2) {
      for (j = 1; j <= newNy; j++) {
        for (k = 1; k <= newNz; k++) {

          nynz = newNy * newNz;

          v[(i-1)*nynz + (j-1)*newNz + k] =
            -( v[(i-1)*nynz + (j-1)*newNz + k+1 - 2*(k/newNz)] * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kup+offA]
             + v[(i-1)*nynz + (j-1)*newNz + k-1 + 2*(1/k)]     * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kdo+offA]
             + v[(i-1)*nynz + (j - 2*(j/newNy))*newNz + k]     * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jup+offA]
             + v[(i-1)*nynz + (j-2 + 2*(1/j))*newNz + k]       * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jdo+offA]
             + v[(i - 2*(i/newNx))*nynz + (j-1)*newNz + k]     * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + iup+offA]
             + v[(i-2 + 2*(1/i))*nynz + (j-1)*newNz + k]       * A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ido+offA]
             - b[(i-1)*nynz + (j-1)*newNz + k]
             ) / A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ifi+offA];
        }
      }
    }
user2770042
    I too have to slow down drastically to read any of this... I wouldn't blame OpenMP... – Kerrek SB Sep 11 '13 at 19:31
    Just five letters: [IOCCC](http://www.ioccc.org/) (Seriously, If you take the effort to make this code more readable, maybe somebody will give you some hints) – Massimiliano Sep 11 '13 at 19:44
  • OpenMP handles *Highly Parallel* loads quite well, meaning that each iteration of the loop *MUST* be independent from all other iterations. Secondly you need to have enough iterations and a large enough workload to make it worthwhile, having a small loop where each iteration executes fast is usually not worth it, even if there are many iterations. – Mgetz Sep 11 '13 at 20:01
  • Thanks... It's basically a bunch of array calls on 3 large arrays, if you don't feel like reading it – user2770042 Sep 11 '13 at 20:04
    Looking over your iteration count... OpenMP is not going to help this, for OpenMP to help you need much higher iteration counts and each iteration of the loop needs to be much more significant. You are probably running into other issues such as locality of reference. I would highly suggest running a profiler to figure out what needs to be optimized instead of just jumping into it. – Mgetz Sep 11 '13 at 20:09
  • OK, thank you for the quick and informative response. I will look into it asap. – user2770042 Sep 11 '13 at 20:11
  • Use function macros, really – drahnr Sep 12 '13 at 14:35

1 Answer


Assuming you don't have a race condition, you can try fusing the loops. Fusing gives larger chunks to parallelize, which helps reduce the effect of false sharing and will likely distribute the load better as well.

For a triple loop like this

for(int i2=0; i2<x; i2++) {
    for(int j2=0; j2<y; j2++) {
        for(int k2=0; k2<z; k2++) {
            //
        }
    }
}

you can fuse it like this

#pragma omp parallel for
for(int n=0; n<(x*y*z); n++) {
    int i2 = n/(y*z);
    int j2 = (n%(y*z))/z;
    int k2 = (n%(y*z))%z;
    //
}

In your case you can do it like this

int i, j, k, n;
int x = newNx%2 ? newNx/2+1 : newNx/2;
int y = newNy;
int z = newNz;

#pragma omp parallel for private(i, j, k)
for(n=0; n<(x*y*z); n++) {
    i = 2*(n/(y*z)) + 1;
    j = (n%(y*z))/z + 1;
    k = (n%(y*z))%z + 1;
    // rest of code
}

If this successfully speeds up your code then you can feel good that you made your code faster and at the same time obfuscated it even further.

Z boson
    +1 just because you even attempted to read that code. Wouldn't fusing also be achieved with `collapse` directives? – rath Oct 24 '13 at 01:40
    Yes, but that requires a version of OpenMP not supported by MSVC and I like to have code that works with multiple compilers so I just do the fusing myself. – Z boson Oct 24 '13 at 09:31
  • Suggestion (requires math.h): int x = ceil(newNx/2); – ofer.sheffer Apr 12 '17 at 16:03