OpenACC red-black Gauss-Seidel slower than CPU

Question

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, and that is much faster than both (for 512x512, on the order of 2 seconds compared to 25 for CPU and OpenACC).

Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for larger problems (like the Jacobi solver for the same sort of problem demonstrated here).

Here is the relevant code (the full working source is here):

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
  Real norm_L2 = 0.0;

  // update red cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
      reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_red = col * ((NUM / 2) + 2) + row;        // local (red) index
      int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1);  // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_red[ind_red];
      temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_red[ind_red] - temp_old;
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // update black cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
          reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_black = col * ((NUM / 2) + 2) + row;      // local (black) index
      int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1);    // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_black[ind_black];
      temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_black[ind_black] - temp_old;       
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // calculate residual
  norm_L2 = sqrt(norm_L2 / ((Real)size));

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

  // if tolerance has been reached, end SOR iterations
  if (norm_L2 < tol) {
    break;
  }
}

OK, when I increase the problem size to 1024x1024 (`NUM=1024`), the results are more favorable: OpenMP with 4 CPU threads finishes in about 71 seconds, while the OpenACC does it in about 50. This is still much slower than the CUDA (about 22 s), though. — Kyle Niemeyer, Oct 19 '12 at 18:17

score 3 · Accepted Answer · answered Oct 19 '12 at 20:04

3

Alright, I found a semi-solution that reduces the time somewhat significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before I start my timer, in order to activate and set the GPU, the time for the 512x512 problem drops to 9.8 seconds, and down to 42 for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.

With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap getting closer to just a bit slower (~1.2) as the problem size gets bigger and bigger.

answered Oct 19 '12 at 20:04

Kyle Niemeyer

301
2
12

acc_init takes 1~2 seconds. How much it takes to run 512x512 and 1024x1024 problems on CPU? – lashgar Oct 20 '12 at 08:14
As I mentioned, for the 512x512 problem, **without** acc_init before I start the timer, the OpenACC code ran in about 25 seconds. With it after the timer starts, it dropped to about 9.8 seconds. For 1024x1024, it went from about 50 to 42 seconds. I read elsewhere that acc_init should only take a few seconds, but the effect on my code (for smaller problems) seems to be more than this. – Kyle Niemeyer Oct 20 '12 at 23:39

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

I download your full code and i compiled and run it! Did't stop run and for instruction

if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

the result was:

100, nan

200, nan

....

I changed all variables with type Real into type float and the result was:

100, 0.000654

200, 0.000370

..., ....

..., ....

8800, 0.000002

8900, 0.000002

9000, 0.000001

9100, 0.000001

9200, 0.000001

9300, 0.000001

9400, 0.000001

9500, 0.000001

9600, 0.000001

9700, 0.000001

CPU

Iterations: 9796

Total time: 5.594017 s

With NUM = 1024 the result was:

Iterations: 27271

Total time: 25.949905 s

OpenACC red-black Gauss-Seidel slower than CPU

2 Answers2