
I am doing a thermodynamic simulation on a two-dimensional array of size 1024x1024. The while loop iterates for a specified number of times, or until goodTempChange is false. goodTempChange is set true or false based on whether the change in temperature of a block is greater than a defined EPSILON value; if every block in the array changes by less than that value, then the plate is in stasis.

The program works and I have no problems with the code itself. My problem is that the serial code is absolutely blowing the OpenMP code out of the water, and I don't know why. I have tried removing everything except the average calculation, which is just the average of the four blocks up, down, left and right of the desired square, and it is still getting destroyed by the serial code.

I've never done OpenMP before, and I looked up some things online to do what I have. I have the variables within critical regions in the most efficient way I could see possible, and I have no race conditions. I really don't see what is wrong. Any help would be greatly appreciated. Thanks.

while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
  goodTempChange = false;
  if((iterationCounter % 1000 == 0) && (iterationCounter != 0)) {
    cout << "Iteration Count      Highest Change    Center Plate Temperature" << endl;
    cout << "-----------------------------------------------------------" << endl;
    cout << iterationCounter << "               "
         << highestChange << "            " << newTemperature[MID][MID] << endl;
    cout << endl;
  }

  highestChange = 0;

  if(iterationCounter != 0)
    memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));

  for(int i = 1; i < MAX-1; i++) {  
  #pragma omp parallel for schedule(static) 
    for(int j = 1; j < MAX-1; j++) {
      bool tempGoodChange = false;
      double tempHighestChange = 0;
      newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                              oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;

      if((iterationCounter + 1) % 1000 == 0) {
        if(abs(oldTemperature[i][j] - newTemperature[i][j]) > highestChange)
          tempHighestChange = abs(oldTemperature[i][j] - newTemperature[i][j]);
        if(tempHighestChange > highestChange) {
          #pragma omp critical
          {
            if(tempHighestChange > highestChange)
              highestChange = tempHighestChange;
          }
        }
      }
      if(abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON
         && !tempGoodChange)
        tempGoodChange = true;

      if(tempGoodChange && !goodTempChange) {
        #pragma omp critical
        {
          if(tempGoodChange && !goodTempChange)
            goodTempChange = true;
        }
      }
    }
  }
  iterationCounter++;
}
  • Have you considered the effects of your computation on the cache? – Ira Baxter Oct 03 '13 at 04:34
  • Parallel computation works best when you can divide up your work in such a way that different cores can **independently** do the work without having to "talk" to other cores. The less data you share between threads, the better. For example, you may want to look and see if your code suffers from false sharing effects. Also consider the overhead of launching and scheduling threads. – In silico Oct 03 '13 at 04:37
  • @IraBaxter I've not considered anything with the cache; quite honestly, I'm not sure how to even go about that. – user1830219 Oct 03 '13 at 04:57
  • @Insilico I think I covered any sharing effects through the critical pragmas as well as the private temp variables defined within the loop. – user1830219 Oct 03 '13 at 04:58
  • For nested loops it is better to parallelize the external loop, not the internal one as you do. – Alex F Oct 03 '13 at 06:03
  • @AlexFarber that still shouldn't result in a slowdown though, right? – user1830219 Oct 03 '13 at 11:16
  • @AlexFarber I switched it to the outside and it is still running slowly – user1830219 Oct 03 '13 at 11:17
  • Actually, they ran in about the same time. I believe that in OpenMP 2.0, when you parallelize the outside loop it does not parallelize the inner one, and you have to put the pragma on the inner loop. That is why they are now running at about the same speed. – user1830219 Oct 03 '13 at 11:37
  • @Insilico I think there is not much false sharing here (though definitely more than when parallelising the outer loop; but the OP presumably didn't understand that part of your comment anyway). I doubt that the contention on the critical sections is performance critical. – Walter Oct 03 '13 at 17:50
  • @user1830219 a trivial question: did you check that your OpenMP code actually uses > 1 thread? (write output from within a parallel region)! – Walter Oct 03 '13 at 18:01
  • @Walter Yes it does. I changed the number of threads it uses via num_threads in the pragma statement, and 1 thread is the fastest; the more threads I add (up to 4, I have a quad core), the slower it gets. – user1830219 Oct 03 '13 at 19:54
  • @user1830219 This is very mysterious and makes very little sense. From what you provided, this should parallelise nicely. Thus, it looks like something outside the part of the code shown is at fault. I think if you want us to help you solve this problem you must provide an SSCCE. – Walter Oct 04 '13 at 08:40

3 Answers


Trying to get rid of those critical sections may help. For example:

#pragma omp critical
{
  if(tempHighestChange > highestChange)
  {
    highestChange = tempHighestChange;
  }
}

Here, you can store the highestChange computed by each thread in a local variable and, when the parallel section finishes, take the maximum of the per-thread highestChange values.
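A minimal sketch of that idea, reusing newTemperature, oldTemperature, MAX and EPSILON from the question (untested; the local* names are only illustrative):

double highestChange  = 0;
bool   goodTempChange = false;

#pragma omp parallel
{
  double localHighestChange = 0;      // private to this thread
  bool   localGoodChange    = false;  // private to this thread

#pragma omp for
  for(int i = 1; i < MAX-1; i++)
    for(int j = 1; j < MAX-1; j++) {
      newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                              oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;
      double change = std::abs(oldTemperature[i][j] - newTemperature[i][j]);
      if(change > localHighestChange) localHighestChange = change;
      if(change > EPSILON)            localGoodChange    = true;
    }

  // one critical section per thread, entered only after all of its rows are done
#pragma omp critical
  {
    if(localHighestChange > highestChange) highestChange = localHighestChange;
    if(localGoodChange)                    goodTempChange = true;
  }
}

(OpenMP 3.1 and later also provide reduction(max:highestChange) for this kind of update.)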

ChronoTrigger

Here is my attempt (not tested).

double**newTemperature;
double**oldTemperature;

while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
  if((iterationCounter % 1000 == 0) && (iterationCounter != 0))
    std::cout
      << "Iteration Count      Highest Change    Center Plate Temperature\n"
      << "---------------------------------------------------------------\n" 
      << iterationCounter << "               "
      << highestChange << "            "
      << newTemperature[MID][MID] << '\n' << std::endl;

  goodTempChange = false;
  highestChange  = 0;

  // swap pointers to the arrays (but not the arrays themselves!) instead of memcpy
  if(iterationCounter != 0)
    std::swap(newTemperature, oldTemperature);

  bool CheckTempChange = (iterationCounter + 1) % 1000 == 0;
#pragma omp parallel
  {
    bool localGoodChange = false;
    double localHighestChange = 0;
#pragma omp for
    for(int i = 1; i < MAX-1; i++) {
      //
      // note that putting a second
      // #pragma omp for
      // here has (usually) zero effect. this is called nested parallelism and
      // usually not implemented, thus the new nested team of threads has only
      // one thread.
      //
      for(int j = 1; j < MAX-1; j++) {
        newTemperature[i][j] = 0.25 *   // multiply is faster than divide
          (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
           oldTemperature[i][j-1] + oldTemperature[i][j+1]);
        if(CheckTempChange)
          localHighestChange =
            std::max(localHighestChange,
                     std::abs(oldTemperature[i][j] - newTemperature[i][j]));
        localGoodChange = localGoodChange ||
          std::abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON;
        // shouldn't this be < EPSILON in the previous line?
      }
    }
    //
    // note that we have moved the critical sections out of the loops to
    // avoid any potential issues with contentions (on the mutex used to
    // implement the critical section). Also note that I named the sections,
    // allowing simultaneous update of goodTempChange and highestChange
    //
    if(!goodTempChange && localGoodChange)
#pragma omp critical(TempChangeGood)
      goodTempChange = true;
    if(CheckTempChange && localHighestChange > highestChange)
#pragma omp critical(TempChangeHighest)
      highestChange = std::max(highestChange,localHighestChange);
  }
  iterationCounter++;
}

There are several changes to your original:

  1. The outer loop of the nested for loops, not the inner one, is performed in parallel. This should make a significant difference. Added in edit: it appears from the comments that you don't understand the significance of this, so let me explain. In your original code, the outer loop (over i) was done only by the master thread. For every i, a team of threads was created to perform the inner loop over j in parallel. This creates a synchronisation overhead (with significant imbalance) at every i! If one instead parallelises the outer loop over i, this overhead is encountered only once, and each thread runs the entire inner loop over j for its share of i. Always parallelising the outermost loop possible is a basic rule of multi-threaded coding (see the short sketch at the end of this answer).

  2. The double for loop is inside a parallel region, to reduce the critical-region calls to one per thread per while-loop iteration. You may also consider putting the whole while loop inside the parallel region.

  3. I also swap between two arrays (similar to what is suggested in other answers) to avoid the memcpy, but this shouldn't really be performance critical. Added in edit: std::swap(newTemperature,oldTemperature) only swaps the pointer values and not the memory pointed to; that is, of course, the point.

Finally, don't forget that the proof of the pudding is in the eating: just try what difference it makes to have the #pragma omp for in front of the inner or the outer loop. Always do experiments like this before asking on SO -- otherwise you can rightly be accused of not having done sufficient research.
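For illustration, here is a minimal sketch of the two placements, reusing the arrays and MAX from the question (untested):

// A: one team of threads does the whole grid; forked/synchronised once per sweep
#pragma omp parallel for
for(int i = 1; i < MAX-1; i++)
  for(int j = 1; j < MAX-1; j++)
    newTemperature[i][j] = 0.25 * (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                                   oldTemperature[i][j-1] + oldTemperature[i][j+1]);

// B: as in the question: the team is forked/synchronised at every single i
for(int i = 1; i < MAX-1; i++) {
  #pragma omp parallel for
  for(int j = 1; j < MAX-1; j++)
    newTemperature[i][j] = 0.25 * (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                                   oldTemperature[i][j-1] + oldTemperature[i][j+1]);
}

In B, the parallel-region overhead is paid MAX-2 times per sweep; in A it is paid once.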

Walter
  • How does swap avoid effectively doing a memcpy (behind the scenes)? And why is it important to preserve the *content* of the 2nd to last iteration (e.g., doesn't the swap actually copy its data too)? – Ira Baxter Oct 03 '13 at 15:19
  • I think that in OpenMP 2.0, when you parallelize a nested for loop, if you make the outer loop parallel the inner loop will not be a parallel for. There is a clause in newer versions that allows you to do this, though. – user1830219 Oct 03 '13 at 15:27
  • Minimizing the critical calls by putting the nested loop inside a parallel region would definitely help, though, if that works. I'll have to try it. – user1830219 Oct 03 '13 at 15:32
  • @user1830219 your comments demonstrate little experience/understanding of parallel coding. I have therefore edited my answer to explain things a bit more. I hope that makes more sense for you now. – Walter Oct 03 '13 at 17:43
  • @Walter yes, this is the first time I've done parallel programming; I am just learning it – user1830219 Oct 03 '13 at 19:55
  • @Walter as for the last edit, I did try moving the pragma statement to the outer loop before posting; it made the code equally as fast as the serial version. – user1830219 Oct 03 '13 at 19:57

I assume that you are concerned with the time taken by the entire code inside the while loop, not just by the time taken by the loop beginning for(int i = 1; i < MAX-1; i++).

This operation

if(iterationCounter != 0)
{
    memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));
}

is unnecessary and, for large arrays, may be enough to kill performance. Instead of maintaining 2 arrays, old and new, maintain one 3D array with two planes. Create two integer variables, let's call them old and new, and set them to 0 and 1 initially. Replace

newTemperature[i][j] = ((oldTemperature[i-1][j] +  oldTemperature[i+1][j] + oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4);

by

temperature[new][i][j] = 
  (temperature[old][i-1][j] +
   temperature[old][i+1][j] +
   temperature[old][i][j-1] +
   temperature[old][i][j+1])/4;

and, at the end of the update, swap the values of old and new so that the updates go the other way round. I'll leave it to you to determine whether old/new should be the first index into your array or the last. This approach eliminates the need to move (large amounts of) data around in memory.
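A minimal sketch of that scheme, assuming MAX is the 1024 from the question (untested; since new is a reserved word in C++, the plane indices are named src and dst here):

static double temperature[2][MAX][MAX];  // two planes; static keeps the ~16 MB off the stack
int src = 0, dst = 1;                    // which plane is read from / written to

// one sweep of the update
for(int i = 1; i < MAX-1; i++)
  for(int j = 1; j < MAX-1; j++)
    temperature[dst][i][j] =
      (temperature[src][i-1][j] + temperature[src][i+1][j] +
       temperature[src][i][j-1] + temperature[src][i][j+1]) / 4;

std::swap(src, dst);  // swap the roles of the planes: two ints change, no array data moves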

Another possible cause of serious slowdown, or failure to accelerate, is covered in this SO question and answer. Whenever I see arrays with sizes of 2^n I suspect cache issues.

High Performance Mark
  • @Walter: your suggested edit materially changed my answer so I've rolled it back. – High Performance Mark Oct 03 '13 at 12:26
  • Yes, my edit rendered your code correct. Other than that, I'm not aware of a "material change", whatever that means. Perhaps you're not familiar with `C++` (this is a C++ question), but `new` is not allowed as a variable name in `C++`. – Walter Oct 03 '13 at 12:30
  • Hmm. You replaced memcpy with a 3D array with 2 planes and used an index to indicate which plane is to/from. Doesn't that make you a bit dependent on the compiler doing a really good job of avoiding recomputing the 3D coordinate? I agree, most good compilers ought to do this. I would have declared two 2D arrays A and B, and called a top level do-the-work subroutine that accepted them as to/from (reference) arguments. Call the subroutine *twice*, once with A,B, once with B,A. Now the compiler doesn't have to do that 3rd axis at all. – Ira Baxter Oct 03 '13 at 14:12
  • @IraBaxter: yes, what you suggest is another approach. I just don't like all that shuffling of data around memory, can't see it doing anything other than slow things down, though by how much I wouldn't care to speculate. – High Performance Mark Oct 03 '13 at 14:43
  • @IraBaxter: no, I wasn't accusing your scheme of shuffling data, I was concurring with you (and with myself) that pointer re-pointing, or swapping indices, is a better approach. *I just don't like* applied to OP's original scheme. – High Performance Mark Oct 03 '13 at 15:18
  • Oops, typo in my last comment, deleted and fixed here: "Like your scheme, my scheme doesn't do any data shuffling. It (has to) pass pointers to the arrays to the subroutines. The arrays don't move, they just alternate being supplier and recipient.". Yes, I can see we're in agreement. – Ira Baxter Oct 03 '13 at 15:22