
I'm trying to implement a parallel algorithm using tasks within OpenMP. The parallel programming pattern is based on the producer-consumer idea, but since the consumer process is slower than the producer, I want to use a few producers and several consumers. The main idea is to create as many OS threads as there are producers, and then each of these will create tasks to be done in parallel (by the consumers). Every producer is associated with a proportional number of consumers (i.e. numCheckers/numSeekers). I'm running the algorithm on a dual-socket Intel server with 6 cores per chip.

The thing is that when I use only one producer (seeker) and an increasing number of consumers (checkers), the performance degrades very quickly as the number of consumers grows (see the table below), even though the correct number of cores is working at 100%. On the other hand, if I increase the number of producers, the average time decreases, or at least stays stable, even with a proportional number of consumers. It seems to me that all the improvement comes from dividing the input among the producers, and that the tasks only add overhead. But again, I have no explanation for the behaviour with one producer. Am I missing something in the OpenMP task logic? Am I doing something wrong?

-------------------------------------------------------------------------
|   producers   |   consumers   |   time        |
-------------------------------------------------------------------------
|       1       |       1       |   0.642935    |
|       1       |       2       |   3.004023    |
|       1       |       3       |   5.332524    |
|       1       |       4       |   7.222009    |
|       1       |       5       |   9.472093    |
|       1       |       6       |   10.372389   |
|       1       |       7       |   12.671839   |
|       1       |       8       |   14.631013   |
|       1       |       9       |   14.500603   |
|       1       |      10       |   18.034931   |
|       1       |      11       |   17.835978   |
-------------------------------------------------------------------------
|       2       |       2       |   0.357881    |
|       2       |       4       |   0.361383    |
|       2       |       6       |   0.362556    |
|       2       |       8       |   0.359722    |
|       2       |      10       |   0.358816    |
-------------------------------------------------------------------------

The main section of my code is as follows:

int main(int argc, char** argv) {

  // ... process the input (read from file, etc...)

  const char *buffer_start[numSeekers];
  int buffer_len[numSeekers];

  // populate these arrays, dividing the input;
  // I need to do this because the buffers have to overlap for
  // correctness, so a simple parallel for is not enough

  // Here is where I create the producers
  int num = 0;
  #pragma omp parallel for num_threads(numSeekers) reduction(+:num)
  for (int i = 0; i < numSeekers; i++) {
      num += seek(buffer_start[i], buffer_len[i]);
  }

  return num;
}

int seek(const char* buffer, int n){

  int num = 0;

  //assign the same number of consumers to each producer
  #pragma omp parallel num_threads(numCheckers/numSeekers) shared(num)
  {
    //only one time for every producer
    #pragma omp single
    {
      for (int pos = 0; pos < n; pos += STEP){
        if (condition(buffer[pos])){
          #pragma omp task shared(num)
          {
            //check() is a sequential function
            num += check(buffer[pos]);
          }
        }
      }
      #pragma omp taskwait
    }
  }
  return num;
}
ees_cu
  • You have enabled nested parallelism, haven't you? Note that the standard says that task execution might be deferred until a scheduling point is reached (e.g. `taskwait`). Also there is a data race on `num` in the tasks. You should protect the accumulation with an `atomic` construct. Keep in mind also NUMA locality if your system consists of two AMD64 chips or Nehalem and post-Nehalem Intel chips. – Hristo Iliev Oct 23 '12 at 16:42

2 Answers


The observed behaviour is due to the fact that you do not have nested parallel regions enabled. What happens is that in the first case you are actually experiencing the HUGE overhead of OpenMP tasks. This is most likely because check() does not do enough work compared to the overhead the OpenMP run-time introduces. Why does it behave like this with one and with two producers?

When running with just one producer, the outer parallel region executes with one thread only. Such parallel regions are inactive according to the OpenMP API specification, and they just execute the code inside serially (the only overhead being an additional function call and accessing shared variables via pointers). In this case the inner parallel region, although nested while nested parallelism is disabled, becomes active and spawns a lot of tasks. Tasks introduce relatively high overhead, and this overhead increases with the number of threads. With one consumer the inner parallel region is also inactive and hence runs serially, without the tasking overhead.

When running with two producers, the outer parallel region is active and hence the inner parallel region is rendered inactive (remember: nested parallelism is not enabled), and as a consequence no tasks are created at all; seek() simply runs serially. There is no tasking overhead and the code runs almost twice as fast as the 1 producer / 1 consumer case. The run time does not depend on the number of consumers because the inner parallel region is always inactive, no matter how many threads are specified.

How big is the overhead that tasking and concurrent access to shared variables introduce? I've created a simple synthetic benchmark that executes the following code:

double ssum = 0.0;
for (int i = 0; i < 10000000; i++) {
   ssum += sin(i*0.001);
}

Serially this executes in less than a second on a Westmere CPU at the default optimisation level of GCC 4.7.2. Then I introduced tasks, using an atomic construct to protect the access to the shared variable ssum:

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         #pragma omp atomic
         ssum += sin(i*0.001);
      }
   }
}

(no need for taskwait here since there is a scheduling point at the implicit barrier at the end of the parallel region)

I've also created a more complex variant that performs reduction the same way that Massimiliano has proposed:

#define STRIDE 8

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         const int idx = omp_get_thread_num();
         ssumt[idx*STRIDE] += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   const int idx = omp_get_thread_num();
   #pragma omp atomic
   ssum += ssumt[idx*STRIDE];
}

Code was compiled with GCC 4.7.2 like:

g++ -fopenmp -o test.exe test.cc

Running it in batch mode (so no other processes could intervene) on a dual-socket Westmere system (12 cores in total) with a different number of threads and different threads placement on the sockets, one obtains the following run times for both codes:

Configuration   ATOMIC       Reduction    ATOMIC slowdown
2 + 0            2.79 ±0.15   2.74 ±0.19   1.8%
1 + 1            2.72 ±0.21   2.51 ±0.22   8.4% <-----
6 + 0           10.14 ±0.01  10.12 ±0.01   0.2%
3 + 3           22.60 ±0.24  22.69 ±0.33  -0.4%
6 + 6           37.85 ±0.67  38.90 ±0.89  -2.7%

(run times are given in seconds as measured by omp_get_wtime(), averaged over 10 runs, with the standard deviation also shown; x + y in the Configuration column means x threads on the first socket and y threads on the second socket)

As you can see, the overhead from the tasks is enormous. It is much higher than the overhead from using atomic instead of applying reduction to thread-private accumulators. Besides, the assignment part of an atomic with += compiles to a locked compare-and-exchange instruction (LOCK CMPXCHG), which is not much more expensive than calling omp_get_thread_num() each time.

One should also note that the dual-socket Westmere system is NUMA, since each CPU has its own memory and accesses to the memory of the other CPU go through the QPI link and hence incur increased latency (and possibly lower bandwidth). As the ssum variable is shared in the atomic case, threads that run on the second processor essentially make remote requests. Still, the difference between both codes is negligible (except for the marked two-threads case - I have to investigate why) and the atomic code even starts to outperform the one with the reduction when the number of threads gets higher.

On multi-level NUMA systems the synchronisation in the atomic approach might become more of a burden, since it adds locking overhead to the already slower remote accesses. One such system is one of our BCS-coupled nodes. BCS (Bull Coherence Switch) is a proprietary solution from Bull that uses XQPI (eXternal QPI) to link several Nehalem-EX boards together into a single system, introducing three levels of NUMA along the way (local memory; remote memory on the same board; remote memory on a remote board).

When running on one such system, consisting of 4 boards with 4 octocore Nehalem-EX CPUs each (128 cores in total), the atomic executable runs for 1036 s (!!), while the reduction approach runs for 1047 s, i.e. both still execute for about the same time (my previous statement that the atomic approach was 21.5% slower was due to OS services jitter during the measurement). Both numbers come from single runs and hence are not quite representative. Note that on this system the XQPI link introduces very high latency for inter-board QPI messages, so locking is very expensive, though not prohibitively so.

Part of the overhead can be removed by using a reduction, but it has to be implemented properly. First, local copies of the reduction variable should also be local to the NUMA node where the thread executes, and second, one should find a way to avoid calls to omp_get_thread_num(). These two can be achieved in many different ways, but the easiest one is just to use threadprivate variables:

static double ssumt;
#pragma omp threadprivate(ssumt)

#pragma omp parallel
{
   ssumt = 0.0;

   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         ssumt += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   #pragma omp atomic
   ssum += ssumt;
}

Access to ssumt needs no protection as two tasks seldom execute simultaneously in the same thread (have to further investigate if this conforms to the OpenMP specs). This version of the code executes for 972 s. Once again this is not that far from 1036 s and comes from a single measurement only (i.e. it could be simply a statistical fluctuation), but in theory, it should be faster.

Lessons to take home:

  • Read the OpenMP specification about nesting parallel regions. Nested parallelism is enabled either by setting the environment variable OMP_NESTED to true or by calling omp_set_nested(1). If enabled, the level of active nesting can be controlled with the OMP_MAX_ACTIVE_LEVELS environment variable, as pointed out by Massimiliano.
  • Look out for data races and try to prevent them using the simplest approach. A more complex approach does not always give you better performance.
  • Special systems usually require special programming.
  • When in doubt use a thread-checker tool if available. Intel has one (commercial) and Solaris Studio (previously known as Sun Studio) from Oracle has one too (free; has Linux version despite the product name).
  • Mind the overhead! Try to split the job in big enough chunks so that the overhead from having millions of tasks created does not negate the obtained parallel gain.
Hristo Iliev
  • Thank you Hristo for your answer. You were right, I wasn't using nested parallelism. Once I activated nested parallelism, the inner parallel block became active, and now the timings for more than one producer are similar to those I showed before for only one producer. What really amazes me is how inefficient tasks are in OpenMP. I had implemented a couple of algorithms in .NET using the Task class and it didn't seem to have such a huge impact on performance, or maybe the platform itself hides part of this overhead with its own overhead. – ees_cu Oct 24 '12 at 18:25
  • Do you recommend using tasks, or should I try to schedule and split the work myself using a queue or something like that? I understand that tasks are a bad idea if the amount of work done by each task is too small, but in my case it was an easy way to avoid implementing the data structure that holds the work to be consumed by the checkers and making it thread-safe, since this is very close to the way tasks are supposed to work. – ees_cu Oct 24 '12 at 18:25
  • One possible solution would be to use `parallel for schedule(dynamic,10) reduction(+:num)` instead of `single` + `task` (adjust the chunksize in `schedule` until you get the best performance). Another possible solution would be to reimplement tasking :) In the `single` block build a list of all pointers (or whatever the elements of `buffer` are) that match `condition()`. Then simply run over it with `for` with static schedule (or dynamic, if `check()` could take different amount of time to execute). – Hristo Iliev Oct 24 '12 at 19:17
  • 1
    May I also add this - use parallel constructs wisely. You code already runs for only 0.642935 seconds single-threadedly. Parallel constructs add overhead (or **lots** of overhead; depends on how you use them). Unless you have **lots** of items to process, it makes no sense to go parallel. You might also want to look at the `if` directive that allows one to selectively deactivate `parallel` regions, e.g. when the number of work items is too low for the parallel processing to be effective. – Hristo Iliev Oct 24 '12 at 19:21

As Hristo already suggested in a comment, you should enable nested parallelism. This is done by setting the environment variables:

  • OMP_NESTED (that enables or disables nested parallelism)
  • OMP_MAX_ACTIVE_LEVELS (that controls the maximum number of nested active parallel regions)

On the other hand, rather than protecting the accumulation with an atomic construct, I would suggest the following strategy:

...
// Create a local buffer to accumulate partial results
const int nthreads = numCheckers/numSeekers;
const int stride   = 32; // Choose a value that avoids false sharing
int numt[stride*nthreads];
// Initialize to zero as we are reducing with the + operator
for (int ii = 0; ii < stride*nthreads; ii++)
  numt[ii] = 0;

#pragma omp parallel num_threads(numCheckers/numSeekers)
{
  //only one time for every producer
  #pragma omp single
  {
    for (int pos = 0; pos < n; pos += STEP){
      if (condition(buffer[pos])){
        #pragma omp task
        {
          //check() is a sequential function
          const int idx = omp_get_thread_num();
          numt[idx*stride] += check(buffer[pos]);
        }
      }
    }
  }
  #pragma omp taskwait

  // Accumulate partial results
  const int idx = omp_get_thread_num();
  #pragma omp atomic
  num += numt[stride*idx];
}

This should prevent a potential slowdown due to simultaneous requests to write on the same memory location.

Note that a previous version of this answer, suggesting the use of a reduction clause in the innermost parallel region, was wrong, since §2.9.3.6 of the OpenMP 3.1 specification states:

A list item that appears in a reduction clause of the innermost enclosing worksharing or parallel construct may not be accessed in an explicit task

Massimiliano
  • Unfortunately your example is non-conformant. The OpenMP API specification imposes restrictions on where `reduction` variables can be accessed from and explicit tasks (as created by the `task` construct) are NOT among the allowed places (OpenMP API specification v3.1, §2.9.3.6). That's why I've recommended the use of `atomic`. Task reductions are coming in OpenMP API specs v4.0. – Hristo Iliev Oct 23 '12 at 20:22
  • @HristoIliev Damn, you are right! :-) Many thanks for pointing to §2.9.3.6 as I didn't know that. I'll remove the answer as it is evidently flawed. – Massimiliano Oct 23 '12 at 20:35
  • @HristoIliev Now it should be fine. Please tell me if you see still anything wrong. – Massimiliano Oct 23 '12 at 20:57
  • This borders on _premature optimisation_. I did a simple test, running 10 million tasks: `sum += sin(i*0.01);` both using the `atomic` construct and your reduction approach. There was no noticeable difference between the two while running with 2, 6 and 12 threads on a dual-socket Westmere system. `atomic` was 8% slower _only_ in the case of two threads running on separate sockets (but not in the 3+3 threads case), while yours was slower when running with 12 threads. Your approach was 20% faster on a 128-core 3-level NUMA system though (but it requires special care with OpenMP nevertheless). – Hristo Iliev Oct 24 '12 at 12:28