
I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that is modified by adding values to it inside a for loop. Is there a way to have each thread created by OpenMP maintain its own local copy of the array and then join the copies after the loop? Since the number of threads is variable, I cannot create the local copies of the array beforehand. If I use a single global copy and resolve the race condition with a synchronization lock, the performance is terrible.

Thanks!

Edited: Sorry for not being clear. Here's some pseudo-code that will hopefully clarify the scenario:

int* huge_array = new int[N];
memset(huge_array, 0, N * sizeof(int));
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
  // get a value v independently
  // get a position p independently
  // I have to set a lock here
  omp_set_lock(&lock);
  huge_array[p] += v;
  omp_unset_lock(&lock);
}

Is there a way to improve the performance of the code above?

Jose
  • Make this clearer. What is your issue? change the title. – dzada Sep 12 '13 at 20:29
  • Do you mean "join" rather than "john"? If so, that's still not entirely clear. – Keith Thompson Sep 12 '13 at 20:30
  • Please provide a little pseudo code to explain your intention. It's not clear what you mean with *adding values in a for loop*. Do you mean adding (`+`) to the values in the array? Or do you mean extending the array? – Walter Sep 12 '13 at 20:35

4 Answers


Okay, I finally understood what you want to do. Yes, you can do it the same way as with pthreads.

std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
  int np = omp_get_num_threads();
  std::vector<int> localA(N);
  local[omp_get_thread_num()] = localA.data();

  // add values to local array
  #pragma omp for
  for(int i=0; i<num_values; ++i)
    localA[position()] += value();          // (1)
  // implicit barrier ensures all local copies are ready for aggregation

  // aggregate local copies into global array
  #pragma omp for
  for(int k=0; k<N; ++k)
    for(int p=0; p<np; ++p)
       A[k] += local[p][k];                 // (2)
  // implicit barrier ensures no local copy is deleted before aggregation is done
}

but it is important to do the aggregation in parallel as well.

Walter
  • The position p can be repeated, but the algorithm is still deterministic because it doesn't matter in which sequence you add the values. If using Pthread, I can create a local copy of array to each thread and aggregate them afterwards. But I don't know how to do that in OpenMP since we don't explicitly create threads here. – Jose Sep 12 '13 at 21:02
  • @Jose. Sorry, I read your `+=` as `=` (that's why I thought your code would not be deterministic). – Walter Sep 12 '13 at 21:21
  • That looks like it will work. Personally I would allocate `local` inside the parallel block with `omp_get_num_threads()` rather than outside the parallel block with `omp_max_num_threads()`. Also, does the same value of position appear more than once? If so then there is a race condition. – Z boson Sep 13 '13 at 18:02
  • @redrum `local` only holds a few pointers, so allocating that outside doesn't add any penalty. Yes, positions can occur more than once, but there is no race condition: in (1) we write to a local array and in (2) we parallelise over position space (`k`). – Walter Sep 16 '13 at 09:30
  • @Walter, the point is that the number of operating thread can be less than the max so it's better to allocate inside the block in general. Let me think about the race condition. Something looks fishy. – Z boson Sep 16 '13 at 09:47
  • @redrum if you allocate inside, you need another global synchronisation barrier, and that only to avoid the allocation of a few potentially unused bytes (mind you: it's only a pointer per thread). Thus, allocating outside is *preferable*. – Walter Sep 16 '13 at 12:09
  • You do it with a `single` directive. That's all. But anyway, it's too trivial to debate any further. BTW, I gave you an up vote. – Z boson Sep 17 '13 at 12:24
  • @Walter, in case you're interested, I did something very similar here [fill histograms array reduction with openmp](http://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic) – Z boson Sep 17 '13 at 12:38
  • @redrum the single may be simple to issue for the programmer, but still imposes a global (implicit) synchronisation barrier; thanks for the upvote. – Walter Sep 17 '13 at 20:21

In Walter's answer, I believe instead of

std::vector<int*> local(omp_max_num_threads());

It should be

std::vector<int*> local(omp_get_max_threads());

omp_max_num_threads() is not a routine in OpenMP.

HD189733b

What about using the directive

#pragma omp parallel for private(VARIABLE)

for your program?

EDIT: For your code I would use my directive; you won't lose so much time locking and unlocking your variable...

EDIT 2: Sorry, you cannot use my code for your problem unless you first create a temporary array where you store your data temporarily...

arc_lupus
  • If you do that with a C-style array pointer, you will get a private pointer with shared data, which will lead to incorrect results. – us2012 Sep 12 '13 at 20:34
  • But when using an array (not a pointer), it works... Let us wait for the OP's explanation. – arc_lupus Sep 12 '13 at 20:38
  • What do you mean by using your directive? Can you elaborate? – Jose Sep 12 '13 at 20:56
  • Simply put `private(huge_array)` behind the first `for` (and remove the locking and unlocking part); then I think it should work... – arc_lupus Sep 12 '13 at 20:58

As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP, I have two suggestions.

1.) If the number of bins of the histogram (array) is much less than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them in a critical section in serial. Since the number of bins is much less than the number of values, this should be efficient.
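A minimal sketch of suggestion 1 (`position()` and `value()` are hypothetical stand-ins for the OP's independent per-iteration computations):

```cpp
#include <vector>

// Hypothetical stand-ins for the OP's per-iteration computations.
static int position(int i, int nbins) { return i % nbins; }
static int value(int)                 { return 1; }

// Method 1: fill thread-private histograms in parallel,
// then merge them one thread at a time in a critical section.
std::vector<int> fill_histogram(int nbins, int num_values) {
    std::vector<int> hist(nbins, 0);
    #pragma omp parallel
    {
        std::vector<int> priv(nbins, 0);      // private to this thread
        #pragma omp for nowait                // no barrier needed before the merge
        for (int i = 0; i < num_values; ++i)
            priv[position(i, nbins)] += value(i);
        #pragma omp critical                  // serial merge: nbins adds per thread
        for (int k = 0; k < nbins; ++k)
            hist[k] += priv[k];
    }
    return hist;
}
```

The merge costs nbins additions per thread, which is cheap when nbins is much smaller than num_values.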

2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit trickier. Additionally, one needs to be careful with cache alignment and false sharing.

I showed how to do both these methods and discuss some of the cache issues in the following question: Fill histograms (array reduction) in parallel with openmp without using a critical section.

Z boson