for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            XY[2*nind] = i;
            XY[2*nind + 1] = j;
            nind++;
        }
    }
}

Here x = 512, z = 512, nind = 0 initially, and XY is an array of size 2*x*z.

I want to parallelize these for loops with OpenMP, but the variable nind is bound serially to the loops: because of the if condition, some iterations skip the increment and others perform it, so with OpenMP the threads would increment nind in whatever order they happen to arrive, racing on it. Is there any way to unbind it? (By 'bound' I mean it can only be updated serially.)

Chintan

3 Answers


A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:

#include <string.h>   // for memcpy()

#pragma omp parallel
{
  // Private per-thread buffer: a ~2 MB VLA on the stack for x = z = 512,
  // so a larger OMP_STACKSIZE may be needed for the worker threads
  uint myXY[2*z*x];
  uint mynind = 0;

  // N is a tunable chunk size; see the note below the code
  #pragma omp for collapse(2) schedule(dynamic,N)
  for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
      if (inFunc(p, index)) {
        myXY[2*mynind] = i;
        myXY[2*mynind + 1] = j;
        mynind++;
      }
    }
  }

  // Concatenate the private buffers one thread at a time
  #pragma omp critical(concat_arrays)
  {
    memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
    nind += mynind;
  }
}

// Sort the pairs if needed (qsort is declared in <stdlib.h>)
qsort(XY, nind, 2*sizeof(uint), compar);

// qsort requires this exact signature, so cast the void pointers to pairs
int compar(const void *pa, const void *pb)
{
   const uint *p1 = pa;
   const uint *p2 = pb;
   if (p1[0] < p2[0])
     return -1;
   else if (p1[0] > p2[0])
     return 1;
   else
   {
     if (p1[1] < p2[1])
       return -1;
     else if (p1[1] > p2[1])
       return 1;
   }
   return 0;
}

You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between scheduling overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written more efficiently.
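If it helps, here is one way to drive that chunk-size experiment (a sketch, not from the answer above: run_filter() is a hypothetical wrapper around the parallel region, compiled with schedule(runtime) instead of schedule(dynamic,N), and the candidate chunk sizes are arbitrary):

#include <stdio.h>
#include <omp.h>

// Hypothetical wrapper around the filtering code above, with the loop
// compiled as schedule(runtime) so the schedule can be changed at run time.
void run_filter(void);

int main(void)
{
    int chunks[] = {1, 4, 16, 64, 256};   // arbitrary candidates to try
    for (int c = 0; c < 5; c++) {
        omp_set_schedule(omp_sched_dynamic, chunks[c]);
        double t0 = omp_get_wtime();
        run_filter();
        printf("chunk %d: %.3f s\n", chunks[c], omp_get_wtime() - t0);
    }
    return 0;
}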

The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.

Hristo Iliev

Here is a variation on Hristo Iliev's good answer.

The key idea here is to operate on the linear index of each pair rather than on the pairs themselves.

We can fill private arrays of the pair indices in parallel, one per thread. Each thread's array will then be sorted (irrespective of the scheduling; but see the note at the end of this answer).

The following function merges two sorted arrays:

void merge(uint *a, uint *b, uint *c, int na, int nb) {
    int i = 0, j = 0, k = 0;
    // Standard two-way merge of a (na elements) and b (nb elements) into c
    while (i < na && j < nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
    while (i < na) c[k++] = a[i++];   // copy the tail left in a
    while (j < nb) c[k++] = b[j++];   // copy the tail left in b
}

Here is the remaining code:

#include <stdlib.h>   // for malloc()/free()

uint nind = 0;
uint *P = NULL;        // must start as NULL so the first free(P) is safe
#pragma omp parallel
{
    uint myP[x*z];     // private index buffer (a ~1 MB VLA per thread)
    uint mynind = 0;
    #pragma omp for schedule(dynamic) nowait
    for (uint k = 0; k < x*z; k++) {
        if (inFunc(p, index)) myP[mynind++] = k;
    }
    #pragma omp critical
    {
        // Merge this thread's sorted indices into the global sorted array
        uint *t = malloc(sizeof *P * (nind + mynind));
        merge(P, myP, t, nind, mynind);
        free(P);
        P = t;
        nind += mynind;
    }
}

Then given an index k in P the pair is (k/z, k%z).
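For instance, a minimal sketch of decoding the result back into pairs:

// Walk the merged index array and recover each (i, j) pair
for (uint m = 0; m < nind; m++) {
    uint i = P[m] / z;   // row index
    uint j = P[m] % z;   // column index
    // ... use (i, j) here ...
}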

The merging can be improved. As written it takes O(nthreads) sequential merge steps, but it could be done in O(log2(nthreads)) with a tree of pairwise merges. I did not bother with this.
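For the curious, a hedged sketch of what such a tree merge could look like (the names tree_merge, parts, counts, and nt are mine, not from the answer; it assumes every thread has first stored its sorted chunk in a heap-allocated array, and that uint is unsigned int as elsewhere in this thread):

#include <stdlib.h>

// merge() is the two-way merge defined above.
// parts[t] holds thread t's sorted indices, counts[t] its element count.
// Pairs of chunks are merged in parallel and the number of live chunks
// halves per round, so only ceil(log2(nt)) rounds are needed.
uint *tree_merge(uint **parts, int *counts, int nt)
{
    for (int step = 1; step < nt; step *= 2) {
        #pragma omp parallel for
        for (int t = 0; t < nt; t += 2*step) {
            if (t + step < nt) {
                int n = counts[t] + counts[t + step];
                uint *buf = malloc(n * sizeof *buf);
                merge(parts[t], parts[t + step], buf, counts[t], counts[t + step]);
                free(parts[t]);
                free(parts[t + step]);
                parts[t] = buf;
                counts[t] = n;
                counts[t + step] = 0;
            }
        }
    }
    return parts[0];   // all indices, globally sorted
}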


Hristo Iliev pointed out in the comments that dynamic scheduling does not guarantee that the iterations assigned to a thread increase monotonically. I think in practice they do, but it is not guaranteed in principle.

If you want to be 100% sure that the iterations increase monotonically, you can implement the dynamic scheduling by hand, as sketched below.
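A hedged sketch of what that could look like (CHUNK is a tuning parameter of my choosing): a shared counter handed out with atomic capture only ever grows, so each thread's successive chunks start at strictly increasing iteration numbers and its private myP stays sorted:

#define CHUNK 1024   // tunable chunk size; an assumption, not from the answer

uint next = 0;       // shared counter of the next unclaimed iteration
#pragma omp parallel
{
    uint myP[x*z];
    uint mynind = 0;
    uint start;
    for (;;) {
        // Claim the next chunk; `next` only ever grows, so every thread
        // sees its chunks in increasing iteration order.
        #pragma omp atomic capture
        { start = next; next += CHUNK; }
        if (start >= x*z) break;
        uint end = (start + CHUNK < x*z) ? start + CHUNK : x*z;
        for (uint k = start; k < end; k++)
            if (inFunc(p, index)) myP[mynind++] = k;
    }
    // ... merge myP into P inside a critical section, exactly as above ...
}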

Z boson
  • I've completely overlooked the fact that in this case the `nowait` clause is applicable. Also, well done with the pair indexing. I thought of casting the pairs to `ulong`'s for sorting, but it won't work on little endian machines. – Hristo Iliev Feb 16 '16 at 20:46
  • @Z boson, just to confirm: you have not allocated memory for P. P is of size 2*x*z, right? – Chintan Feb 17 '16 at 05:23
  • By the way, the OpenMP standard gives no guarantee that with `dynamic` schedule each thread will receive a sequence of increasingly larger logical iteration numbers: _"... the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed."_ Therefore, your assumption that `myP` is sorted does not hold universally. – Hristo Iliev Feb 17 '16 at 08:32
  • @Chintan, I have given memory to `P`. Look in the critical section. It has size `x*z` since it stores indices of pairs and not the pairs themselves. – Z boson Feb 17 '16 at 09:25
  • @HristoIliev, I was not aware of that. I just assumed it would increase monotonically (when I simulate dynamic scheduling by hand, that's how I do it). In that case, sorting as you have done is the correct general solution. I guess I could implement dynamic scheduling myself and ensure that it increases monotonically, though. – Z boson Feb 17 '16 at 09:26
  • @HristoIliev, incidentally, how would you merge in `log(nthreads)` steps? I mean, say I have one array per thread that I want to merge. Let's say I have eight threads. I can use 4 threads to merge the eight arrays into four arrays, then 2 threads to merge the four arrays into two arrays, and finally one thread to merge the final two arrays. How would you do something like this? – Z boson Feb 17 '16 at 09:29
  • @hristo: "the OpenMP standard gives no guarantee that with dynamic schedule each thread will receive a sequence of increasingly larger logical iteration numbers". See the OpenMP 4.5 monotonic and nonmonotonic schedule modifiers. Whether there is such a guarantee or not was the subject of a lot of controversy in the language committee when I asked for clarification last year! – Jim Cownie Feb 17 '16 at 09:36
  • @Zboson, honestly, I have no idea. I'm a theoretical physicist with some HPC background. Clever and efficient algorithms are not (yet) my speciality. – Hristo Iliev Feb 17 '16 at 11:16
  • @JimCownie: Apparently, I'm still stuck at version 4.0 :) Thanks for the pointer! Though, I'm often hesitant to provide answers using the latest and greatest features of a given paradigm because existing implementations tend to lag behind and posters here expect working solutions *now*. – Hristo Iliev Feb 17 '16 at 11:23
  • @Chintan, maybe you mean that I did not initialize `P` to NULL? I should have done that; it is a potential bug when I first do `free(P)`. Note that I allocate and free memory each time a thread enters the critical section, so the size keeps growing. I do this because `merge` does not merge in place, so I need to allocate a new array on each merge and free the old one. – Z boson Feb 17 '16 at 11:50
  • @hristo: "I'm often hesitant to provide answers using the latest and greatest features of a given paradigm because existing implementations tend to lag behind and posters here expect working solutions now". Sure, that's a very reasonable position. I certainly wasn't saying that your answer should use [non]monotonic, merely pointing out that the issue of what is allowed/intended by the standard wasn't as obvious as you made out. (The debate about what the standard meant was extremely heated; people thought it was obvious, but completely disagreed on *what* was obvious!) – Jim Cownie Feb 17 '16 at 12:30
  • @JimCownie, my assertion was based entirely on my ignorance of the changes introduced in version 4.5 :-) – Hristo Iliev Feb 17 '16 at 14:20
  • @HristoIliev and Zboson, I have one more question, which is off-topic. I was using Linux and gcc until now, and everything worked fine with my code and OpenMP. Now I have to build my project on a Mac too, and unfortunately Apple's LLVM has no OpenMP support. I read up on it and found that Apple's LLVM doesn't support OpenMP; previously Apple's compiler was a front end for gcc, but I can't switch to that. However, the open-source LLVM 3.7 supports OpenMP 3.1. – Chintan Feb 18 '16 at 07:08
  • So I have downloaded LLVM 3.7 and Clang, built them with cmake, and installed them. Now I don't know how to use them, because I am not using Xcode; I am using make to build my project. Simply put, I don't know how to point my g++ command at the newly installed compiler. – Chintan Feb 18 '16 at 07:08
  • @Chintan, I have never developed on OS X. But I think you can find your answer on SO if you look around. Otherwise, ask a new question. – Z boson Feb 18 '16 at 07:49
  • @Zboson, okay, thank you guys for your help, it is really helpful. Thank you once again. – Chintan Feb 18 '16 at 07:57
  • @Chintan, Clang is more or less compatible with GCC when it comes to command-line options. Just set `CC` to `clang-3.7` and `CXX` to `clang++-3.7` (or whatever the compiler executables are called in your 3.7 installation). You might have to separately download/build the [Intel OpenMP runtime](https://www.openmprtl.org/), which is required by Clang. You might also have to explicitly specify the OpenMP runtime library in the link command, e.g. `-L/path/to/IntelOpenMPruntime -liomp5`. – Hristo Iliev Feb 18 '16 at 09:01
  • @Chintan, you can also get an OpenMP-enabled GCC for OS X, either by compiling it yourself or by using one of the existing package managers (Homebrew, etc.) – Hristo Iliev Feb 18 '16 at 09:03
  • I have downloaded the Intel OpenMP runtime and built it as well. So I don't need to pass the -fopenmp flag, right? I will just pass -liomp5 for the linking. Let me just try. Thank you. – Chintan Feb 18 '16 at 09:06
  • @HristoIliev, hi, I have done that but it's not working. There is no linking error, but omp_get_num_threads() returns 1 and omp_get_thread_num() returns 0 inside the loop, so it seems OpenMP is not actually active in my code. Can you suggest why? – Chintan Feb 18 '16 at 11:17

The code you provide looks like you are trying to fill the XY data in sequential order. In that case OpenMP multithreading is probably not the right tool for the job, as threads should ideally avoid communicating as much as possible. You could introduce an atomic counter, but then again it is probably going to be faster to just do it sequentially.
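For reference, a hedged sketch of that atomic-counter variant (it keeps XY gap-free but loses the sequential ordering of the pairs, and the contended counter may well make it slower than the serial loop, as noted above):

#pragma omp parallel for collapse(2)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint slot;
            // Atomically claim the next free pair slot in XY
            #pragma omp atomic capture
            slot = nind++;
            XY[2*slot] = i;
            XY[2*slot + 1] = j;
        }
    }
}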

Also, what do you want to achieve by optimizing this? x and z are not too big, so I doubt you will get a substantial speed-up even if you reformulate your problem in a parallel fashion.

If you do want parallel execution, map your indices directly to slots in the array, e.g. (not tested, but it should work):

#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            // Every (i, j) owns a fixed slot, so no two threads write the same location
            uint idx = 2 * (i * z + j);   // note: the inner extent z, not x
            XY[idx] = i;
            XY[idx + 1] = j;
        }
    }
}

However, you will then have gaps in your array XY, which may or may not be a problem for you; a sketch of one way to compact them follows.
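If the gaps do matter, one option (a sketch, assuming XY was pre-filled with the sentinel UINT_MAX before the parallel loop, a value no valid index can take here) is a single sequential compaction pass afterwards:

#include <limits.h>

// Squeeze out the unwritten (sentinel) slots; the surviving pairs keep
// their (i, j) order because the parallel loop wrote to fixed positions.
uint nind = 0;
for (uint k = 0; k < x*z; k++) {
    if (XY[2*k] != UINT_MAX) {
        XY[2*nind] = XY[2*k];
        XY[2*nind + 1] = XY[2*k + 1];
        nind++;
    }
}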

niosus
  • Actually, I am doing this task 10 times and it takes 2 seconds each time, so it blocks my application for 20 seconds. Let me just try this solution. – Chintan Feb 16 '16 at 09:05
  • @Chintan, if it is truly taking 2 seconds each time, then I can only assume the `inFunc` function is the big roadblock, because nothing else here should take any appreciable time at all. I would suggest that you simply apply `#pragma omp atomic` around the `nind++` line if you do not need the order to be preserved (or Hristo's answer if you do need the order preserved). – NoseKnowsAll Feb 16 '16 at 17:03