
I tried to parallelize the nested loops using OpenMP, but I am not sure whether this is the correct way to do it. Here is the part of the code with the nested loops; it is just generic code. With noofrecords set to 50k it takes a lot of time even after parallelization. Can someone suggest a better way to parallelize the code? In the code below I am only parallelizing the outer loop.

int ctd=0;
#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd)
for(int j=0; j<noofrecords; j++)
{
    for(int k=0; k<noofrecords; k++)
    {
        for(int l=0; l<noofrecords; l++)
        {
            if(condition)
            {
                D[ctd].a1=A[i].a1;
                ctd++;
            }
        }
    }
}
Nancy
  • Be aware that your code is wrong - you cannot just increment a shared index like `ctd`. See https://stackoverflow.com/questions/43057426/openmp-multiple-threads-update-same-array – Zulan Nov 07 '18 at 15:30
  • Zulan is correct. Because `ctd` is a shared variable, your algorithm is fundamentally sequential. There's not going to be any way to parallelize this effectively if condition holds true many times. Moreover, if you later need to know exactly which index of `D` corresponded to which record, then there would be no way to parallelize this algorithm at all. – NoseKnowsAll Nov 07 '18 at 21:18
  • @NoseKnowsAll The condition I have is used to join the three tables A, B and C on their join attributes and put the result in table D. The counter gives me the number of matched records in the resultant table. That's why I feel the counter should be shared. Please let me know your comments. – Nancy Nov 09 '18 at 21:09
  • @Nancy ah well then in that case you can rewrite the entire thing by keeping separate D arrays for each processor. That way, the work in this for loop is embarrassingly parallel. Then at the end, you can sequentially merge the arrays into one large `D` array (a minimal sketch of this idea follows the comments). – NoseKnowsAll Nov 10 '18 at 03:31
  • @NoseKnowsAll I have one more question. I am running my code on a cluster and it runs fine there, but on my local machine it gives a memory error. Since it runs without any memory issue on the cluster, do I need to worry about why it is not running on my local machine? – Nancy Nov 10 '18 at 06:44
  • @Nancy That very well could be the sign of a bug. It's probably worth the effort to figure out why it's not working on your local machine. It might save you a lot of headache later. – NoseKnowsAll Nov 10 '18 at 18:06
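
To make the comment above concrete, here is a minimal sketch of the per-thread-array idea. It reuses the placeholder names from the question (noofrecords, A, D, condition), assumes the element type is double as in the second answer's code, and introduces CAP, a per-thread capacity you would have to size yourself. The merge here uses a critical section; the second answer below does the same merge with an ordered loop instead.

#include <stdlib.h>

int ctd = 0;
#pragma omp parallel shared(A, D, ctd)
{
    int local_n = 0;
    /* CAP: the largest number of matches you expect one thread to produce */
    double *local = malloc(sizeof *local * CAP);

    #pragma omp for nowait
    for (int j = 0; j < noofrecords; j++)
        for (int k = 0; k < noofrecords; k++)
            for (int l = 0; l < noofrecords; l++)
                if (condition)
                    local[local_n++] = A[i].a1;   /* A[i] as written in the question */

    #pragma omp critical   /* append this thread's matches to the shared D */
    {
        for (int m = 0; m < local_n; m++)
            D[ctd + m].a1 = local[m];
        ctd += local_n;
    }
    free(local);
}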

2 Answers


You can use the collapse clause. In your case you have three perfectly nested for loops, so it would be something like: #pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd) collapse(3)

It will work if the for loops are perfectly nested, i.e. they follow each other directly and all of the work sits in the inner-most loop (which is the case in the code you posted). If noofrecords is much bigger than your maximum thread count, the extra speedup from collapsing won't be impressive. If it's still slow in parallel, the bottleneck is probably not your processing power; more likely the RAM is already working at 100%.

Also, I'm not so sure you really want that private(j,k,l)...
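
As a rough sketch, that would look like the following. It reuses the placeholder names from the question, and the atomic capture on ctd is my own addition, included only so the shared counter does not race (the separate correctness issue raised in the comments under the question):

int ctd = 0;
#pragma omp parallel for collapse(3)       /* j, k, l are implicitly private */
for (int j = 0; j < noofrecords; j++)
    for (int k = 0; k < noofrecords; k++)
        for (int l = 0; l < noofrecords; l++)
            if (condition)
            {
                int idx;
                #pragma omp atomic capture
                idx = ctd++;               /* reserve a slot in D without racing */
                D[idx].a1 = A[i].a1;       /* A[i] as written in the question    */
            }

With collapse(3) the three loop variables are private automatically, which is why the private(j,k,l) clause isn't needed. The atomic makes the counter safe but also serializes it, so if the condition matches very often, the per-thread-array approach in the other answer will scale better.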

L.C.
  1. Create a temporary array a1, with the same element type as D.a1, with a number of elements equal to the maximum value you expect ctd to reach.
  2. Create a private temporary array a2 of the same type for each thread.
  3. Fill a2 in parallel, using ctd2 to count the number of elements written to a2.
  4. Copy each thread's a2 into a1 in order, and add ctd2 to ctd.
  5. Write from a1 to D.a1 in parallel.

Something like this

#include <stdlib.h>   // malloc, free
#include <string.h>   // memcpy
#include <omp.h>      // omp_get_num_threads

int ctd=0;
double *a1 = malloc(sizeof *a1 * N);                       //Step 1
#pragma omp parallel
{
  int ctd2 = 0;
  double *a2 = malloc(sizeof *a2 * N);                     //step 2

  #pragma omp for nowait
  for(int j=0; j<noofrecords; j++)
  for(int k=0; k<noofrecords; k++)
  for(int l=0; l<noofrecords; l++)
    if(condition) a2[ctd2++] = A[i].a1;                    //step 3

  #pragma omp for schedule(static) ordered
  for(int i=0; i<omp_get_num_threads(); i++)
    #pragma omp ordered
    memcpy(&a1[ctd], a2, sizeof *a1 * ctd2), ctd += ctd2;  //step 4

  #pragma omp for
  for(int j=0; j<ctd; j++) D[j].a1 = a1[j];                // step 5
  free(a2);
}
free(a1);

N should be the maximum size you expect ctd to reach. One memory inefficiency in this method is that each a2 is allocated with size N as well, which may be much larger than any single thread needs. A dynamic vector like std::vector in C++ or GArray in glib would be more memory efficient.
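
In plain C, a minimal sketch of such a growable per-thread buffer could look like the following; buf_t and push_match are hypothetical names introduced here only for illustration, and error checking on realloc is omitted for brevity:

#include <stdlib.h>

/* Hypothetical growable buffer to replace the fixed-size a2 above. */
typedef struct {
    double *data;
    int     size;   /* elements currently stored    */
    int     cap;    /* elements the buffer can hold */
} buf_t;

/* Append one value, doubling the capacity whenever the buffer is full. */
static void push_match(buf_t *b, double value)
{
    if (b->size == b->cap) {
        b->cap  = b->cap ? 2 * b->cap : 1024;
        b->data = realloc(b->data, sizeof *b->data * b->cap);
    }
    b->data[b->size++] = value;
}

Each thread would then declare buf_t a2 = {0}; inside the parallel region, call push_match(&a2, A[i].a1) instead of a2[ctd2++] = A[i].a1, use a2.size in place of ctd2 for the memcpy, and free(a2.data) at the end.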

Z boson