
I'm trying to use OpenMP to parallelise the function below, which builds a 3D grid. The classes Point, Triangle, and Box have been declared and defined, but their code is not pasted here because it's quite long (about 110 lines). If you need it, I can edit the question to include it.

I've tried putting #pragma omp parallel for on the outermost for() loop, and the program crashed when I ran it. Then I tried the #pragma directives shown below, and the program took longer to run than the serial version. I referred to this link here when implementing it. First of all, is it even possible to parallelise the code below with OpenMP? If yes, then I really need your help with this.

int main()
{
    vector<Box> boxes;
    Triangle dummy1;

    float dx = 81.0, dy = 121.0, dz = 98.0, delta=1.0;
    long int nx = 5*ceil(dx/(5.0*delta));
    long int ny = 5*ceil(dy/(5.0*delta));
    long int nz = 5*ceil(dz/(5.0*delta));
    long int Box_ID=1, Pt_ID=1;
    float x_val=0.0, y_val=0.0, z_val=0.0;
    float x0=-42.0f, y0=-3.0f, z0=-52.0f;

    long int i, j, k;
    for(i=0; i<nz+2; i++)
    {
        z_val=i*delta + z0;
        for(j=0; j<ny+2; j++)
        {
            y_val=j*delta + y0;
            #pragma omp parallel
            {
                vector<Box> box_private;
                #pragma omp for nowait schedule(static)
                for(k=0; k<nx+2; k++)
                {
                    x_val=k*delta + x0;
                    Point x1(x_val, y_val, z_val, Pt_ID);
                    Point x2(x_val+delta, y_val, z_val, Pt_ID+1);
                    Point x3(x_val, y_val+delta, z_val, Pt_ID+2);
                    Point x4(x_val+delta, y_val+delta, z_val, Pt_ID+3);
                    Point x5(x_val, y_val, z_val+delta, Pt_ID+4);
                    Point x6(x_val+delta, y_val, z_val+delta, Pt_ID+5);
                    Point x7(x_val, y_val+delta, z_val+delta, Pt_ID+6);
                    Point x8(x_val+delta, y_val+delta, z_val+delta, Pt_ID+7);
                    box_private.push_back(Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,Box_ID));
                    Box_ID++;
                    Pt_ID++;
                }
                #pragma omp for schedule(static) ordered
                for(int i=0; i<omp_get_num_threads(); i++)
                {
                    #pragma omp ordered
                    boxes.insert(boxes.end(), box_private.begin(), box_private.end());
                }
            }
        }
    }
}

The program crashes when I use the code below instead of the version above.

    #pragma omp parallel for
    for(i=0; i<nz+2; i++)
    {
        z_val=i*delta + z0;
        for(j=0; j<ny+2; j++)
        {
            y_val=j*delta + y0;
            for(k=0; k<nx+2; k++)
            {
                x_val=k*delta + x0;
                /* Point x1 to x8...*/
                boxes.push_back(Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,Box_ID));
                Box_ID++;
                Pt_ID++;
            }
        }
    }
Amos Chew
  • Probably you have too little code running inside the parallel section, so the vector copying and thread-synchronization overhead eliminates all of the multithreading benefit. – Lol4t0 Apr 18 '15 at 18:44
  • You may gain performance if you still try to parallelize the outermost loop, but I can't say right now why it crashed when you tried. – Lol4t0 Apr 18 '15 at 18:47
  • @Lol4t0 I myself am not sure why it crashes. I've read some guides that mention parallelising the outermost loop. It does help, but I'm not sure why it won't work for this code. – Amos Chew Apr 18 '15 at 19:20
  • I mean, you should probably show the code, with the outermost loop parallelized, that breaks. – Lol4t0 Apr 18 '15 at 19:34
  • @Lol4t0 I've edited the question above. Basically I just plugged in the #pragma line above and it crashed when I ran it. – Amos Chew Apr 18 '15 at 19:46

2 Answers


The problem with your code is that it doesn't account for the difference between shared and private variables, and therefore you have multiple race conditions. Variables defined inside a parallel region are private and those defined outside are shared. Since you defined everything outside the parallel region, everything is shared except for box_private and the loop variable k, which OpenMP makes private anyway.
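
As a minimal illustration of that rule (a generic sketch, not the OP's code):

int a = 0;                 // declared outside the parallel region: shared by default
#pragma omp parallel
{
    int b = 0;             // declared inside: each thread gets its own private copy
    a++;                   // data race: every thread writes the same shared 'a'
    b++;                   // fine: 'b' is private to each thread
}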

But even making the appropriate variables private won't fix your problem, because of Box_ID++ and Pt_ID++. You could fix those with atomic, but that's unnecessary and inefficient. If you instead compute Box_ID = (ny+2)*(nx+2)*i + (nx+2)*j + k; from the loop indices (and the same for Pt_ID), every iteration gets a unique ID without any shared counter, and your code should work.
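
For illustration only, the counter race could also be removed with an atomic capture; this is just a sketch, not the recommended fix, since every iteration then serializes on the shared counter (my_box_id is an illustrative name):

// atomically fetch-and-increment the shared counter: correct but slow
long my_box_id;
#pragma omp atomic capture
my_box_id = Box_ID++;

With the IDs computed from the loop indices instead, the inner part of your code becomes: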

for(long i=0; i<nz+2; i++) {
    float z_val=i*delta + z0;
    for(long j=0; j<ny+2; j++) {
        float y_val=j*delta + y0;
        #pragma omp parallel
        {
            vector<Box> box_private;
            #pragma omp for nowait schedule(static)
            for(long k=0; k<nx+2; k++) {
                float x_val=k*delta + x0;
                long Box_ID = (ny+2)*(nx+2)*i + (nx+2)*j + k;
                long Pt_ID  = (ny+2)*(nx+2)*i + (nx+2)*j + k;
                Point x1(x_val, y_val, z_val, Pt_ID);
                // Point x2 - x8
                box_private.push_back(Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,Box_ID));
            }
            // merge each thread's boxes into the shared vector, as in your original code
            #pragma omp for schedule(static) ordered
            for(int t=0; t<omp_get_num_threads(); t++)
            {
                #pragma omp ordered
                boxes.insert(boxes.end(), box_private.begin(), box_private.end());
            }
        }
    }
}

But I think you should step back and think a bit about what you want to do. If you know ahead of time how many boxes you will fill then there is no reason to use push_back or a private vector. In fact you should be able to do

#pragma omp parallel for schedule(static)
for(long i=0; i<nz+2; i++) {
    float z_val=i*delta + z0;
    for(long j=0; j<ny+2; j++) {
        float y_val=j*delta + y0;
        for(long k=0; k<nx+2; k++) {
            float x_val=k*delta + x0;
            long Box_ID = (ny+2)*(nx+2)*i + (nx+2)*j + k;
            long Pt_ID  = (ny+2)*(nx+2)*i + (nx+2)*j + k;
            Point x1(x_val, y_val, z_val, Pt_ID);
            // Point x2 - x8
            boxes[(ny+2)*(nx+2)*i + (nx+2)*j + k] = Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,Box_ID);
        }
    }
}

where boxes is a C array, or, if it's a std::vector, make sure to resize it first.
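
If boxes stays a std::vector, a minimal sketch of that sizing step could look like this (it assumes Box is default-constructible):

// allocate all (nz+2)*(ny+2)*(nx+2) slots once, before the parallel loop,
// so every iteration writes its own element and no push_back is needed
boxes.resize((nz + 2) * (ny + 2) * (nx + 2));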

Lastly, you can collapse the three loops if you move z_val and y_val inside the loop over k together with x_val and then use #pragma omp parallel for schedule(static) collapse(3). But you can also do it by hand like this

long x = nz + 2, y = ny + 2, z = nx + 2;
#pragma omp parallel for schedule(static)
for(long n=0; n<(x*y*z); n++) {
    long i = n/(y*z);
    long j = (n%(y*z))/z;
    long k = (n%(y*z))%z;
    float z_val = i*delta + z0;
    float y_val = j*delta + y0;
    float x_val = k*delta + x0;
    Point x1(x_val, y_val, z_val, n+1);
    // Point x2 - x8
    boxes[n] = Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,n+1);
}
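
For comparison, a sketch of the collapse(3) variant mentioned above might look roughly like this (again assuming boxes has been resized to (nz+2)*(ny+2)*(nx+2) beforehand):

#pragma omp parallel for schedule(static) collapse(3)
for(long i=0; i<nz+2; i++) {
    for(long j=0; j<ny+2; j++) {
        for(long k=0; k<nx+2; k++) {
            // everything per-iteration is declared here, so nothing but boxes is shared
            float z_val = i*delta + z0;
            float y_val = j*delta + y0;
            float x_val = k*delta + x0;
            long n = (ny+2)*(nx+2)*i + (nx+2)*j + k;
            Point x1(x_val, y_val, z_val, n+1);
            // Point x2 - x8
            boxes[n] = Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,n+1);
        }
    }
}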

Do you really need Box_ID and Pt_ID then?

Z boson
  • Yes, I do need `Box_ID` and `Pt_ID`. I had already tried what you just mentioned (the last method, combining the 3 for loops) before this. But now I'm facing a new problem: if I change `delta=0.5`, I get a `...Runtime terminated...std::bad_alloc()...`, which I notice happens when `box_private.size()` grows larger than `box_private.capacity()`. I posted this problem [here](http://stackoverflow.com/questions/29735189/situations-faced-in-openmp-on-for-loops) and am still waiting for help. I've been researching and stuck on this for quite some time. – Amos Chew Apr 20 '15 at 08:23
  • @AmosChew, your current code does not show that you need them, or why you need a `std::vector` at all. – Z boson Apr 20 '15 at 08:26
  • Sorry, I forgot to mention that I still use the vectors, but with a similar implementation combining the 3 for loops. – Amos Chew Apr 20 '15 at 08:28
  • Oh, your method works even for smaller `delta` values; indeed, changing the capacity and resizing beforehand worked. Sorry, please ignore my first comment, and many thanks for your help. I'll change the other parts as well. Actually, if it's still done with vectors, as in `push_back` and so on, is it possible to resize and change the capacity in the middle of the parallelisation? – Amos Chew Apr 20 '15 at 08:45
  • See my last code example [here](http://stackoverflow.com/questions/18669296/c-openmp-parallel-for-loop-alternatives-to-stdvector/18671256#18671256) for an example of resizing. – Z boson Apr 20 '15 at 08:50
  • Ahh, I've tried your code as well and it still gives the `std::bad_alloc` problem. Is there a way to change `vec_private.capacity()`, as with a `vec_private.reserve()` call, somewhere in between? Otherwise it still terminates at runtime with `bad_alloc`. – Amos Chew Apr 20 '15 at 09:03

Your first code variant, with the parallelism on the inner loops, runs slowly because the overhead of managing threads and merging vectors takes too much time compared to the actual work being done. Another reason for the slowdown is the extensive use of shared variables; accessing them from multiple threads is costly. Generally you should try to minimize the use of shared variables. I believe good style is to specify default(none) for your parallel section and explicitly list the variables you want to be shared.
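
As a rough sketch of that style (the variable list is only illustrative):

// default(none) forces every variable used inside the region to be listed
// explicitly, so accidental sharing (and the races it causes) becomes visible
#pragma omp parallel default(none) shared(boxes, dummy1, nx, ny, nz, delta, x0, y0, z0)
{
    // per-thread work; anything declared inside the region is private
}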

To improve the ratio of actual work to management overhead, you should make parallel regions as large as possible and use as little synchronization between threads as possible. So you should parallelize your outermost loop for the best effect.

In your second approach, however, you forgot about synchronization. You cannot insert into the same boxes vector from different threads. You could apply the same synchronization mechanism you used in your first approach, in exactly the same manner.
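
A rough sketch of what that could look like, parallelizing the outermost loop with a thread-private vector and merging through a critical section rather than the ordered trick (the IDs are computed from the loop indices so that no shared counters are needed):

#pragma omp parallel
{
    vector<Box> box_private;                  // each thread collects its own boxes
    #pragma omp for schedule(static) nowait
    for(long i=0; i<nz+2; i++) {
        float z_val = i*delta + z0;
        for(long j=0; j<ny+2; j++) {
            float y_val = j*delta + y0;
            for(long k=0; k<nx+2; k++) {
                float x_val = k*delta + x0;
                long id = (ny+2)*(nx+2)*i + (nx+2)*j + k + 1;
                Point x1(x_val, y_val, z_val, id);
                // Point x2 - x8
                box_private.push_back(Box(x1,x2,x3,x4,x5,x6,x7,x8,dummy1,id));
            }
        }
    }
    // only one thread appends at a time; note that the final order of boxes
    // is not deterministic across runs
    #pragma omp critical
    boxes.insert(boxes.end(), box_private.begin(), box_private.end());
}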

Lol4t0
  • I followed what you said and it worked. And when I change to `delta=0.9` and run it, the time taken is shorter by half a second; I believe the result would be even better for smaller `delta` values. – Amos Chew Apr 18 '15 at 20:13
  • 1
    The OPs code is slow because it has several race conditions none of which you addressed. – Z boson Apr 20 '15 at 07:36
  • @Zboson, you are right. I didn't study the original code attentively and only pointed out the most obvious things. I won't drastically change my answer now, since you have already covered those points in yours. – Lol4t0 Apr 20 '15 at 11:37