
I have some code that I parallelized using OpenMP, and it seems to run slower than the serial version. Here's the relevant fragment of the code:

Out_props ion_out;

#pragma omp parallel for firstprivate(Egx,Egy,vi_inlet,dt,xmin,xmax,ymin,ymax,qmi,dy,Nx) private(ion_out)
for (int i=0;i<Np;i++)
{
    ion_out = ApplyReflectionBC(dt,Nx,xmin,xmax,ymin,ymax,qmi,dy,vi_inlet,Egx,Egy,xi_i[2*i],xi_i[1+2*i],vi_i[2*i],vi_i[1+2*i]);

    xi_o[1-1+2*i]=ion_out.xout;
    xi_o[2-1+2*i]=ion_out.yout;
    vi_o[1-1+2*i]=ion_out.vxout;
    vi_o[2-1+2*i]=ion_out.vyout;
}

Here `Out_props` is just a structure with 4 members of type `double`. The `ApplyReflectionBC` function (given below) just applies some operations for each `i`; all these operations are completely independent of each other. `Egx` and `Egy` are 60x60 matrices defined prior to entering this loop, and `vi_inlet` is a 2x1 vector. I've tried making `ion_out` an array of size `Np` to further increase independence, but that seems to make no difference. Everything else inside `firstprivate` is a `double` defined prior to entering this loop.

I'd really appreciate any insights into why this might be running many times slower than the serial version. Thank you!

Out_props ApplyReflectionBC(double dt,int Nx,double xmin,double xmax,double ymin, double ymax,double qmp, double dy, double *vp_inlet,double *Egx,double *Egy, double xpx,double xpy,double vpx,double vpy)
{
    Out_props part_out;
    double Lgy=ymax-ymin;
    double xp_inp[2]={xpx,xpy};
    double vp_inp[2]={vpx,vpy};
    double xp_out[2];
    double vp_out[2];

    struct vector
    {
        double x;
        double y;
    }vnmf,Ep,xnmf;


    if((xp_inp[1-1]>xmin) && (xp_inp[1-1]<xmax) && (xp_inp[2-1]<ymin)) //ONLY below lower wall
    {
       xp_out[1-1]=xp_inp[1-1];
       xp_out[2-1]=ymin;

       vp_out[1-1]=vp_inp[1-1];
       vp_out[2-1]=-vp_inp[2-1];
    }
    else if((xp_inp[1-1]<xmin) || (xp_inp[1-1]>xmax) || (xp_inp[2-1]>ymax))
    {//Simple Boris Push
        xnmf.x=xmin;
        xnmf.y=ymin+Lgy*rand()/RAND_MAX;

        vnmf.x=vp_inlet[0];
        vnmf.y=vp_inlet[1];

        //Find E field at x,y
        double yjp=ymin+dy*floor((xnmf.y-ymin)/(1.0*dy));
        double yjp1p=yjp+dy;

        int kp=(yjp-ymin)/dy;

        int kpp1=kp+1;
        double ylg=xnmf.y-yjp;

        double wjk=1.0*(dy-ylg)/(1.0*dy);
        double wjkp1=1.0*ylg/(1.0*dy);

        Ep.x=wjk*Egx[Nx*kp]+wjkp1*Egx[Nx*kpp1];

        Ep.y=wjk*Egy[Nx*kp]+wjkp1*Egy[Nx*kpp1];

        do
        {
            double f=1.0*rand()/RAND_MAX;
            xp_out[1-1]=xnmf.x+f*dt*(vnmf.x+qmp*Ep.x*f*dt/2.0);
            xp_out[2-1]=xnmf.y+f*dt*(vnmf.y+qmp*Ep.y*f*dt/2.0);

            vp_out[1-1]=vnmf.x+qmp*Ep.x*(f-0.5)*dt;
            vp_out[2-1]=vnmf.y+qmp*Ep.y*(f-0.5)*dt;

        } while((xp_out[1-1]<xmin) || (xp_out[1-1]>xmax) || (xp_out[2-1]<ymin)  || (xp_out[2-1]>ymax));
    }
    else
    {
        xp_out[1-1]=xp_inp[1-1];
        xp_out[2-1]=xp_inp[2-1];

        vp_out[1-1]=vp_inp[1-1];
        vp_out[2-1]=vp_inp[2-1];
    }

    part_out.xout=xp_out[0];
    part_out.yout=xp_out[1];
    part_out.vxout=vp_out[0];
    part_out.vyout=vp_out[1];

    return part_out;
}
    You might find [that question](http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one) very relevant to your case. – Hristo Iliev Apr 16 '14 at 08:51
  • What's the point of the multiplications with `1.0`, as in `double wjk=1.0*(dy-ylg)/(1.0*dy);`? Hopefully the compiler optimises this nonsense away, but why do this in the first place? – Walter Apr 16 '14 at 12:52
  • This'll sound stupid. A long time ago, I had a stupid compiler that would need me to do the 1.0 multiplication, which then just became a habit. :-/ No other reason. Thank you for your response! – Siddharth Krishnamoorthy Apr 16 '14 at 21:42
  • @HristoIliev I don't see any call to `rand` in his code, how is that a duplicate? – bennofs Apr 17 '14 at 13:23
  • @bennofs, perhaps you should take a closer look? – Hristo Iliev Apr 17 '14 at 13:40
  • @HristoIliev Oh, I see it now, maybe I should just use search :D sorry – bennofs Apr 17 '14 at 13:42
  • @HristoIliev, I tried a couple of things, but to no avail. I first made the seed private to a thread and passed it to ApplyParticleBCmex. That didn't make it any faster than the serial version, so I then looked at the thread you posted and used erand48 as you had suggested in that post (again with the seed declared firstprivate), and that seems to slow things down even more. Might you have any idea what I might be doing wrong? Thank you. – Siddharth Krishnamoorthy Apr 21 '14 at 23:20

1 Answer


Some points:

First, the `firstprivate` clause creates a copy of the listed variables on each thread's stack, and that copying takes some time. Since these variables won't be changed inside the loop (i.e., they are read-only), you may declare them as `shared` instead.
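As a minimal sketch (assuming nothing in the loop body writes to these inputs), the loop above would then look like:

#pragma omp parallel for private(ion_out)
for (int i=0;i<Np;i++)
{
    // dt, Nx, xmin, xmax, ymin, ymax, qmi, dy, vi_inlet, Egx and Egy are
    // shared by default, which is fine because they are only read here
    ion_out = ApplyReflectionBC(dt,Nx,xmin,xmax,ymin,ymax,qmi,dy,vi_inlet,Egx,Egy,
                                xi_i[2*i],xi_i[1+2*i],vi_i[2*i],vi_i[1+2*i]);

    xi_o[2*i]=ion_out.xout;
    xi_o[1+2*i]=ion_out.yout;
    vi_o[2*i]=ion_out.vxout;
    vi_o[1+2*i]=ion_out.vyout;
}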

Second, with less impact, the `ApplyReflectionBC` function takes everything by value, so it creates local copies of each argument on every call. Use references (`double &dt`, for example).
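Purely as an illustration of that suggestion (the comments below question whether it actually helps), the prototype would look something like this, with `const` added since these inputs are read-only:

Out_props ApplyReflectionBC(const double &dt, int Nx,
                            const double &xmin, const double &xmax,
                            const double &ymin, const double &ymax,
                            const double &qmp, const double &dy,
                            const double *vp_inlet,
                            const double *Egx, const double *Egy,
                            double xpx, double xpy, double vpx, double vpy);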

Edit:

As Hristo pointed out, rand() is the source of your problems. You must replace it with some other random number generator function. For both better random numbers and thread-safety, you may use this Mersenne Twister class (if the LGPL 2.1 isn't a problem): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/MersenneTwister.h . Just declare it private to your threads, like:

MTRand rng;
#pragma omp parallel for private(rng, ...)
for (..)
{
  ApplyReflectionBC(..., rng);
}

Out_props ApplyReflectionBC(...,MTRand &rng)
{

  // .... Code ....

  xnmf.y=ymin+Lgy*rng.rand(); // MTRand::rand will return a number in the range [0; 1]

  // ........
}
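If the LGPL header is not an option, the same per-thread-RNG idea can be sketched with the standard <random> facilities (assuming a C++11 compiler). This is only an illustration of the approach, not the original code:

#include <omp.h>
#include <random>

void push_particles(int Np)  // Np: number of particles, as in the question
{
    #pragma omp parallel
    {
        // One engine per thread, seeded differently, so the threads neither
        // share hidden state (as rand() does) nor draw identical sequences.
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        #pragma omp for
        for (int i = 0; i < Np; i++)
        {
            double f = uni(gen); // replaces 1.0*rand()/RAND_MAX
            // ... call ApplyReflectionBC here, passing gen (or the drawn
            //     values) in, instead of calling rand() inside the function
            (void)f; // placeholder so the sketch compiles on its own
        }
    }
}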
Bruno Ferreira
  • And how exactly is passing a 64-bit pointer (e.g. on x86_64) faster than passing a 64-bit `double` by value? Besides, the System V AMD64 ABI allows for eight `double` arguments to be passed in the XMM register file while only six integers or pointers/references could be passed in registers. – Hristo Iliev Apr 16 '14 at 08:39
  • No, the cause of the trouble is, as Hristo correctly pointed out, the use of `rand()` inside the parallel region. This will not only be slow (when used in parallel), but will also not produce the desired randomness. – Walter Apr 16 '14 at 12:55
  • Hm, I never knew that `rand()` would cause that trouble (I never use it when I need random numbers anyway, I always use an MT class). Thanks for pointing that out @HristoIliev – Bruno Ferreira Apr 16 '14 at 14:57
  • Yeah thanks a lot for that tip. I had no idea rand had time problems. I've used the MT class before, I just didn't bother this time. – Siddharth Krishnamoorthy Apr 16 '14 at 21:42
  • The reason I didn't use shared and used firstprivate instead is that I thought different threads might try to access Egy and Egx at the same time, causing a hold-up. I don't know which is slower: that, or creating private copies for all threads... thoughts? – Siddharth Krishnamoorthy Apr 16 '14 at 21:45