
I am doing a simple Pi calculation where I parallelize the loop in which random numbers are generated and a count is incremented. The serial (non-OpenMP) code performs better than the OpenMP code. Here are some of the measurements I took. Both codes are provided below.

Compiled the serial code as: gcc pi.c -O3

Compiled the OpenMP code as: gcc pi_omp.c -O3 -fopenmp

What could be the problem?

# Iterations = 60000000

Serial Time = 0.893912

OpenMP 1 Threads Time = 0.876654
OpenMP 2 Threads Time = 23.8537
OpenMP 4 Threads Time = 7.72415

Serial Code:

/* Program to compute Pi using Monte Carlo methods */
/* from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html */

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246

int main(int argc, char *argv[])
{
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  count=0;
  struct timeval start, end;
  gettimeofday(&start, NULL);
  for ( i=0; i<niter; i++) {
    x = (double)rand()/RAND_MAX;
    y = (double)rand()/RAND_MAX;
    z = x*x+y*y;
    if (z<=1) count++;
  }
  pi=(double)count/niter*4;

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  printf("# of trials= %d , estimate of pi is %lg \n",niter,pi);
  return 0;
}

OpenMP Parallel Code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define SEED 35791246
/*
from: http://www.dartmouth.edu/~rc/classes/soft_dev/C_simple_ex.html
 */
#define CHUNKSIZE 500
int main(int argc, char *argv[]) {

  int chunk = CHUNKSIZE;
  int niter=0;
  double x,y;
  int i;
  long count=0; /* # of points in the 1st quadrant of unit circle */
  double z;
  double pi;

  int nthreads, tid;

  printf("Enter the number of iterations used to estimate pi: ");
  scanf("%d",&niter);

  /* initialize random numbers */
  srand(SEED);
  struct timeval start, end;

  gettimeofday(&start, NULL);
  #pragma omp parallel shared(chunk) private(tid,i,x,y,z) reduction(+:count)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    //printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }

    #pragma omp for schedule(dynamic,chunk)
    for ( i=0; i<niter; i++) {
      x = (double)rand()/RAND_MAX;
      y = (double)rand()/RAND_MAX;
      z = x*x+y*y;
      if (z<=1) count++;
    }
  }

  gettimeofday(&end, NULL);
  double t2 = end.tv_sec + (end.tv_usec/1000000.0);
  double t1 = start.tv_sec + (start.tv_usec/1000000.0);

  printf("Time: %lg\n", t2 - t1);

  pi=(double)count/niter*4;                                                                                   
  printf("# of trials= %d, threads used: %d, estimate of pi is %lg \n",niter,nthreads, pi);
  return 0;
}
Bibrak
  • You don't give your run times number of iterations or anything. Open MP does take a very long time to initialize though. – camelccc Apr 16 '17 at 21:46
  • @camelccc Could you please explain? – Bibrak Apr 16 '17 at 21:48
  • @camelccc, OpenMP does have some overhead, but that in no way explains the timings presented at the beginning of the question. – John Bollinger Apr 16 '17 at 22:05
  • `%lg` is not a valid field designator, though perhaps your C library accepts it as an extension. `printf()` cannot distinguish between `float` and `double` arguments (because of the nature of varargs functions), so you don't need a length modifier to tell it you're sending a `double`. – John Bollinger Apr 16 '17 at 22:16
  • Possible duplicate of [OpenMP program is slower than sequential one](http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one) (looks like the third dupe of that within the last couple of hours, *sigh*). – Zulan Apr 16 '17 at 22:46

2 Answers


In this particular case there are many possibilities. Since OpenMP takes on the order of 10K to 100K cycles just to start a parallel loop, getting a performance improvement out of OpenMP is non-trivial.

On top of that, there is the additional problem that rand() is not re-entrant: http://man7.org/linux/man-pages/man3/rand.3.html

So most likely rand() can only be called by one thread at a time. Since your loop does little else, your OpenMP version is essentially single-threaded, with additional contention overhead every time rand() is called, hence the dramatic slowdown.

camelccc
  • Whereas it's certainly true that `rand()` is not thread safe, calls to it are not automatically serialized. It is not plausible that the slowdown would be related to calls to this function. – John Bollinger Apr 16 '17 at 22:18
  • Not plausible, I disagree. It's undefined behaviour for sure. What any implementation does with UB is anything, whether serializing, or failing to give random data is a guess. There are many things that serialize, though we don't even know what version of glibc he is linking against – camelccc Apr 16 '17 at 22:31
  • Yes, because there is a data race, the program exhibits UB -- that would be a good point to raise as a comment on the question. Yes, serializing threads' runs through the function is a *possible* behavior under the circumstances. But possible is not the same thing as *plausible*. And even if it were plausible, you've essentially just said that you're guessing. Moreover, your guess does not explain why the OP reports much longer runtime for two threads than for four. – John Bollinger Apr 16 '17 at 22:43
  • @JohnBollinger It is absolutely plausible (in fact most likely) that the performance issue is caused by `rand`. As clearly demonstrated by the hordes of dupes of http://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one. – Zulan Apr 16 '17 at 22:53
  • @Zulan, you're right, my earlier statement was too broad. I should have simply said that the mechanism proposed in this answer -- automatic serialization of calls to `rand()` -- is not a plausible explanation for the observations. There is absolutely no reason to think that such serialization occurs. – John Bollinger Apr 16 '17 at 23:25
  • @JohnBollinger Well, glibc [actually does serialize `rand`](http://code.metager.de/source/xref/gnu/glibc/stdlib/random.c#287). Of course that doesn't mean it's portably safe to use `rand` like that. Not sure what you mean by *automatic serialization*. – Zulan Apr 16 '17 at 23:40
  • @Zulan, although that's a bit of a surprise, what glibc actually does isn't really the point of my objection. The point is that this answer (1) speculates (2) that *because `rand()` is not thread-safe* the implementation serializes calls to `rand()`, as if that's something that you can expect for such functions. That doesn't even make sense -- not being able to expect it is what makes them non-thread-safe. Finally, (3) the answer does not explain why, if the problem is thread contention overhead, the 4-thread version is so much faster than the 2-thread. – John Bollinger Apr 17 '17 at 02:47
  • @JohnBollinger I honestly can't think why you are surprised by this. The number of calls that end up serialized is endless, even if for no other reason than heap allocation. – camelccc Apr 17 '17 at 03:18

rand() is not re-entrant. It will either not work properly, crash, or only be possible to call from one thread at a time. Libraries like glibc will often serialize or use TLS for legacy non-re-entrant functions rather than have them randomly crash when they get used in multi-threaded code.

Try the re-entrant form, rand_r():

tid = omp_get_thread_num();
unsigned int seed = tid;
...
x = (double)rand_r(&seed)/RAND_MAX;

I think you'll find it's much faster.

Notice how I set the seed to the tid. You might think, why not initialize the seed to SEED? Given the same seed, rand_r() will produce the same sequence of numbers. If each thread uses the same series of pseudo-random numbers, it defeats the point of doing more iterations! You've got to get each thread to use different numbers.

TrentP