
As far as I know, this program should get a speedup of 2 or more when run with 2 threads. Instead, I get pretty much the same time as the serial version.

static void proc_paralelo(int n, char *vprimos, int nthr) {

    omp_set_num_threads(nthr);

    int i, j, prim, posiciones;
    int raiz_n = sqrt(n);

    for (i = 1; i < raiz_n; i++) {
        if (vprimos[i] == 0) {
            prim = i + 1;
            posiciones = ceil((float)(n - (i + prim)) / (float)prim);
#pragma omp parallel for private(j) schedule(static, posiciones/omp_get_num_threads())
            for (j = 0; j < posiciones; j++) {
                vprimos[i + prim + (j * prim)] = 1;
            }
        }
    }
}

The number of threads I use is 2 (my processor's cores) and the size of n is 20000000.

The times I get are:

  • serially: 650000000 ns
  • in parallel: 630000000 ns
Gabriel's Messanger

2 Answers


By running two threads, you will never (edit: rarely, see comments) see more than a 2x speedup. In fact, because no job is perfectly parallelizable, you will likely see less than that. Consider also that starting a new thread takes considerable resources: for CPU-bound workloads, unless the work is heavy enough to saturate the CPU for longer than it takes to spin up a new thread, you will likely see no gains, and may even see performance losses. You will also be limited by contention for whatever hardware and software resources your threads share; see the comments for some examples.

Patrick87
  • There can also be issues such as contention for shared CPU cache, increased paging, and more, that all cut into the speedup that can be realized from parallelization. – John Bollinger Nov 13 '15 at 18:37
  • Once in a while, you can see a speedup over 2x from using two cores. An obvious example would be if you're using data that won't fit in one core's cache, but will fit in two cores' caches. In such a case, eliminating a lot of main memory access can improve speed by substantially more than 2x. It's not common, but I have seen it. – Jerry Coffin Nov 13 '15 at 18:56
  • @JerryCoffin Now that you mention it, I recall reading about that - and your scenario makes sense. I've not seen it either but I guess the take-away is that saying "never" is pretty silly when talking about the real world. – Patrick87 Nov 13 '15 at 20:02

It looks to me like the problem here is that your code is almost certainly memory bound, and using a second core isn't increasing your memory bandwidth.

In particular, your vprimos is apparently about 20 megabytes, which is too much to fit into cache (at least on most processors). The actual calculation you're doing for a single iteration is utterly trivial (compute an address and write a 1), so even for a single core, you're probably mostly memory bound. Adding a second core saves a tiny bit (probably for parts that are in the cache) but that's about it.

In this case, one obvious gain would be from using a single bit to store each Boolean instead of using an entire char. Although it adds some computational overhead, it'll probably save enough memory bandwidth to more than compensate.

One example I threw together a while back (also of the Sieve of Eratosthenes) seems to run about six times as fast as yours for the same size (though it is C++ instead of C).

Jerry Coffin