
I'm trying to implement a dot product in OpenMP with large arrays allocated with malloc. However, when I use reduction(+:result) it produces different results on each program run. Why do I get different results? How can I remedy that? And how can this example be optimized? Here's my code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

const int N = 1e1;

int main ()
{
  int    i, nthreads, tid;
  double x_seq, x_par, *y, *z, cpu_time_used;
  clock_t start, end;

  y = (double*)malloc(sizeof(double)*N);
  z = (double*)malloc(sizeof(double)*N);

  for (i=0; i<N; i++) {
      y[i] = i * 1.0;
      z[i] = i * 2.0;
  }

  x_seq = 0;
  x_par = 0;
  for (i=0; i<N; i++) x_seq += y[i] * z[i];

  #pragma omp parallel shared(y, z) private(i, tid)
  {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();

      #pragma omp parallel for reduction(+:x_par)
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }

  }
  return 0;
}
neckutrek
  • Please tag your question with the programming language you are using. (I assume this is C?) Also, note that if this is C (not C++), then [don't cast the return value of `malloc()`](http://stackoverflow.com/a/605858/3488231). – user12205 May 13 '15 at 17:09
  • To answer your question `how can this example be optimized?` the answer is that [it's memory bandwidth bound so it won't scale with the number of threads though that does not mean multiple threads won't help](https://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). – Z boson May 15 '15 at 06:54
  • @ace, this is valid C/C++ code but not casting will make it invalid C++ code. Do you have anything useful to say about this question? – Z boson May 15 '15 at 07:04
  • @Zboson Check the revision history. The first revision did not tag the language, so I asked the OP to tag it accordingly. And since I noticed that OP was using `stdlib.h` and not `cstdlib`, I guessed it was C and not C++, so I added that it should not be casted if this is C. Please explain more if you still think what I did was wrong. Also, now that it *is* tagged as C, it makes my previous comment even more useful. – user12205 May 15 '15 at 07:42
  • @ace casting or not casting is a matter of opinion it's not something that is right or wrong. – Z boson May 15 '15 at 07:49
  • @Zboson So is writing unreadable code, or `using namespace std`, or numerous other bad programming practice. They may not be wrong, but they are bad. – user12205 May 15 '15 at 08:00

1 Answer


There are several things wrong here.

Let's look at the loop as it stands:

#pragma omp parallel shared(y, z) private(i, tid)
{
  #pragma omp single
  {
      nthreads = omp_get_num_threads();
  }
  tid = omp_get_thread_num();

  #pragma omp parallel for reduction(+:x_par)
  for (i=tid; i<N; i+=nthreads)
  {
      x_par += y[i] * z[i];
  }
}

So, first, notice that you (presumably) want x_par to be accessible outside of this region. So you'll want the reduction(+:x_par) on the outer section, not the inner. (This, incidentally, is where your run-to-run differences come from: with the reduction only on the inner construct, each outer thread's team writes its partial sum back into the shared x_par with no synchronization between the outer threads - a data race.) If you also add the very helpful default(none) clause, you'll find that there's no clause describing the sharing of nthreads; let's explicitly make that shared.

So let's look again:

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
  #pragma omp single
  {
      nthreads = omp_get_num_threads();
  }
  tid = omp_get_thread_num();

  #pragma omp parallel for 
  for (i=tid; i<N; i+=nthreads)
  {
      x_par += y[i] * z[i];
  }
}

So looking more carefully, we now see that you have two omp parallel sections. That means, if nested parallelism is enabled, you'll have nthreads tasks each launching nthreads tasks to do that loop; so the loop would end up with nthreads times the right answer if everything worked.
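You can see that multiplication with a tiny standalone probe (a sketch of mine, not part of the fix; omp_set_nested() is the pre-OpenMP-5.0 way to turn nesting on, and num_threads(2) just keeps the output small):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);  /* enable nested parallelism */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            /* with nesting on, this prints 4 lines: each of the 2 outer
               threads launches its own inner team of 2 */
            printf("outer thread %d, inner thread %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
    return 0;
}

So let's get rid of the parallel, just using the for: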

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
  #pragma omp single
  {
      nthreads = omp_get_num_threads();
  }
  tid = omp_get_thread_num();

  #pragma omp for 
  for (i=tid; i<N; i+=nthreads)
  {
      x_par += y[i] * z[i];
  }
}

So that has the sharing correct and isn't nesting parallelism, but it still doesn't give the right answer; it gives a much smaller result. What's wrong? Let's take a look at the for loop. Each thread wants to start at tid and step by nthreads, fine; but why do we want the omp for there?

Let's take a look at a simpler version which works:

#pragma omp parallel shared(y, z) reduction(+:x_par) default(none)
{
  #pragma omp for
  for (i=0; i<N; i++)
  {
      x_par += y[i] * z[i];
  }
}

Note that here we aren't decomposing the loop explicitly with tid and nthreads - we don't have to, because omp for decomposes the loop for us; it assigns loop iterations to threads.
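In fact, once you let omp for do the decomposition, the whole computation collapses to the standard one-directive dot-product idiom (the same code as above, just with the parallel and the for fused into one construct):

#pragma omp parallel for shared(y, z) reduction(+:x_par) default(none)
for (i=0; i<N; i++)
{
    x_par += y[i] * z[i];
}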

So looking back at what we have, we have a manual decomposition of the loop - which is fine, sometimes that's what you need to do - and an omp for which tries to take that loop and split it up amongst threads. But we're already doing that; the omp for simply makes us skip iterations here! With 4 threads, say, thread 0's manual loop already visits only i = 0, 4, 8, ..., and the omp for then hands it just a chunk of those iterations, so most of the products never get added at all - hence the too-small result. (Strictly, since each thread encounters the loop with different bounds, this isn't even conforming OpenMP: the loop control values must be the same across the team.)

So getting rid of the omp for

#pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
{
  #pragma omp single
  {
      nthreads = omp_get_num_threads();
  }
  tid = omp_get_thread_num();

  for (i=tid; i<N; i+=nthreads)
  {
      x_par += y[i] * z[i];
  }
}

gets us the right answer.
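For completeness, here's what a full program built around that fixed region might look like (a sketch: the x_seq cross-check and the omp_get_wtime() timing are my additions, and N is bumped up to something actually large; note that clock(), which the original code was set up to use, sums CPU time across all threads and so is misleading for timing parallel code):

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

const int N = 10000000;

int main ()
{
  int    i, nthreads, tid;
  double x_seq = 0, x_par = 0, t0, t1;
  double *y = malloc(sizeof(double)*N);
  double *z = malloc(sizeof(double)*N);

  for (i=0; i<N; i++) {
      y[i] = i * 1.0;
      z[i] = i * 2.0;
  }

  /* sequential reference result */
  for (i=0; i<N; i++) x_seq += y[i] * z[i];

  t0 = omp_get_wtime();
  #pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
  {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();

      /* manual round-robin decomposition; the reduction clause combines
         the per-thread partial sums when the region ends */
      for (i=tid; i<N; i+=nthreads)
          x_par += y[i] * z[i];
  }
  t1 = omp_get_wtime();

  printf("seq = %e  par = %e  time = %g s\n", x_seq, x_par, t1 - t0);

  free(y);
  free(z);
  return 0;
}

And as the comments on the question note, a dot product is memory-bandwidth bound, so don't expect the time to drop linearly with the number of threads.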

Jonathan Dursi