
I have been trying to test my complex CSR matrix-vector code on a system with 2 CPUs having 6 cores each. Surprisingly, I get almost the same timing for 1, 2, 4, 6 or 12 threads. The code works and I can see that the corresponding threads are alive during the multiplication, but there is no speedup whatsoever. I don't understand whether I have made a mistake or whether the problem at hand simply cannot be scaled.

void spmv_csr(int num_rows, const int* rowPtrs, const int* colIdxs, const double complex* values,  const double complex* x, double complex* y)
{
  double complex rowSum;
  int i, j, row_start, row_end;
  clock_t begin, end;
  begin = clock();
  #pragma omp parallel for private(j, i, row_start, row_end) reduction(+:rowSum)
  for(i = 0; i < num_rows; i++)
  {
    rowSum = 0.00 + 0.00 *I;
    row_start = rowPtrs[i]-1;
    row_end = rowPtrs[i+1]-1;
    for (j=row_start; j<row_end; j++)
    {
        rowSum += ((creal(values[j]) * creal(x[colIdxs[j]-1])) - (cimag(values[j]) *  cimag(x[colIdxs[j]-1]))) + (((creal(values[j]) * cimag(x[colIdxs[j]-1])) +  (cimag(values[j]) * creal(x[colIdxs[j]-1]))) * I);
    }
    y[offset+i] = rowSum;
  }
  end = clock();
  printf("Time Elapsed: %f seconds\n", (double)(end - begin)/CLOCKS_PER_SEC);

}

I get around 0.38 sec for runs with 1, 2, 4, 6, 8 or 12 threads; I don't understand why I am not even seeing a 10% speedup.

Thanks for any inputs in advance.

Waltee
  • Are all threads running concurrently? I would call the omp_get_num_threads() function to find out whether the program is actually taking advantage of the multicore CPU. – Juniar Sep 26 '14 at 18:49
    Possible duplicate of a recent question that is a duplicate of a question that is a duplicate of [an older question](http://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results) - don't use `clock()`. Also, your code is memory-bound and with big matrices performance won't scale. – Hristo Iliev Sep 26 '14 at 18:52
  • @Hristo Iliev Possible, but it was asked by a different User. However you might have to redirect him, to find out if that is what he is asking. – Juniar Sep 26 '14 at 18:59
  • Hi Juniar and Hristo, thanks for your reply. Yes, all the threads are alive at the same time. I checked with omp_get_num_threads() and I can also see them in top -H while the multiplication is being done. I will use the OMP timer and see if I get the performance I am looking for. Will keep you posted. Thanks once again. – Waltee Sep 26 '14 at 19:47
    Hi guys, I tried your suggestions and timed using omp_get_wtime(). It's still the same; it doesn't seem to scale at all. I also made sure once more that the threads are alive. I'm not sure if something is blocking it, or if my mixed-language FORTRAN/C programming is causing a mess while linking. Thanks a lot for your inputs. ~Walter – Waltee Sep 27 '14 at 13:23

1 Answer


It looks like the reduction variable rowSum is becoming a point of serialization. Since rowSum isn't only accumulated as a total but is also read back every iteration (y[offset+i] = rowSum;), it has to be serialized.

If you intend rowSum to cover only a single row, I would remove the reduction and make it private instead. I would change the pragma to:

#pragma omp parallel for private(j, i, row_start, row_end, rowSum)

If you intend rowSum to be a running total across all rows, I would use the suggestion above to get parallelism, and then apply a prefix sum over y afterwards to obtain the correct totals.

user2548418