I have been testing my complex CSR matrix-vector multiplication code on a system with 2 CPUs of 6 cores each. Surprisingly, I get almost the same timing for 1, 2, 4, 6, or 12 threads. The code works, and I can see that the corresponding threads are alive during the multiplication, but there is no speedup whatsoever. I don't understand whether I have made a mistake or whether the problem at hand simply cannot be scaled.
    void spmv_csr(int num_rows, const int* rowPtrs, const int* colIdxs, const double complex* values, const double complex* x, double complex* y)
    {
        double complex rowSum;
        int i, j, row_start, row_end;
        clock_t begin, end;
        begin = clock();
        #pragma omp parallel for private(j, i, row_start, row_end) reduction(+:rowSum)
        for(i = 0; i < num_rows; i++)
        {
            rowSum = 0.00 + 0.00 *I;
            row_start = rowPtrs[i]-1;
            row_end = rowPtrs[i+1]-1;
            for (j=row_start; j<row_end; j++)
            {
                rowSum += ((creal(values[j]) * creal(x[colIdxs[j]-1])) - (cimag(values[j]) * cimag(x[colIdxs[j]-1]))) + (((creal(values[j]) * cimag(x[colIdxs[j]-1])) + (cimag(values[j]) * creal(x[colIdxs[j]-1]))) * I);
            }
            y[offset+i] = rowSum;
        }
        end = clock();
        printf("Time Elapsed: %f seconds\n", (double)(end - begin)/CLOCKS_PER_SEC);
    }
I get around 0.38 seconds when running with 1, 2, 4, 6, 8, or 12 threads. I don't understand why I am not seeing even a 10% speedup.
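One thing that may be worth ruling out in the timing above: on POSIX systems `clock()` reports the CPU time used by the whole process, summed over all threads, so it can stay roughly flat even when the wall-clock time drops. The sketch below is a minimal variant, not a definitive fix, that times the same loop with wall-clock time. The names `wall_seconds` and `spmv_csr_wall` are hypothetical and introduced only for illustration. It also assumes that the `offset` from the code above can be dropped, and it declares `rowSum` inside the loop so that each iteration gets its own copy instead of going through a `reduction` clause.

```c
#include <complex.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: wall-clock time in seconds.
 * Uses omp_get_wtime() when compiled with OpenMP, otherwise C11 timespec_get(). */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* y = A * x for a CSR matrix with 1-based rowPtrs and colIdxs,
 * matching the indexing convention of the code above.
 * Assumption: the result is written to y[i], without the original offset. */
void spmv_csr_wall(int num_rows, const int *rowPtrs, const int *colIdxs,
                   const double complex *values, const double complex *x,
                   double complex *y)
{
    int i;
    double t0 = wall_seconds();

    #pragma omp parallel for schedule(static)
    for (i = 0; i < num_rows; i++) {
        /* Declared inside the loop body, so it is private to each iteration. */
        double complex rowSum = 0.0;
        for (int j = rowPtrs[i] - 1; j < rowPtrs[i + 1] - 1; j++) {
            /* Native complex multiply replaces the hand-expanded real/imag form. */
            rowSum += values[j] * x[colIdxs[j] - 1];
        }
        y[i] = rowSum;
    }

    double t1 = wall_seconds();
    printf("Wall time elapsed: %f seconds\n", t1 - t0);
}
```

Built with something like `gcc -O2 -fopenmp` and run with different `OMP_NUM_THREADS` values, this should show whether the wall-clock time actually changes with the thread count. If it still does not, the kernel may simply be limited by memory bandwidth, which is common for sparse matrix-vector products.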
Thanks in advance for any input.