
I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than 1 thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.

void matrix() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp critical
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

I also tried using:

void matrix() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp atomic
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

But that didn't work either. I also tried it without the second #pragma, and with:

void matrixC() {
    int i, j, k, sum, np;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

I'm new to OpenMP, but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it's still getting the wrong sums.

EDIT: Here is a more complete version of the code:

#include <stdio.h>

/* N and CH are assumed to be supplied at compile time, e.g. -DN=1000 -DCH=1 */
double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;

void matrixSequential() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

void matrixParallel() {
    int i, j, k, sum;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0;
            #pragma omp parallel for shared(i,j) private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

int main(int argc, const char * argv[]) {
    // populating arrays
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i + j;
        }
    }

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i][j] = 0;
        }
    }

    if (CHOOSE == 0) {
        matrixSequential();
    }
    else if (CHOOSE == 1) {
        matrixParallel();
    }

    // checking for correctness
    double sum = 0;
    for (i = 0; i < N; i++) {
        sum += C[i][i];
    }
    printf("Sum of diagonal elements of array C: %f \n", sum);
    return 0;
}
  • What values of `N` are you using? How did you allocate your 2D arrays? Did you use malloc, or just declare them like `int A[N][N]`? Could you show a complete code example? BTW, distributing threads over the inner loop is an inefficient way to parallelize matrix multiplication. – Z boson Sep 29 '15 at 09:43
  • @Zboson I have edited my post to include full code. While testing I have been using a large value of N (over 1000). I am aware of the inefficiency but I would like to think this is still possible! – iltp38 Sep 29 '15 at 09:56
  • One of your problems may be that your arrays are `double` but you define `sum` as `int` (you do `int i,j,k,sum;`). The other problem is that floating-point arithmetic is not associative, so you can't expect the same result using multiple threads anyway. – Z boson Sep 29 '15 at 10:32
  • @Zboson Wow, good catch, I can't believe I've been overlooking that this whole time! It fixed everything; it wasn't even a problem with how I was parallelizing my loops. Thank you so much, I have no idea how much more time I would have wasted if you hadn't caught that! – iltp38 Sep 29 '15 at 10:56
  • Just to be clear: I assume you meant `matrixSequential()` disagrees with `matrixParallel()`. Since they both define `int i,j,k,sum;`, the only reason they would disagree using multiple threads is that floating-point math is not associative. Both of these functions should disagree with your final check (`//checking for correctness`), which defines `double sum`, in which case you should have realized the problem had nothing to do with threading. – Z boson Sep 30 '15 at 07:59
  • Why are you writing your own matrix multiply? Many people have spent huge amounts of time optimizing matrix-multiply code (using cache blocking, vectorization, parallelization). Use something like Intel's Math Kernel Library ( https://software.intel.com/en-us/intel-mkl/try-buy ), which is gratis! (Full disclosure: I work for Intel, but not on MKL) – Jim Cownie Sep 30 '15 at 08:32
  • @JimCownie, it's gratis for 12 months for some people. I prefer libraries which are fully open and free of cost, such as [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page). It's not as fast as MKL in every case, but it's nearly as good in most cases (which are memory-bandwidth bound) and within a factor of two or so in the worst case for compute-bound code (i.e. GEMM). But in any case, learning to write a fast GEMM routine is educational. Mine is faster than Eigen's but not as fast as MKL's. – Z boson Sep 30 '15 at 10:57
  • @Z boson Fine, my point was not so much that you should use MKL, but rather that you should use some optimized matrix library, because achieving high performance on this "simple" operation is *much* more complicated than most people expect, and if someone else has already done the hard work it pays you to use it. (Your educational point is reasonable *if* that is your objective. For many people it's not; they just want to solve their science problem.) – Jim Cownie Oct 01 '15 at 14:27
  • @JimCownie, yeah, I completely agree with your last comment. – Z boson Oct 02 '15 at 07:49

2 Answers


Making `sum` a reduction variable is the right way to do this, and it should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as `k`.

Update

After you updated your question to provide an MVCE, @Zboson found the actual bug: you were declaring the arrays as `double` but accumulating the sum as `int`.
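
For reference, here is a minimal sketch combining both fixes (a `double` accumulator plus the reduction clause), assuming the same global `A`, `B`, `C` arrays and compile-time `N` from the question:

void matrixParallel() {
    int i, j, k;
    double sum; /* must be double to match the element type of A and B */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum = 0.0;
            /* k, the parallelized loop variable, is private automatically;
               i and j, declared outside the region, are shared by default */
            #pragma omp parallel for reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}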

Davislor
  • Thanks for the response! But sadly I have tried this as well. I changed it to `#pragma omp parallel for shared(i,j) private(k) reduction(+:sum)` and I still did not get the correct sum. – iltp38 Sep 29 '15 at 09:35
  • It’s 3 in the morning or I would be trying to test this. Does the example code run correctly on your system? I wonder what the difference is. – Davislor Sep 29 '15 at 09:49
  • Some of my best work is done at 3 AM :) but some of my worst work is done at 3 AM as well, so I understand. A non-parallel version of the code works fine. Another version where I parallelize the outermost loop works fine. It's just this innermost loop that doesn't seem to want to cooperate. – iltp38 Sep 29 '15 at 10:00
  • Hmm. Can you collapse the first two loops? Might not help since you refer to i and j. – Davislor Sep 29 '15 at 10:10

IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore, the order in which you reduce an array matters. When you distribute the array elements among different threads, it changes the order of operations from the sequential sum. The same thing can happen using SIMD. See, for example, this excellent question about using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.
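
As a tiny self-contained illustration (my own example, not from the linked question), changing the grouping visibly changes the result:

#include <stdio.h>

int main(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c); /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c)); /* prints 0: c is absorbed by b */
    return 0;
}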

Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with `-Ofast`, `-ffast-math`, or `-fassociative-math` with GCC. For example, in order to auto-vectorize (SIMD) a reduction, the compiler requires associative math.

However, when you use OpenMP, it automatically assumes associative math, at least for distributing the chunks (within each chunk the compiler still won't use associative arithmetic unless you tell it to), breaking IEEE floating-point rules. Many people are not aware of this.
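
Conceptually, a `reduction(+:sum)` over two threads computes something like the following (a hypothetical sketch of the regrouping, not actual OpenMP runtime code):

/* Each thread accumulates a private partial sum over its chunk of the
   array; the partials are combined at the end. This grouping differs
   from the sequential left-to-right sum, so the result can differ too. */
double reduced_sum(const double *x, int n) {
    double partial0 = 0.0, partial1 = 0.0;
    int k;
    for (k = 0; k < n / 2; k++)  /* conceptually thread 0's chunk */
        partial0 += x[k];
    for (k = n / 2; k < n; k++)  /* conceptually thread 1's chunk */
        partial1 += x[k];
    return partial0 + partial1;
}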

Since the result of the reduction depends on the order of operations, you may be interested in a method that reduces the numerical uncertainty. One solution is Kahan summation.
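
Here is a minimal sketch of Kahan summation (note that flags like `-ffast-math` can optimize the compensation away):

/* Compensated (Kahan) summation: tracks the low-order bits lost in
   each addition and feeds them back into the next one. */
double kahan_sum(const double *x, int n) {
    double sum = 0.0;
    double c = 0.0;            /* running compensation for lost bits */
    int k;
    for (k = 0; k < n; k++) {
        double y = x[k] - c;   /* apply the correction to the next term */
        double t = sum + y;    /* low-order bits of y may be lost here */
        c = (t - sum) - y;     /* recover exactly what was lost */
        sum = t;
    }
    return sum;
}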

Z boson