
So I was learning about the basics of OpenMP in C and about work-sharing constructs, in particular the for construct. One of the most famous examples used in all the tutorials is matrix multiplication, but all of them just parallelize the outer loop or the two outer loops. I was wondering why we do not parallelize and collapse all 3 loops (using atomic), as I have done here:

    for(int i=0;i<100;i++){
        //Initialize the arrays
        for(int j=0;j<100;j++){
            A[i][j] = i;
            B[i][j] = j;
            C[i][j] = 0;       
        }       
    }

    //Starting the matrix multiplication
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for collapse(3)
        for(int i=0;i<100;i++){
            for(int j=0;j<100;j++){
                for(int k=0;k<100;k++){
                    #pragma omp atomic
                    C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
                }       
            }       
        }   
    }

Can you tell me what I am missing here, or whether and why this is an inferior or superior solution?


3 Answers


Atomic operations are very costly on most architectures compared to non-atomic ones (see here to understand why, or here for a more detailed analysis). This is especially true when many threads make concurrent accesses to the same shared memory area. To put it simply, one cause is that, on most hardware, threads performing atomic operations cannot fully run in parallel without waiting for each other, due to the implicit synchronizations and communications required by the cache-coherence protocol. Another source of slowdown is the high latency of atomic operations (again due to the cache hierarchy).
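To get a feel for the difference, here is a minimal stand-alone micro-benchmark sketch (hypothetical, just to illustrate the point; compile with -fopenmp): the first loop makes every thread hammer a single shared counter with atomic updates, while the second uses a reduction so that each thread mostly works on private data:

    #include <stdio.h>

    int main(void)
    {
        long long n = 100000000LL, slow = 0, fast = 0;

        // Every update synchronizes on the single cache line holding 'slow'.
        #pragma omp parallel for num_threads(4)
        for (long long i = 0; i < n; i++) {
            #pragma omp atomic
            slow += 1;
        }

        // Each thread accumulates into a private copy of 'fast';
        // the copies are combined only once, after the loop.
        #pragma omp parallel for num_threads(4) reduction(+:fast)
        for (long long i = 0; i < n; i++) {
            fast += 1;
        }

        printf("%lld %lld\n", slow, fast);
        return 0;
    }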

If you want to write code that scales well, you need to minimize synchronizations and communications (including atomic operations). As a result, using collapse(2) is much better than collapse(3). But this is not the only issue in your code. Indeed, to be efficient you must access memory contiguously and keep data in the caches as much as possible.
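For reference, the usual collapse(2) version of the code in the question looks roughly like this (a sketch, not taken from the original post): every (i, j) pair is handled by exactly one thread, so the updates to C[i][j] in the k loop need no atomic:

    #pragma omp parallel num_threads(4)
    {
        #pragma omp for collapse(2)
        for(int i=0;i<100;i++){
            for(int j=0;j<100;j++){
                // The whole k loop for a given (i, j) runs on a single thread,
                // so this update is race-free without any atomic.
                for(int k=0;k<100;k++){
                    C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
                }
            }
        }
    }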

For example, swapping the loop iterating over j and the one iterating over k (a swap that is not compatible with collapse(2)) is several times faster on most systems due to more contiguous memory accesses (about 8 times on my PC):

    for(int i=0;i<100;i++){
        //Initialize the arrays
        for(int j=0;j<100;j++){
            A[i][j] = i;
            B[i][j] = j;
            C[i][j] = 0;       
        }       
    }

    //Starting the matrix multiplication
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for
        for(int i=0;i<100;i++){
            for(int k=0;k<100;k++){
                for(int j=0;j<100;j++){
                    C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
                }
            }
        }
    }

Writing fast matrix-multiplication code is not easy. If your goal is to use this in production code, consider calling a BLAS library such as OpenBLAS, ATLAS, Eigen or Intel MKL rather than writing your own: such libraries are heavily optimized and often scale well on many cores. If your goal is to understand how to write efficient matrix-multiplication code, a good starting point may be to read this tutorial.
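For instance, with the CBLAS interface (provided by OpenBLAS and Intel MKL, among others) the whole triple loop becomes a single call; the sketch below assumes the matrices are stored contiguously in row-major order as n*n doubles:

    #include <cblas.h>   // e.g. from OpenBLAS; link with -lopenblas

    // Computes C = 1.0 * A * B + 0.0 * C for row-major n x n matrices.
    void matmul_blas(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,       // M, N, K
                    1.0, A, n,     // alpha, A, lda
                    B, n,          // B, ldb
                    0.0, C, n);    // beta, C, ldc
    }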

Jérôme Richard

Collapsing loops requires that you know what you are doing as it may result in very cache-unfriendly splits of the iteration space or introduce data dependencies depending on how the product of the loop counts relates to the number of threads.

Imagine the following constructed example, which is not that uncommon actually (the loop counts are small just to illustrate the point):

for (int i = 0; i < 7; i++)
  for (int j = 0; j < 3; j++)
     a[i] += b[i][j];

If you parallelise the outer loop, three threads get two iterations and one thread gets just one, but all of them do all the iterations of the inner loop:

---0-- ---1-- ---2-- -3- (thread number)
000111 222333 444555 666 (values of i)
012012 012012 012012 012 (values of j)

Each a[i] gets processed by one thread only. Smart compilers may implement the inner loop using register optimisation, accumulating the values in a register first and only assigning to a[i] at the very end, and it will run very fast.
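In other words, the compiler may effectively rewrite the loop nest like this (a sketch of the transformation, assuming a and b hold doubles, not literal compiler output):

for (int i = 0; i < 7; i++) {
  double sum = a[i];          // accumulate in a register
  for (int j = 0; j < 3; j++)
     sum += b[i][j];
  a[i] = sum;                 // a single store to a[i] at the very end
}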

If you collapse the two loops, you end up in a very different situation. Since there is now a total of 7x3 = 21 iterations, the default split (depending on the compiler and the OpenMP runtime, but most of them do this) gives five iterations to three of the threads and six iterations to the remaining one:

--0-- --1-- --2-- ---3-- (thread number)
00011 12223 33444 555666 (values of i)
01201 20120 12012 012012 (values of j)

As you can see, now a[1] is processed by both thread 0 and thread 1. Similarly, a[3] is processed by both thread 1 and thread 2. And there you have it - you just introduced a data dependency that wasn't there in the previous case, so now you have to use atomic in order to prevent data races. The price you pay for that synchronisation is way higher than doing one iteration more or less! In your case, if you only collapse the two outer loops, you won't need atomic at all (although in your particular case 4 divides 100, so even when collapsing all three loops together you don't need the atomic construct; you do need it in the general case).

Another issue is that after collapsing the loops there is a single loop index, and both the i and j indices have to be reconstructed from this new index using division and modulo operations. For simple loop bodies like yours, the overhead of reconstructing the indices may simply be too high.
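Conceptually, the collapsed version of the example above behaves like the following (a sketch of what the runtime does under the hood; start and end denote one thread's chunk of the 21 flattened iterations):

for (int idx = start; idx < end; idx++) {
  int i = idx / 3;   // integer division to recover i
  int j = idx % 3;   // modulo to recover j
  a[i] += b[i][j];
}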

Hristo Iliev

There are very few good reasons not to use a library for matrix-matrix multiplication, so, as suggested already, please call BLAS instead of writing this yourself. Nonetheless, the questions you ask are not specific to matrix-matrix multiplication, so they deserve to be answered anyway.

There are a few things that can be improved here:

  1. Use contiguous memory.
  2. If K is the innermost loop, you are doing dot-products, which are harder to vectorize. The loop order IKJ will vectorize better, for example.
  3. If you want to parallelize a dot product with OpenMP, use a reduction instead of many atomics.

I have illustrated each of these techniques independently below.

Contiguous memory

#include <stdlib.h>

// One contiguous n*n allocation instead of an array of row pointers;
// element (i,j) is accessed as C[i*n+j].
int n = 100;
double * C = malloc(n*n*sizeof(double));
for(int i=0;i<n;i++){
  for(int j=0;j<n;j++){
    C[i*n+j] = 0.0;
  }
}

IKJ loop ordering

// With j as the innermost loop, C[i][...] and B[k][...] are traversed
// contiguously, so the compiler can vectorize the inner loop.
for(int i=0;i<100;i++){
  for(int k=0;k<100;k++){
    for(int j=0;j<100;j++){
      C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
    }
  }
}

Parallel dot-product

// Each thread accumulates into a private copy of x; the copies are
// combined once at the end instead of paying for an atomic per iteration.
double x = 0;
#pragma omp parallel for reduction(+:x)
for(int k=0;k<100;k++){
  x += (A[i][k]*B[k][j]);
}
C[i][j] += x;

External resources

How to Write Fast Numerical Code: A Small Introduction covers these topics in far more detail.

BLISlab is an excellent tutorial specific to matrix-matrix multiplication that will teach you how the experts write a BLAS library.

Jeff Hammond