auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
    for (int i = 0; i < n; i++)
    {
        #pragma omp for collapse(2)
        for (int j = 0; j < n; j++)
        {
            for (int k = 0; k < n; k++)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
auto t2 = chrono::steady_clock::now();

auto t = std::chrono::duration_cast<chrono::microseconds>(t2 - t1).count();

With and without the parallelization, the variable t remains fairly constant, and I am not sure why. Once in a while t is even output as 0. One more problem I am facing is that if I increase n to something like 500, the program fails to run (here I've taken n = 100). I am using Code::Blocks with the GNU GCC compiler.

Sanit
  • This looks more favorable with parallelism on the outer loop. As others hinted, including the inner loop in the collapse may preclude important optimizations within the threads. Throwing parallelism at code without first optimizing for a single thread has limitations anyway. – tim18 May 12 '17 at 13:42

2 Answers


The proposed OpenMP parallelization is not correct and may lead to wrong results. With collapse(2), threads execute the (j,k) iterations "simultaneously". If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). Although the code is not valid OpenMP, the data race does not necessarily produce wrong results: the outcome depends on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:

#pragma omp parallel
{
    for (int i = 0; i < n; i++) {
        #pragma omp for collapse(2) reduction(+:C)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that the reduction on arrays in C is directly supported by the standard since OpenMP 4.5 (check if the compiler support it, otherwise there are old manual ways to achieve it, Reducing on array in OpenMp).

However, for the given code, it is probably more appropriate to avoid parallelizing the innermost loop, so that the reduction is not needed at all:

#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

The serial version can be faster than the OpenMP one for small matrix sizes and/or small numbers of threads. On my Intel machine, using up to 16 cores and n=1000 with the GNU compiler v6.1, the break-even point is around 4 cores when -O3 optimization is activated, and around 2 cores when compiling with -O0. For clarity, I report the timings I measured:

Serial                418020

             WRONG ORIG    +REDUCTION    OUTER.COLLAPSE    OUTER.NOCOLLAPSE
OpenMP-1        1924950       2841993           1450686             1455989
OpenMP-2         988743       2446098            747333              745830
OpenMP-4         515266       3182262            396524              387671
OpenMP-8         280285       5510023            219506              211913
OpenMP-16       2227567      10807828            150277              123368

With the reduction, the performance loss is dramatic (the speed-up is reversed). The outer parallelization (with or without collapse) is the best option.
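For reference, the OUTER.NOCOLLAPSE variant in the table is just the outermost loop parallelized on its own; a sketch (same A, B, C and n as above):

#pragma omp parallel for
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            // Each thread owns whole rows of C, so no race and no reduction.
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}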

As concerns the failure with large matrices, a possible reason is the size of the available stack. Try to enlarge both the system and the OpenMP stack sizes, e.g.:

ulimit -s unlimited
export OMP_STACKSIZE=10000000
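Alternatively, if the matrices are declared as large local arrays (e.g. double A[500][500] in main), moving them to the heap sidesteps the stack limit entirely. A minimal sketch, assuming C++11:

#include <vector>

int n = 500;
// Heap-backed n x n matrices: they consume no stack space, so large n
// no longer overflows the (thread) stacks; the A[i][k]-style indexing
// in the loops above works unchanged.
std::vector<std::vector<double>> A(n, std::vector<double>(n, 1.0));
std::vector<std::vector<double>> B(n, std::vector<double>(n, 1.0));
std::vector<std::vector<double>> C(n, std::vector<double>(n, 0.0));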
Franz
  • Thanks for the detailed answer, but the thing is that I get the correct answer even though I did not use reduction with my collapse. Also, I get that when n is much larger than the number of processors, collapse might not be needed; still, it seems to me it should give some performance gain over the serial program. Also, if I take n to be something like 500, my program crashes. What could be the reason? – Sanit May 14 '17 at 07:39
  • How do I change the sizes in C++? I can't find a way to change the OMP stack size, and I get the error 'libgomp: Thread creation failed: Resource temporarily unavailable'. – Sanit May 14 '17 at 10:35
  • Try a smaller value of OMP_STACKSIZE. On my machine the number above works but it depends on the machine. – Franz May 14 '17 at 10:45
  • Mine is a Windows machine... it doesn't recognize the OMP_STACKSIZE thing. – Sanit May 14 '17 at 11:10
  • Try to compile setting an adequate stack size (large enough to host the static memory, but not so large that resources run out when running with many threads) using the command prompt. For n=1000 I suggest: `g++ -std=c++11 -Wl,--stack,13000000 -fopenmp so.cpp`. Then try to run setting the number of threads with `set OMP_NUM_THREADS=1`, and then increase the number of threads. – Franz May 14 '17 at 22:41

The collapse clause may actually be responsible for this, because the index j has to be recreated using divide/modulo operations.

Did you try without collapse?
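For intuition, here is a rough model of what collapse(2) does to a pair of loops; this is a conceptual sketch, not the exact code GCC emits:

// The two loops are fused into one linear iteration space, and the
// original indices are recovered with divide/modulo on every iteration;
// this extra integer arithmetic can also hinder optimizations such as
// vectorization within the threads.
#pragma omp parallel for
for (long idx = 0; idx < (long)n * n; idx++) {
    int i = (int)(idx / n);  // recreated outer index
    int j = (int)(idx % n);  // recreated inner index
    for (int k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
}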

Gert Wollny
  • Your suggestion worked, but I still don't understand what's wrong with using collapse here. Also, why does my program crash if I take n to something like 500? – Sanit May 13 '17 at 21:17