
I have written a small matrix multiplication program using OpenMP. I get the best performance with 1 or 2 threads and the worst performance with 1000 threads, even though the machine has 64 processors in total.

    ~/openmp/mat_mul>  cat /proc/cpuinfo | grep processor | wc -l
    64
    ~/openmp/mat_mul> export OMP_NUM_THREADS=2
    ~/openmp/mat_mul> time ./main 
    Total threads : 2
    Master thread initializing

    real    0m1.536s
    user    0m2.728s
    sys     0m0.200s
    ~/openmp/mat_mul> export OMP_NUM_THREADS=64
    ~/openmp/mat_mul> time ./main 
    Total threads : 64
    Master thread initializing

    real    0m25.755s
    user    4m34.665s
    sys     21m5.595s

This is my code for matrix multiplication.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROW_SIZE_A 100
#define COL_SIZE_A 5000
#define COL_SIZE_B 300

int get_random();

int main(int argc, char* argv[])
{
        int a[ROW_SIZE_A][COL_SIZE_A];
        int b[COL_SIZE_A][COL_SIZE_B];
        int c[ROW_SIZE_A][COL_SIZE_B];
        int i,j,k, tid, thread_cnt;

        srand(time(NULL));

        #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
        {
                tid = omp_get_thread_num();
                if(tid == 0)
                {
                        thread_cnt = omp_get_num_threads();
                        printf("Total threads : %d\n", thread_cnt);
                        printf("Master thread initializing\n");
                }
                #pragma omp parallel for schedule(static) 
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_A; j++)
                        {
                                a[i][j] = get_random();
                        }
                }
                #pragma omp parallel for schedule(static)
                for(i=0; i<COL_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                b[i][j] = get_random();
                        }
                }
                #pragma omp parallel for schedule(static)
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                c[i][j] = 0;
                        }
                }

                #pragma omp barrier

                #pragma omp parallel for schedule(static) 
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                c[i][j] = 0;
                                for(k=0; k<COL_SIZE_A; k++)
                                {
                                        c[i][j] += a[i][k] * b[k][j];
                                }
                        }
                }

        }

        return 0;


}

Can somebody tell me why this is happening?

asit_dhal

2 Answers


Your for-loops are not properly parallelised because you are using the wrong OpenMP construct. `parallel for` is a combined directive: it both creates a new parallel region and embeds a `for` worksharing construct in it, so the iterations of the loop are distributed among the threads of that inner region. As a result, each of your 64 threads runs all of the loops in their entirety, and all of them write simultaneously over `c`. Besides producing the wrong answer, this has the catastrophic performance consequences you observed. Note also that nested regions execute serially by default, unless nested parallelism is explicitly enabled by calling `omp_set_nested(1);` or by setting the `OMP_NESTED` environment variable appropriately.
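The serial-by-default behaviour of nested regions is easy to demonstrate with a minimal sketch (separate from the OP's program): with nesting left at its default, every inner region has a team of exactly one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the team size observed inside a nested parallel region.
   With nested parallelism at its default (disabled), every inner
   region is executed by a single thread, so this returns 1. */
int nested_team_size(void)
{
    int inner = 1;   /* serial fallback when OpenMP is not enabled */
#ifdef _OPENMP
    #pragma omp parallel num_threads(4)
    {
        #pragma omp parallel   /* nested region: team of 1 by default */
        {
            #pragma omp atomic write
            inner = omp_get_num_threads();
        }
    }
#endif
    return inner;
}
```

Only after `omp_set_nested(1)` (or `OMP_NESTED=true`) would the inner regions get their own teams, which is exactly what you do not want here.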

Remove the parallel keyword from all for-loops within the parallel region:

    #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
    {
        ...
        #pragma omp parallel for schedule(static)
                    ^^^^^^^^ 
        for(i=0; i<ROW_SIZE_A; i++)
        {
           ...
        }
        ...
    }

should become

    #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
    {
        ...
        #pragma omp for schedule(static) 
        for(i=0; i<ROW_SIZE_A; i++)
        {
           ...
        }
        ...
    }

This will enable worksharing of the loop iterations between the threads of the outer region as expected.
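Put together, a corrected kernel might look like the sketch below (tiny deterministic sizes and fill values stand in for the OP's dimensions and `get_random()`, so the result can be checked):

```c
#define N 4   /* tiny stand-in for the original matrix sizes */

static int a[N][N], b[N][N], c[N][N];

/* Fills a and b, then computes c = a * b, using plain 'omp for'
   worksharing constructs inside one enclosing parallel region. */
void matmul(void)
{
    int i, j, k;
    #pragma omp parallel shared(a, b, c) private(i, j, k)
    {
        /* 'omp for' (no 'parallel') distributes the iterations
           among the threads of the enclosing region. */
        #pragma omp for schedule(static)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                a[i][j] = 1;
                b[i][j] = 2;
            }
        /* implicit barrier here: a and b are complete before use */
        #pragma omp for schedule(static)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                c[i][j] = 0;
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }
}
```

Every `c[i][j]` comes out as N * 1 * 2 = 8 here; compiled without `-fopenmp` the pragmas are ignored and the serial result is identical.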

Hristo Iliev
  • The OP could also get rid of the outer parallel region and add `private(j)` or `private(j,k)` to each `parallel for`. That would probably look like the original C code. – Z boson Dec 22 '15 at 07:51
  • Keeping several `for` constructs within a single parallel region is (kind of) faster and more importantly has some educational value. – Hristo Iliev Dec 23 '15 at 00:32
  • [In this question](http://stackoverflow.com/q/22479258/2542702) you wrote "no OpenMP runtime that exists nowadays kills the worker threads at the end of the parallel region. Rather all threads are kept in a pool and summoned (cheaply) when a new parallel region is entered. MSVC's OpenMP runtime uses the Windows native thread pool implementation, therefore maximum performance with minimum overhead" and I inferred from that that there was no advantage to putting everything in one parallel region. However, I later learned this was not true. – Z boson Dec 23 '15 at 07:55
  • I learned this in my answer [here](http://stackoverflow.com/questions/31321071/openmp-nested-for-loop-becomes-faster-when-having-parallel-before-outer-loop/31382775#31382775). It was unexpected to me but the thread pool is not necessarily as efficient as I expected. – Z boson Dec 23 '15 at 07:56
  • 1
    The devil is in the details. I wrote that the overhead is minimal, not that there is no overhead. In your answer, there are 3000 iterations of the outer loop and 50 ms total overhead from the parallel region management. That gives 17 us per region. Compare it to the time needed to spawn new threads each time and you will see what I mean. Of course, if you have many outer iterations and not much work done in the inner parallel region, the overhead is going to be relatively high. – Hristo Iliev Dec 24 '15 at 12:48

In general, your processor can only run a fixed number of threads in parallel. Increasing the number of threads beyond that number is not going to speed up your program. Indeed, the high number of threads causes considerable scheduling overhead that slows your computation to a crawl.
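One simple guard against oversubscription (a sketch; in a real OpenMP program `procs` would come from `omp_get_num_procs()`) is to clamp the requested thread count to the hardware's capacity:

```c
/* Clamp a requested thread count to the number of logical
   processors, avoiding oversubscription. In an OpenMP program,
   procs would come from omp_get_num_procs(). */
int clamped_threads(int requested, int procs)
{
    return requested < procs ? requested : procs;
}
```

For example, `omp_set_num_threads(clamped_threads(1000, omp_get_num_procs()));` would cap 1000 requested threads at the 64 the machine can actually run.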

Also remember Amdahl's law: parallelism can only improve your performance so much.
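To put numbers on that (illustrative figures, not measurements from your program): if a fraction p of the runtime is parallelisable, Amdahl's law gives a speedup of 1/((1-p) + p/n) on n threads, bounded by 1/(1-p) no matter how large n grows.

```c
/* Amdahl's law: speedup on n threads when a fraction p of the
   serial runtime is parallelisable. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, `amdahl_speedup(0.9, 64)` is only about 8.8x, and even infinitely many threads could never exceed 1/(1-0.9) = 10x.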

fuz
  • I changed the code. It now contains a barrier (to wait for the end of initialization). I have 64 processors, so I should at least get better performance with around 20 threads. But performance degrades instead. – asit_dhal Dec 21 '15 at 22:06
  • I don't expect performance to improve exactly according to Amdahl's law; I expect a little bit of improvement. – asit_dhal Dec 21 '15 at 22:14