
I wrote a simple code snippet where the workload is very different for every thread: some threads need to calculate several hundred iterations, while others need just one iteration to get the desired result:

for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        complex<float> c((float)j/width - 1.5, (float)i/height - 0.5);
        complex<float> z(0, 0);

        // Points inside the set iterate up to MAX_IT times; points far
        // outside escape after a single iteration.
        int count = 0;
        while (abs(z) < 2 && count < MAX_IT) {
            z = z*z + c;
            ++count;
        }
        image[i][j] = count;
    }
}

With `lscpu` I check how many cores, threads per core, and cores per socket are available. Now I want to parallelize this snippet with OpenMP in a way that is aware of the CPU topology.

One possibility is to define environment variables such as

OMP_PLACES='threads(12)'
OMP_PLACES='cores(4)'
OMP_PLACES='sockets(2)'
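
As far as I understand, these would be combined with `OMP_PROC_BIND` and `OMP_NUM_THREADS` at launch time, something like this (the binary name is just a placeholder):

    OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=4 ./mandelbrot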

And there is the option of processor binding, e.g.

#pragma omp parallel proc_bind(master|close|spread)
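
The clause attaches to the parallel region, so applied to my loop it would presumably look like this (just a sketch; I don't know which binding policy is the right one here):

    #pragma omp parallel for proc_bind(spread)
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            // ... same per-pixel work as above ...
        }
    }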

I cannot figure out how to use them correctly (other than by trial and error). Does somebody have experience with this?

Thank you

Suslik
  • What do you need exactly? This is a little bit unclear. – Matthieu Brucher Jan 13 '19 at 12:37
  • It is also unclear to me what you are trying to do. Why doesn't a simple `#pragma omp parallel for collapse(2)` work for you? Why do you think working with bindings will improve performance? – Richard Jan 13 '19 at 12:53
  • If you deal with different workloads across threads, the general solution is to use dynamic scheduling: `#pragma omp parallel for schedule(dynamic)`. This SO answer explains the benefit of dynamic scheduling clearly: https://stackoverflow.com/questions/10850155/whats-the-difference-between-static-and-dynamic-schedule-in-openmp – Alain Merigot Jan 13 '19 at 12:57
  • This was an exam question. We had to look up the number of sockets/cores/threads and then we had to parallelize the snippet. During the exam I played with `schedule(dynamic)` and in the end I used chunk size 4. But I cannot understand the theoretical background for why this performed better. – Suslik Jan 13 '19 at 13:09
  • If you have hyperthreading active, you might see some advantage from setting OMP_PLACES=cores if your implementation supports it but doesn't make it the default, so as to spread your work out across cores. You would probably adjust OMP_NUM_THREADS to the number of cores. As you find OMP_DYNAMIC useful, you may wish to try OMP_PROC_BIND=false, as setting OMP_PLACES would typically turn on PROC_BIND. By reading the docs for the individual implementations, you will see that the defaults vary. With libgomp, you may get a warning if you choose a facility not supported on your target. – tim18 Jan 13 '19 at 13:27
  • Thank you. I still don't really understand how I should choose the chunk size when I use `schedule(dynamic)`. Is there a connection to the CPU topology? Until now I have just tested which chunk size gives the best result. – Suslik Jan 13 '19 at 14:23
  • There are two separate questions here - about load balancing the code and the topology binding. That's fine in an exam question, but for StackOverflow please focus on one actual question per posted question. Besides, I always find it odd - why don't you ask the professor first? He has much more context than we do. – Zulan Jan 13 '19 at 15:15
  • That's a Mandelbrot set generator. Usually it's parallelized by distributing rows to different CPUs/cores. What do you mean by chunk size = 4: 4 rows or 4 points? That is, have you parallelized both loops or only the outer one? Including the OpenMP directives you used would probably help. – Sigi Jan 14 '19 at 08:30
  • I now got a good answer from the assistant; maybe it helps other people who are interested: `schedule(dynamic)` was correct because of the unbalanced workload. There was no need to change the chunk size in the dynamic case (see the sketch after these comments). – Suslik Jan 14 '19 at 11:22
  • Alternatively, one can align the array to the cache line and then use, for example, chunk size 16 with `schedule(static,16)` when working with `int` elements. – Suslik Jan 14 '19 at 11:44
  • There is also a good explanation here: https://stackoverflow.com/questions/45032586/false-sharing-in-openmp-loop-array-access – Suslik Jan 14 '19 at 13:14
  • I once created pictures to explain this: https://stackoverflow.com/a/41517640/2899001 – Ruslan Dautov Jan 22 '19 at 02:26
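
For readers who land here: below is a minimal, self-contained sketch of the approach the comments converge on, i.e. dynamic scheduling over rows. The image dimensions and MAX_IT are illustrative values, not from the original post. Compile with `-fopenmp`.

    #include <complex>
    #include <vector>
    using std::complex;

    int main() {
        const int width = 1024, height = 768, MAX_IT = 1000;  // illustrative values
        std::vector<std::vector<int>> image(height, std::vector<int>(width));

        // Rows crossing the set body run up to MAX_IT iterations per pixel,
        // while rows far outside escape almost immediately, so a static split
        // leaves threads idle; schedule(dynamic) hands out rows on demand.
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < height; i++) {
            for (int j = 0; j < width; j++) {
                complex<float> c((float)j/width - 1.5f, (float)i/height - 0.5f);
                complex<float> z(0, 0);
                int count = 0;
                while (abs(z) < 2 && count < MAX_IT) {
                    z = z*z + c;
                    ++count;
                }
                image[i][j] = count;
            }
        }
        return 0;
    }

Since each thread here writes whole rows of `image`, false sharing can only occur near row boundaries; the `schedule(static,16)` variant with cache-line alignment mentioned in the comments addresses the same concern when iterations are distributed at a finer granularity.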

0 Answers