
I am running a for loop with OpenMP using dynamic load balancing. I'd like to print how many tasks/iterations each thread processed at the end of the program. The loop looks like this:

chunk = 1;
#pragma omp parallel for schedule(dynamic,chunk) private(i)
for (i = 0; i < n; i++) {
    // loop code
}
Marouen

3 Answers


Nothing easier. Just split the combined parallel for directive into two separate constructs, which allows you to add extra code before and after the loop:

#pragma omp parallel
{
   int iters = 0;
   #pragma omp for schedule(dynamic,chunk)
   for (int i = 0; i < n; i++) {
      ...
      iters++;
   }
   #pragma omp critical
   printf("Thread %d did %d iterations\n", omp_get_thread_num(), iters);
}
Hristo Iliev
  • Shouldn't you have `#pragma omp critical` before the `printf`? Also the OP said "at the end of the program" so the OP may have meant he/she wants to save the results and print them at a later time outside of the parallel region. – Z boson Jan 12 '17 at 14:54
  • "Nothing easier." `threadprivate` is easier and also does not require changing the structure of the code. – Z boson Jan 12 '17 at 15:44
  • `threadprivate` requires a static/global variable, which opens yet another can of worms. – Hristo Iliev Jan 12 '17 at 15:55
  • Shouldn't you have `#pragma omp critical` before the `printf`? – Z boson Jan 12 '17 at 21:51
  • I probably should... But the idea here is to show the general working of things. I'm less and less convinced that spoon-feeding people the complete solutions is the proper way to teach them (I know, SO is not a school/university...) – Hristo Iliev Jan 12 '17 at 23:18
  • Can you give some examples of why using `threadprivate` in this case opens another can of worms? What kind of problems can it cause? – Z boson Jan 12 '17 at 23:32
  • In this particular case - no. But using static and global variables in multithreaded programs is a bad practice in general and one has to be really careful to avoid possible data races. Also, using thread-private variables results in non-portable performance. For example, the ELF TLS based thread-private implementation of GCC is fast on real ELF systems and terribly slow on macOS, where it is emulated. But then, ELF TLS does not (used to not?) support constructors/destructors and GCC cannot make non-POD C++ classes thread-private in OpenMP. That's why I try to avoid thread-private solutions. – Hristo Iliev Jan 13 '17 at 10:00
  • Is TLS slow on macOS because of how it's implemented in GCC or because of a limitation of Mach-O? Do you have a link about this to recommend? – Z boson Jan 14 '17 at 17:20
  • I've seen the assembly code. On ELF platforms, FS is loaded with a selector that points to the beginning of the TCB, the first element of which is the thread pointer. The TLS is then at a fixed offset from the thread pointer. On GNU systems, the thread pointer points to the TCB itself, therefore accessing the TLS is just `[FS]:const_offset`. On Mach-O this gets emulated for some reason, at least by GCC. The proper way to handle TLS there is [to use pthread_getspecific()](http://lifecs.likai.org/2010/05/mac-os-x-thread-local-storage.html). – Hristo Iliev Jan 14 '17 at 17:50

If you truly want to print the number of iterations at the end of your program, outside of the parallel region and the rest of your code (and to avoid false sharing), the simple solution is to use `threadprivate`.

#include <stdio.h>
#include <omp.h>

int iters;
#pragma omp threadprivate(iters)

int main(void) {
  omp_set_dynamic(0); //Explicitly turn off dynamic threads
  int i;
  int n = 10000;
  #pragma omp parallel for schedule(dynamic)
  for(i=0; i<n; i++) {
    iters++;
  }
  #pragma omp parallel
  #pragma omp critical
  printf("Thread %d did %d iterations\n", omp_get_thread_num(), iters);
}

Here is a more complicated solution, which also requires you to change the structure of your code.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
  int i;
  int n = 100;
  int nthreads;
  int *aiters;
  #pragma omp parallel
  {
    #pragma omp single
    {
      nthreads = omp_get_num_threads();
      aiters = malloc(sizeof *aiters * nthreads);
    }
    int iters = 0;
    #pragma omp for schedule(dynamic)
    for(i=0; i<n; i++) {
      iters++;
    }
    aiters[omp_get_thread_num()]=iters;
  }
  for(i=0; i<nthreads; i++)
    printf("Thread %d did %d iterations\n", i, aiters[i]);
  free(aiters);
}
Z boson
  • There is no guarantee that the two parallel regions in your first sample will execute with the same number of threads. _dyn-var_ must be explicitly set to _false_. – Hristo Iliev Jan 12 '17 at 16:08
  • @HristoIliev, I fixed it I think using `omp_set_dynamic(0)`. I was not aware that number of threads could dynamically change between parallel regions without explicitly telling OpenMP to do this. That may be true in principle but I don't know if I have ever seen it in practice. Have you? – Z boson Jan 12 '17 at 21:50
  • @HristoIliev, how come you did not point that out [here](http://stackoverflow.com/questions/18719257/parallel-cumulative-prefix-sums-in-openmp-communicating-values-between-thread#comment27598031_18719257) "schedule(static) has special properties guaranteed by the standard like repeatable distribution pattern". I can't rely on schedule(static) having the same repeatable distribution if the number of threads dynamical changes between parallel regions. – Z boson Jan 12 '17 at 22:07
  • I've overlooked it or was less aware at that time. – Hristo Iliev Jan 12 '17 at 23:16
  • @HristoIliev, well in any case thank you for correcting me now. I'm embarrassed that I did not know of dynamic teams by now. I noticed there are several answers/questions about dynamic teams on SO. It's my first time answering an OpenMP question on SO in months and I already learned something. – Z boson Jan 12 '17 at 23:30
  • 1
    I've learned so much myself answering questions here. Just keep the OpenMP specs open while writing the answer :) – Hristo Iliev Jan 13 '17 at 10:04

Use a private counter for each thread; there is no other way around it.

Something like:

int workload[n_threads];
memset(workload, 0, sizeof workload); // a VLA cannot take an initializer; zero it explicitly
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i++) {
    //loop code
    workload[omp_get_thread_num()]++;
}
Armen Avetisyan