This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
contains a race-condition during the updates of the array cnt
, to solve it you need to guarantee mutual exclusion of those updates. That can be achieved with (for instance) #pragma omp atomic update but as already pointed out in the comments:
However, this resolves just correctness and may be terribly
inefficient due to heavy cache contention and synchronization needs
(including false sharing). The only solution then is to have each
thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread, and at end of the parallel region you perform the manual reduction of all those arrays into one. An example of such approach can be found here.
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicate pragma, namely:
#pragma omp parallel for reduction(+:cnt)
You can have look at this example on how to apply that feature.
Worth mentioning that regarding the reduction of arrays versus the atomic approach as kindly point out by @Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic based
solution could be faster in this specific case regarding the platform
and if the values are not conflicting). So that is m << n. –
As always profiling is the key!; Hence, you should test your code with aforementioned approaches to find out which one is the most efficient.