
I have the following example code:

!$omp threadprivate(var)
!$omp parallel do reduction(+:var)
do i = 1, n
    var = var + complicated_floating_point_computation()
end do
!$omp end parallel do
print *,var

And I get slightly different results for var per run, even when I use the same number of threads. I tried adding the OpenMP order(reproducible:concurrent) clause, but got the following compile error: Error: threadprivate variable 'var' used in a region with 'order(concurrent)' clause.

Is there any way to use reduction and still maintain floating point reproducibility across runs with the same number of threads?

nadavhalahmi
  • If you want to exploit parallelism via an OMP reduction, you have to not care in which order the numbers are added up, which means you will get variation in the results. If you do care, you will have to find a way of ordering the additions; while in theory an implementation of an OMP reduction might give you this, in practice I suspect none actually does, for performance reasons. You will likely have to spin your own (a minimal sketch of one way to do that follows these comments). – Ian Bush Jan 23 '23 at 15:39
  • Does this answer your question? [Float related numerical stability issues for parallel reduction](https://stackoverflow.com/questions/58110473) , [Why are OpenMP Reduction Clauses Non-deterministic for Statically Scheduled Loops?](https://stackoverflow.com/questions/71123704) and [Is floating point math broken?](https://stackoverflow.com/questions/588004) as well as [Why is this OpenMP program giving me different answers every time?](https://stackoverflow.com/questions/33193620) . These are certainly not the only ones. Please read past answers before posting new ones. – Jérôme Richard Jan 23 '23 at 16:06
  • Assuming there is no race condition in "complicated_floating_point_computation()", an explanation could be floating point overflow with var. You have not indicated the DO cycle count or the precision of var. This can be a common problem with real*4 :: var and a large cycle count. Actually, using reduction can mitigate this overflow issue, but the only practical solution is a higher precision accumulator. – johncampbell Jan 27 '23 at 02:15
  • @johncampbell no need for overflow; non-reproducibility of floating point computations is inherent to reductions – PierU Jan 27 '23 at 07:39
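A minimal sketch of such a hand-rolled, order-deterministic reduction (the names n, partial and compute() are placeholders, not from the question): each thread accumulates into its own slot of a shared array under a static schedule, and the per-thread partial sums are then combined serially in thread-id order, so the result depends only on the number of threads, not on scheduling.

```fortran
program ordered_reduction
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: partial(:)
   real(8) :: var
   integer :: i, tid, nthreads

   ! one accumulator slot per possible thread
   nthreads = omp_get_max_threads()
   allocate(partial(0:nthreads-1))
   partial = 0.0d0

   !$omp parallel private(tid)
   tid = omp_get_thread_num()
   ! static schedule: each thread gets the same iterations on every run
   !$omp do schedule(static)
   do i = 1, n
      partial(tid) = partial(tid) + compute(i)
   end do
   !$omp end do
   !$omp end parallel

   ! combine the per-thread partial sums in a fixed (thread-id) order
   var = 0.0d0
   do tid = 0, nthreads - 1
      var = var + partial(tid)
   end do
   print *, var

contains

   pure real(8) function compute(i)
      integer, intent(in) :: i
      compute = 1.0d0 / real(i, 8)   ! stand-in for the actual computation
   end function compute

end program ordered_reduction
```

With this scheme the additions happen in the same order for a given thread count, at the cost of a small serial combine step and possible false sharing on the partial array.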

1 Answer


If your computation is considerably more expensive than the addition reduction, you could create an array with the computation results, and sum those sequentially.
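A minimal sketch of that approach, assuming a fixed iteration count n and a placeholder compute() function (neither appears in the question): the expensive evaluations run in parallel, each writing its own array element, and the additions are then done serially in a fixed order, which makes the sum reproducible regardless of the number of threads.

```fortran
program array_then_sum
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: results(:)
   real(8) :: var
   integer :: i

   allocate(results(n))

   ! parallel part: every iteration writes its own element, so no reduction is needed
   !$omp parallel do
   do i = 1, n
      results(i) = compute(i)
   end do
   !$omp end parallel do

   ! serial part: the additions happen in the same order on every run
   var = 0.0d0
   do i = 1, n
      var = var + results(i)
   end do
   print *, var

contains

   pure real(8) function compute(i)
      integer, intent(in) :: i
      compute = 1.0d0 / real(i, 8)   ! stand-in for the expensive computation
   end function compute

end program array_then_sum
```

The trade-off is the extra memory for the results array and the serial summation, which only pays off when compute() dominates the run time, as assumed above.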

Otherwise, differing results are an intrinsic side-effect of parallelism. Accept that for what it is, use a stable algorithm so that it doesn't matter, or use ensembles of sorts to get a statistically meaningful result.

Victor Eijkhout