
I was writing a simple subroutine in Fortran for the dot product of vectors using an OpenMP reduction for parallelization. However, its result is significantly different from both `dot_product` and a non-OpenMP do-loop summation. Please see the code below:

program main
  use iso_fortran_env
  implicit none
  real(real64), allocatable  :: x(:)
  real(real64)               :: x_dot
  integer(int32)             :: i, n
  n = 1000000
  allocate(x(n))
  x = 0.3_real64
  print *, "sum:     ", dot_product(x,x)
  x_dot = 0.0_real64
  do i = 1, n
    x_dot = x_dot + x(i)*x(i)
  enddo
  print *, "loop:    ", x_dot
  x_dot = 0.0_real64
  !$omp parallel do private(i) reduction(+:x_dot)
  do i = 1, n
    x_dot = x_dot + x(i)*x(i)
  enddo
  !$omp end parallel do
  print *,"omp red: ", x_dot
end program main

Output generated:

sum:        89999.999997827967
loop:       89999.999997827967
omp red:    90000.000000129017

As you can see, with the OpenMP reduction the result differs at roughly the single-precision level. Am I hitting a race condition somewhere?

gfortran 7.5

compilation flags: -O3 -fopenmp

OMP_NUM_THREADS=4

ipcamit
    Why do you think those three things should give the same answer? – francescalus Mar 04 '21 at 15:13
    @veryreverie Thank you for pointing it out, I had real64 assignment in original file but it did not make any difference. Updated to reflect the same. – ipcamit Mar 04 '21 at 16:00
    @francescalus So you mean just by using OpenMP I would be introducing an error of 2.3*10^-6 in my calculations? I would expect the difference to be at least <1E-6, as that's a very common tolerance criterion in calculations. Would this not render OpenMP dubious for most scientific uses? – ipcamit Mar 04 '21 at 16:01
    It's no different from getting a different result by changing the order in which elements of an expression are added (which is essentially what is happening), but there's also no reason to expect that `dot_product` and your (serial or parallel) loop should give the same answer. That's a long way from being "dubious": you need to be aware of numerical issues whenever doing numerical computation. – francescalus Mar 04 '21 at 16:07
    (Regarding `dot_product`, see for example [this other question](https://stackoverflow.com/q/58257412/3157076). And [this other question](https://stackoverflow.com/q/20993327/3157076) for OpenMP reductions.) – francescalus Mar 04 '21 at 16:09
    If you want to explore this more closely, you can split your serial loop into four different loops which each sum of quarter of the array, then add those partial sums together. You should see a similar difference. – francescalus Mar 04 '21 at 16:23
  • Chances are that the parallel reduction actually gives better results. At least if the data is somewhat homogeneously distributed. – paleonix Mar 08 '21 at 23:09
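The experiment suggested in the comments can be sketched as follows: split the serial loop into four partial sums and combine them at the end, which mirrors what a four-thread `reduction(+:x_dot)` effectively does. This is a minimal sketch (the chunking assumes `n` is divisible by 4); the small discrepancy between the two printed values comes purely from the changed summation order, with no threading involved:

```fortran
program partial_sums
  use iso_fortran_env
  implicit none
  real(real64), allocatable :: x(:)
  real(real64)              :: s_serial, s_part(4)
  integer(int32)            :: i, t, n, chunk
  n = 1000000
  allocate(x(n))
  x = 0.3_real64
  ! Plain serial sum, identical to the loop in the question.
  s_serial = 0.0_real64
  do i = 1, n
    s_serial = s_serial + x(i)*x(i)
  end do
  ! Four independent partial sums, combined afterwards: the same
  ! regrouping of additions a four-thread reduction performs.
  chunk = n / 4
  s_part = 0.0_real64
  do t = 1, 4
    do i = (t-1)*chunk + 1, t*chunk
      s_part(t) = s_part(t) + x(i)*x(i)
    end do
  end do
  print *, "serial:  ", s_serial
  print *, "4 parts: ", sum(s_part)
end program partial_sums
```

Because floating-point addition is not associative, regrouping a million additions of `0.09` into four blocks accumulates rounding differently, so the two lines differ by an amount comparable to the serial-vs-OpenMP gap in the question.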

0 Answers