I was writing a simple Fortran routine that computes the dot product of a vector with itself, parallelized with an OpenMP reduction. However, its result is significantly different from both the intrinsic dot_product and a non-OpenMP do-loop summation. Please see the code below:
program main
   use iso_fortran_env
   implicit none
   real(real64), allocatable :: x(:)
   real(real64) :: x_dot
   integer(int32) :: i, n

   n = 1000000
   allocate(x(n))
   x = 0.3_real64

   ! Reference: intrinsic dot_product
   print *, "sum: ", dot_product(x, x)

   ! Serial do-loop summation
   x_dot = 0.0_real64
   do i = 1, n
      x_dot = x_dot + x(i)*x(i)
   end do
   print *, "loop: ", x_dot

   ! OpenMP reduction
   x_dot = 0.0_real64
   !$omp parallel do private(i) reduction(+:x_dot)
   do i = 1, n
      x_dot = x_dot + x(i)*x(i)
   end do
   !$omp end parallel do
   print *, "omp red: ", x_dot
end program main
Generated output:
sum: 89999.999997827967
loop: 89999.999997827967
omp red: 90000.000000129017
As you can see, the OpenMP reduction result differs from the other two at roughly the single-precision level (about 1e-7 relative). Am I hitting a race condition somewhere?
gfortran 7.5
compilation flags: -O3 -fopenmp
OMP_NUM_THREADS=4
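One thing I wondered about is whether summation order alone could explain this. As a sanity check (a sketch, not the actual OpenMP runtime behavior — I am assuming the default static schedule gives each of the 4 threads one contiguous chunk of n/4 iterations), I simulated the two summation orders in Python, whose floats are the same IEEE 754 binary64 as real(real64):

```python
# Does summation order alone reproduce the discrepancy?
n = 1_000_000
v = 0.3 * 0.3  # one term x(i)*x(i), rounded to binary64

# Serial order: one running sum, like the plain do loop.
serial = 0.0
for _ in range(n):
    serial += v

# Reduction-style order (assumed): 4 partial sums, one per thread,
# combined at the end. Here every chunk has n/4 identical terms,
# so all partial sums are equal and the combine order is irrelevant.
chunk = 0.0
for _ in range(n // 4):
    chunk += v
reduced = ((chunk + chunk) + chunk) + chunk

print("serial :", serial)
print("reduced:", reduced)
print("equal? :", serial == reduced)

# The underlying reason order matters: float addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

On my machine the two orders do produce different last bits, which looks a lot like the difference I see between the loop and the OpenMP reduction — but I would like confirmation that this is expected behavior and not a race.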