
I've been fighting for a couple of days with an MPI_REDUCE call in a gfortran atmospheric transport model: the input parameters are good, but the master's recvbuf comes back with highly unreasonable results. I've been able to replicate the problem in a simple example, as follows:

PROGRAM TEST

    USE mpi

    IMPLICIT NONE

    INTEGER my_rank, size, ierror
    INTEGER, PARAMETER :: nx=1000, ny=1000, nz=6
    INTEGER :: buffsize

    REAL, DIMENSION(nx, ny, nz) :: u, v

    call MPI_INIT(ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierror)

    PRINT *, 'my_rank, size: ', my_rank, size

    buffsize = nx*ny*nz

    u = my_rank + 1

    PRINT *, 'PE: ', my_rank, ', Before reduce, SHAPE(u): ', SHAPE(u)
    PRINT *, 'PE: ', my_rank, ', Before reduce, SUM(u): ', SUM(u)

    CALL MPI_REDUCE(u, v, buffsize, MPI_REAL, &
&                   MPI_SUM, 0, MPI_COMM_WORLD, ierror)

    CALL MPI_BARRIER(MPI_COMM_WORLD, ierror)

    PRINT *, 'PE: ', my_rank, ', after reduce, ierror: ', ierror
    PRINT *, 'PE: ', my_rank, ', after reduce, SUM(u): ', SUM(u)
    PRINT *, 'PE: ', my_rank, ', after reduce, SUM(v): ', SUM(v)

    CALL MPI_FINALIZE(ierror)

END PROGRAM test

It returns:

mpirun -np 2 ./test3
 my_rank, size:            0           2
 my_rank, size:            1           2
 PE:            1 , Before reduce, SHAPE(u):         1000        1000           6
 PE:            0 , Before reduce, SHAPE(u):         1000        1000           6
 PE:            0 , Before reduce, SUM(u):    6000000.00    
 PE:            1 , Before reduce, SUM(u):    12000000.0    
 PE:            0 , after reduce, ierror:            0
 PE:            1 , after reduce, ierror:            0
 PE:            1 , after reduce, SUM(u):    12000000.0    
 PE:            0 , after reduce, SUM(u):    6000000.00    
 PE:            1 , after reduce, SUM(v):    0.00000000    
 PE:            0 , after reduce, SUM(v):    18407592.0    

PE0 "should" be showing 18000000.0 as the SUM(v) in the last line.

If I set the nz parameter in the code from 6 to 5, the run produces correct results. What's really confusing is that it misbehaves identically, returning the very same wrong sum, on a) an AWS EC2 instance with gfortran 5.3 and Open MPI, b) my laptop's gfortran 5.4 with MPICH, and c) a workstation's gfortran 4.4 with Open MPI.

If I change the type of the array to DOUBLE PRECISION (as well as specifying that in the MPI_REDUCE call), it works fine, even for much larger arrays. If I use REAL*4 rather than REAL, it produces the same bad results.
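
To be concrete, the DOUBLE PRECISION variant changes only the declaration and the datatype argument of the reduce call in the program above; a sketch of the two changed pieces (MPI_DOUBLE_PRECISION is the predefined MPI datatype matching Fortran DOUBLE PRECISION):

    DOUBLE PRECISION, DIMENSION(nx, ny, nz) :: u, v

    ! ... same setup as before ...

    CALL MPI_REDUCE(u, v, buffsize, MPI_DOUBLE_PRECISION, &
&                   MPI_SUM, 0, MPI_COMM_WORLD, ierror)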

I know this has to be simple and that I'm being a real idiot here, but I'm just not understanding this. I've read some suggestions that my buffer size needs to be an integer value less than 2^31-1, but that's certainly the case here.


1 Answer


This has nothing to do with MPI; it is just a single-precision summation issue:

PROGRAM TEST
    IMPLICIT NONE
    INTEGER, PARAMETER :: nx=1000, ny=1000, nz=6
    REAL, DIMENSION(nx, ny, nz) :: u
    u = 3
    PRINT *, SUM(u)
END PROGRAM test

It returns the same result. If you add a small number to a large number, the small number can be partly or wholly lost to rounding, and in a summation of many small numbers this effect can accumulate into a significant error. Here the running sum eventually exceeds 2^24 = 16777216, beyond which a single-precision REAL can no longer represent every integer, so each further addition of 3.0 gets rounded. (That is also why nz=5 works: the total of 15000000 stays below 2^24.) There are summation algorithms that prevent this effect, such as Kahan summation; apparently gfortran's SUM is not implemented that way.
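
As an illustration, here is a minimal sketch of compensated (Kahan) summation over the same array; the loop structure and program name are just for the example, and with gfortran it should be built without -ffast-math, which is allowed to optimize the compensation away:

PROGRAM KAHAN
    IMPLICIT NONE
    INTEGER, PARAMETER :: nx=1000, ny=1000, nz=6
    REAL, DIMENSION(nx, ny, nz) :: u
    REAL :: s, c, y, t
    INTEGER :: i, j, k
    u = 3
    s = 0.0    ! running sum
    c = 0.0    ! running compensation for lost low-order bits
    DO k = 1, nz
        DO j = 1, ny
            DO i = 1, nx
                y = u(i, j, k) - c   ! apply the correction to the next addend
                t = s + y            ! big + small: low-order bits of y are lost here
                c = (t - s) - y      ! recover the lost part for the next step
                s = t
            END DO
        END DO
    END DO
    PRINT *, 'Naive SUM: ', SUM(u)   ! 18407592.0 in single precision
    PRINT *, 'Kahan sum: ', s        ! 18000000.0
END PROGRAM KAHAN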

    Fortran indeed doesn't specify how `sum` works, merely that the result "has a value equal to a processor-dependent approximation to the sum". Different compilers treat this as a quality-of-implementation issue, as can be seen in [this other question](https://stackoverflow.com/q/25316371). – francescalus Apr 26 '17 at 14:44
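
Following up on that comment: when the accumulation precision matters, a portable workaround (a sketch; the program name is just for the example) is to convert the array to double precision before summing, so the accumulator has enough mantissa bits to hold 18000000 exactly:

PROGRAM SUMDP
    IMPLICIT NONE
    INTEGER, PARAMETER :: nx=1000, ny=1000, nz=6
    REAL, DIMENSION(nx, ny, nz) :: u
    u = 3
    ! Element-wise conversion to DOUBLE PRECISION before the reduction,
    ! so SUM accumulates in double precision and prints 18000000.0.
    PRINT *, SUM(REAL(u, KIND(1.0d0)))
END PROGRAM SUMDP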