
I have a parallel section of the code where I write out n large arrays (representing a numerical mesh) in blocks that are later read back in blocks of a different size. To do this I used stream access so that each processor writes its block independently, but I've seen inconsistent timings for this section, ranging from 0.5 to 4 seconds, when testing with 2 processor groups.

I am aware you can do something similar with MPI-IO, but I'm not sure what the benefits would be since there is no synchronization necessary. I would like to know if there is a way to either improve performance of my writes, or if there is a reason MPI-IO would be a better choice for this section.

Here is a sample of the code section where I create the files to write norb arrays using two groups (mygroup = 0 or 1):

do irbsic=1,norb
  [various operations]

  blocksize=int(nmsh_tot/ngroups)
  OPEN(unit=iunit,FILE='ZPOT',STATUS='UNKNOWN',ACCESS='STREAM')
  mypos = 1 + (IRBSIC-1)*nmsh_tot*8     ! starting point for writing IRBSIC
  mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
  WRITE(iunit,POS=mypos) POT(1:nmsh)  
  CLOSE(iunit)

  OPEN(unit=iunit,FILE='RHOI',STATUS='UNKNOWN',ACCESS='STREAM')
  mypos = 1 + (IRBSIC-1)*nmsh_tot*8     ! starting point for writing IRBSIC
  mypos = mypos + mygroup*(8*blocksize) ! starting point for mesh group
  WRITE(iunit,POS=mypos) RHOG(1:nmsh,1,1)
  CLOSE(iunit)

  [various operations]
end do
Carlos
  • Is more than one process writing to a given file? If so I STRONGLY recommend MPI I/O - if you do not you may get incorrect results, a nasty problem I have experienced – Ian Bush Sep 04 '20 at 18:31
  • If you are writing to different files, which means you have different unit numbers, then you may be able to use `ASYNCHRONOUS="YES"`. Your program won't wait for completion of the I/O, as it has handed the I/O off to the operating system, and you're then constrained by the filesystem (see the sketch just after these comments). – evets Sep 04 '20 at 19:11
  • BTW, why compute `mypos` twice? And is `IRBSIC` supposed to be the do-loop index `iorbsrc`? – evets Sep 04 '20 at 19:14
  • @IanBush Yes, multiple processes writing to a single file, but each is writing a different part of the file. Does opening the same file still conflict somehow? – Carlos Sep 04 '20 at 20:45
  • @evets I computed mypos twice because I was afraid the previous write might increment it, but I see now that's not the case. Yes, irbsic is the loop index; that was a typo while simplifying the code. I will try the ASYNCHRONOUS flag, thanks for the suggestion. Another thought: do you think it would help to move the OPEN and CLOSE statements outside of the loop? Will a CLOSE statement still flush with the ASYNCHRONOUS flag? – Carlos Sep 04 '20 at 20:45
  • Fortran I/O is not guaranteed to work if more than one process is writing to a file - this is not just a theoretical standard violation; I have seen it fail, producing files partially filled with unreadable values. To quote a Cray engineer, "the only sensible, portable way for more than one process to write to a file is via MPI I/O" – Ian Bush Sep 04 '20 at 20:52
  • I agree with @IanBush, here. You're opening yourself up to all sorts of race conditions or locking issues (that are outside of the Fortran standard). I don't know your exact situation, but I would have each process write to its own files, and then if needed, merge those files at the end of the computation – evets Sep 04 '20 at 22:03
  • Each process writing its own file has its own problems - if you have thousands of processes all hitting the file system for different files you can overwhelm any metadata server that might be in use, and the simple problem of handling so many files should not be dismissed. Given that MPI_FILE_WRITE_AT (or one of its variants) does more or less exactly what the OP wants, that's where I would start – Ian Bush Sep 05 '20 at 07:08
  • Or alternatively (and possibly better) look into HDF5 or NetCDF – Ian Bush Sep 05 '20 at 07:09
  • Collective I/O (such as `MPI_File_read_all()`) should be used whenever possible instead of non collective I/O (such as `MPI_File_read()`) – Gilles Gouaillardet Sep 05 '20 at 08:03
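
To illustrate the `ASYNCHRONOUS="YES"` suggestion in the comments above, here is a minimal, hedged sketch (the file name, array and sizes are made up for illustration). Both the OPEN and the WRITE carry the ASYNCHRONOUS specifier, and a WAIT (or the CLOSE) blocks until the transfer has completed. Whether the I/O actually proceeds asynchronously depends on the compiler and the filesystem, and this does nothing to address the separate problem, discussed above and in the answer below, of several processes writing to the same file.

Program async_write_sketch

  Implicit None

  Integer, Parameter :: n = 1024

  Real, Dimension( 1:n ) :: buf

  Integer :: iunit
  Integer :: id

  buf = 1.0

  ! The unit itself must be opened for asynchronous I/O
  Open( newunit = iunit, file = 'block.dat', access = 'stream', &
        asynchronous = 'YES', status = 'replace' )

  ! Start the transfer; control may return before the data reach the disk
  Write( iunit, asynchronous = 'YES', id = id ) buf

  ! ... other work can overlap with the I/O here, but buf must not be
  ! modified until the transfer has completed ...

  ! Block until this particular transfer has finished
  Wait( iunit, id = id )

  ! CLOSE also waits for any outstanding transfers on the unit
  Close( iunit )

End Program async_write_sketch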

1 Answer


(As discussed in the comments) I would strongly recommend against using Fortran stream access for this. Standard Fortran I/O is only guaranteed to work if the file is being accessed by a single process, and in my own work I have seen random corruption of files when multiple processes try to write to them at once, even when the processes are writing to different parts of the file. MPI-I/O, or a library such as HDF5 or NetCDF that uses MPI-I/O under the hood, is the only sensible way to achieve this. Below is a simple program illustrating the use of `mpi_file_write_at_all`.

ian@eris:~/work/stack$ cat at.f90
Program write_at

  Use mpi

  Implicit None

  Integer, Parameter :: n = 4

  Real, Dimension( 1:n ) :: a

  Real, Dimension( : ), Allocatable :: all_of_a
  
  Integer :: me, nproc
  Integer :: handle
  Integer :: i
  Integer :: error
  
  ! Set up MPI
  Call mpi_init( error )
  Call mpi_comm_size( mpi_comm_world, nproc, error )
  Call mpi_comm_rank( mpi_comm_world, me   , error )

  ! Provide some data
  a = [ ( i, i = n * me, n * ( me + 1 ) - 1 ) ]

  ! Open the file
  Call mpi_file_open( mpi_comm_world, 'stuff.dat', &
       mpi_mode_create + mpi_mode_wronly, mpi_info_null, handle, error )

  ! Describe how the processes will view the file - in this case
  ! simply a stream of mpi_real
  Call mpi_file_set_view( handle, 0_mpi_offset_kind, &
       mpi_real, mpi_real, 'native', &
       mpi_info_null, error )

  ! Write the data using a collective routine - generally the most efficient,
  ! but as it is collective all processes within the communicator must call the routine
  Call mpi_file_write_at_all( handle, Int( me * n,mpi_offset_kind ) , &
       a, Size( a ), mpi_real, mpi_status_ignore, error )

  ! Close the file
  Call mpi_file_close( handle, error )

  ! Read the file on rank zero using Fortran to check the data
  If( me == 0 ) Then
     Open( 10, file = 'stuff.dat', access = 'stream' )
     Allocate( all_of_a( 1:n * nproc ) )
     Read( 10, pos = 1 ) all_of_a
     Write( *, * ) all_of_a
  End If

  ! Shut down MPI
  Call mpi_finalize( error )
  
End Program write_at
ian@eris:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ mpif90 -Wall -Wextra -fcheck=all -std=f2008 at.f90 
ian@eris:~/work/stack$ mpirun -np 2 ./a.out 
   0.00000000       1.00000000       2.00000000       3.00000000       4.00000000       5.00000000       6.00000000       7.00000000    
ian@eris:~/work/stack$ mpirun -np 5 ./a.out 
   0.00000000       1.00000000       2.00000000       3.00000000       4.00000000       5.00000000       6.00000000       7.00000000       8.00000000       9.00000000       10.0000000       11.0000000       12.0000000       13.0000000       14.0000000       15.0000000       16.0000000       17.0000000       18.0000000       19.0000000    
ian@eris:~/work/stack$ 
Ian Bush
  • Thank you for the explanation. This looks like the route I'll end up taking now. One question: is mpi_file_set_view a blocking operation? For my use, processes will be arriving at different times. I see there are non-blocking versions of the writes (mpi_file_iwrite_at), but I don't know how to deal with set_view. I could open and close the files outside of the 'irbsic' loop, but it looks like mpi_file_set_view requires the offset, and so will have to be inside the loop. – Carlos Sep 05 '20 at 22:07
  • From section 13.3 of the MPI standard at https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, "MPI_FILE_SET_VIEW is collective". Note also that mpi_file_open and mpi_file_close are collective routines. But I don't see this as a big issue given what you have above - just use 0 as the offset so all procs use global offsets in the file, and open, set the view and close once outside the main calculation, and all should be OK as I understand it (see the sketch below) – Ian Bush Sep 06 '20 at 06:37
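
Following up on the last comment, below is a rough, untested sketch of how the loop from the question might be arranged with MPI-I/O: the file is opened and the view set once before the orbital loop (displacement 0, so every offset is a global element count), the collective write inside the loop uses the same offset arithmetic as the stream version, and the file is closed once afterwards. The toy sizes, the one-rank-per-group simplification and the use of mpi_double_precision (the original byte offsets assumed 8-byte reals) are assumptions made for illustration; in the real code mygroup, nmsh, the data and the communicator would come from the existing decomposition.

Program write_blocks_sketch

  Use mpi

  Implicit None

  ! Toy sizes standing in for the question's norb, nmsh etc.
  Integer, Parameter :: norb = 3
  Integer, Parameter :: nmsh = 4

  Real( Kind( 1.0d0 ) ), Dimension( 1:nmsh ) :: pot

  Integer( mpi_offset_kind ) :: offset
  Integer :: me, nproc
  Integer :: ngroups, mygroup, blocksize, nmsh_tot
  Integer :: fh, error
  Integer :: irbsic

  Call mpi_init( error )
  Call mpi_comm_size( mpi_comm_world, nproc, error )
  Call mpi_comm_rank( mpi_comm_world, me   , error )

  ! For simplicity treat every rank as its own group here
  ngroups   = nproc
  mygroup   = me
  blocksize = nmsh
  nmsh_tot  = blocksize * ngroups

  ! Open once, before the orbital loop - collective over the communicator
  Call mpi_file_open( mpi_comm_world, 'ZPOT', &
       mpi_mode_create + mpi_mode_wronly, mpi_info_null, fh, error )

  ! Zero displacement and a view of double precision reals, so the
  ! offsets below are simply global element counts
  Call mpi_file_set_view( fh, 0_mpi_offset_kind, &
       mpi_double_precision, mpi_double_precision, 'native', &
       mpi_info_null, error )

  Do irbsic = 1, norb

     ! [various operations] - here just some recognisable data
     pot = Real( 100 * irbsic + mygroup, Kind( pot ) )

     ! Element offset: skip the previous orbitals, then this group's block
     ! (the same arithmetic the stream version did in bytes)
     offset = Int( irbsic - 1, mpi_offset_kind ) * Int( nmsh_tot , mpi_offset_kind ) &
            + Int( mygroup   , mpi_offset_kind ) * Int( blocksize, mpi_offset_kind )

     ! Collective write - every process in the communicator must call it
     Call mpi_file_write_at_all( fh, offset, pot, nmsh, &
          mpi_double_precision, mpi_status_ignore, error )

  End Do

  ! Close once, after the loop - also collective
  Call mpi_file_close( fh, error )

  Call mpi_finalize( error )

End Program write_blocks_sketch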