Parallelizing formated writes in Fortran with OpenMP

Question

I'm trying to parallelize a Fortran code which at one moment writes a tons of numbers to formated output. Some simple profiling showed that most CPU time is spent in format conversion, so I had the idea to do the formatting in parallel to character buffers and later write the unformatted buffers to the file.

My proof of concept looks like this:

program parawrite
   implicit none

   integer (kind = 4) :: i, j, tstart, tstop, rate
   integer (kind = 4), parameter :: bufsize = 100000, n = 10000000, llen = 22
   character (kind=1, len=:), allocatable :: buf
   real (kind=8), dimension(n) :: a

! some input
   do i = 1, n
      a(i) = dble(i) * dble(i)
   enddo

! formated writes for reference
   open(unit=10, file="out1.txt", form="formatted")
   call system_clock(tstart, rate);
   do i = 1, n
      write(10,"(E21.15)") a(i)
   end do
   call system_clock(tstop, rate);
   print *, 'Formated write: ', dble(tstop - tstart) / dble(rate), 's'
   close(10)

! parallel stuff
   open(unit=10, file="out2.txt", access="stream", form="unformatted")
   call system_clock(tstart, rate);

!$omp parallel private(buf, j)
   allocate(character(bufsize * llen) :: buf)
   j = 0;
!$omp do ordered schedule(dynamic,bufsize)
   do i = 1, n
      write (buf(j*llen+1:(j+1)*llen),"(E21.15,A1)") a(i), char(10)
      j = j + 1
      if (mod(i, bufsize) == 0) then
!$omp ordered
         write (10) buf
!$omp end ordered
         j = 0
      end if
   end do
   deallocate(buf)
!$omp end parallel

   close(10)
   call system_clock(tstop, rate);
   print *, 'Parallel write: ', dble(tstop - tstart) / dble(rate), 's'

end program parawrite

When I run it, however, not only is the parallel version much slower when at single thread, it also doesn't scale too much...

$ gfortran -O2 -fopenmp writetest.f90

$ OMP_NUM_THREADS=1 ./a.out
Formated write:    11.330000000000000      s
Parallel write:    15.625999999999999      s

$ OMP_NUM_THREADS=6 ./a.out
Formated write:    11.331000000000000      s
Parallel write:    6.1799999999999997      s

My first question would be how to make it the same speed at single thread? The time spent writing the buffer to the file is negligible, so why are the writes to the buffer slower than when writing directly to file?

My second question is about why the scaling is so bad? I have an equivalent C code which uses sprintf and fwrite and there I can get almost perfect linear scaling (I can post the code if needed), however with Fortran I can only reduce runtime to around 40% at 6 threads (with C I can reduce it to 18% at the same number of threads). It is still faster than the serial version, but I hope this could be improved.

A comment: In my professional work I am forced to use fortran codes, however, I and most of my colleagues refuse to further develop old fortran or even new fortran projects. We always replace new/refurbished parts by C++/python. Thus, I don't really understand questions like this one: if you already know that in C you get better scaling: why you not just replace this (new) functionality of your code with C (or C++)? That should be very simple, and the global combined intelligence in C/C++ is much higher nowadays and in the future compared to the one in fortran... — Ralf Ulrich, Mar 08 '19 at 21:02
I'm not even sure if what the OP is trying to do is well defined by the Fortran Standard - certainly if it were separate processes he would have problems, but threads I'm not so sure about. And if it is why the character buffer and the internal write? Store it in a real buffer, write the real buffer with an appropriate format. You can probably get away without even copying the memory. What am I missing? — Ian Bush, Mar 09 '19 at 10:59
@Ian Bush, thank you for the comment. Can you please elaborate what would be the problem regarding fortran standard? Also what do you mean by "real buffer"? I was able to find some BUFFER IN/OUT statements in some examples (as the HP Fortran Programmer’s Reference) but this looks like non-standard. — Pavel Ondračka, Mar 09 '19 at 12:36
@roygvib, thank you for the comment. I still don't understand why there would be overhead (or even any interaction) in simultaneous writes to different! (private per thread) internal files? Can I get some pointers about such issues? I also wasn't able to find any reasons why the use of internal file should be slower than normal file. — Pavel Ondračka, Mar 09 '19 at 12:44
@PavelOndračka I've just searched the net with possible keywords, but cannot find good pages ... (though these two pages seem to contain some useful info https://comp.lang.fortran.narkive.com/I41hdvuP/why-gfortran-s-stream-access-is-so-slow http://astroa.physics.metu.edu.tr/MANUALS/intel_ifc/mergedProjects/optaps_for/fortran/optaps_prg_io_f.htm For more info, it may be useful to ask a question in [comp.lang.fortran](https://groups.google.com/forum/#!forum/comp.lang.fortran) — roygvib, Mar 09 '19 at 18:48
@RalfUlrich This is just a Language war rant/trolling and can be applied to any question. Why don't you do it in language X when you have problems in language Y? The OP asked about a problem in language Y and that is a completely legitimate question. — Vladimir F Героям слава, Mar 11 '19 at 14:38
@PavelOndračka Note that the magic number kind constants 4, 8 and 1 do not portably mean 4-byte or 8-byte, are ugly code smell and are not portable among all widely used compilers. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter — Vladimir F Героям слава, Mar 11 '19 at 14:39

roygvib · Accepted Answer · 2019-03-09T22:21:19.983

From some experiments, it seems that an internal file is rather slow if an array element is converted to an internal file one at a time. This is also the case for an external file, but the degree of slowdown seems much greater for internal files (for some reason...). So I've modified the code such that a set of array elements are converted at once and then written to an external file via stream output. Below, four patterns are compared:

Sequential (1): The original code (which writes each element via do-loop)
Sequential (2): Write an array at once (or via implied loop) to an external file
Parallel (1): Make an internal file for many elements and then write to an external file
Parallel (2): Simplest parallel code with formatted write or spirntf for each element

Among these, Parallel (2) + sprintf (marked with *2 in the code) was the fastest, while Parallel (2) + write for each element (marked with *1) was the slowest (timing shown as Parallel (*) in the table, which does not scale with OpenMP for some reason). I guess sprintf will be the fastest probably because of the least amount of internal checks and overhead etc (just a guess!)

Results (please see the bottom for the modified codes)

$ gcc -O3 -c conv.c && gfortran -O3 -fopenmp test.f90 conv.o

# Machine: Core i7-8550U (1.8GHz), 4-core/8-thread, Ubuntu18.04 (GCC7.3.0)

# Note: The amount of data has been reduced to 1/5 of the 
# original code, n = bufsize * 20, but the relative
# timing results remain the same even for larger data.

$ OMP_NUM_THREADS=1 ./a.out
 Sequential (1):   2.0080000000000000      s
 Sequential (2):   1.6510000000000000      s
 Parallel   (1):   1.6960000000000000      s
 Parallel   (2):   1.2640000000000000      s
 Parallel   (*):   3.1480000000000001      s

$ OMP_NUM_THREADS=2 ./a.out
 Sequential (1):   1.9990000000000001      s
 Sequential (2):   1.6479999999999999      s
 Parallel   (1):   0.98599999999999999     s
 Parallel   (2):   0.72999999999999998     s
 Parallel   (*):   1.8600000000000001      s   

$ OMP_NUM_THREADS=4 ./a.out
 Sequential (1):   2.0289999999999999      s
 Sequential (2):   1.6499999999999999      s
 Parallel   (1):   0.61199999999999999     s
 Parallel   (2):   0.49399999999999999     s
 Parallel   (*):   1.4470000000000001      s

$ OMP_NUM_THREADS=8 ./a.out
 Sequential (1):   2.0059999999999998      s
 Sequential (2):   1.6499999999999999      s
 Parallel   (1):   0.56200000000000006     s
 Parallel   (2):   0.41299999999999998     s
 Parallel   (*):   1.7689999999999999      s

main.f90:

program main
    implicit none
    integer :: i, j, k, tstart, tstop, rate, idiv, ind1, ind2
    integer, parameter :: bufsize = 100000, n = bufsize * 20, llen = 22, ndiv = 8
    character(len=:), allocatable :: buf(:), words(:)
    character(llen + 1) :: word
    real(8), allocatable :: a(:)

    allocate( a( n ) )

! Some input
    do i = 1, n
        a(i) = dble(i)**2
    enddo

!.........................................................
! Formatted writes (1).

    open(unit=10, file="dat_seq1.txt", form="formatted")
    call system_clock(tstart, rate);

    do i = 1, n
        write(10,"(ES21.15)") a(i)
    end do

    call system_clock(tstop, rate);
    print *, 'Sequential (1):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

!.........................................................
! Formatted writes (2).

    open(unit=10, file="dat_seq2.txt", form="formatted")
    call system_clock(tstart, rate);

    write( 10, "(ES21.15)" ) a
!    write( 10, "(ES21.15)" ) ( a( k ), k = 1, n )

    call system_clock(tstop, rate);
    print *, 'Sequential (2):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

!.........................................................
! Parallel writes (1): make a formatted string for many elements at once

    allocate( character( llen * bufsize / ndiv ) :: buf( ndiv ) )

    open(unit=10, file="dat_par1.txt", access="stream", form="unformatted")
    call system_clock(tstart, rate);

    do i = 1, n, bufsize

       !$omp parallel do private( idiv, ind1, ind2, k ) shared( i, buf, a )
        do idiv = 1, ndiv
            ind1 = i + (idiv - 1) * bufsize / ndiv
            ind2 = ind1 + bufsize / ndiv - 1

            write( buf( idiv ),"(*(ES21.15, A1))") &
                    ( a( k ), char(10), k = ind1, ind2 )
        enddo
        !$omp end parallel do

        write(10) buf
    end do

    call system_clock(tstop, rate);
    print *, 'Parallel   (1):', dble(tstop - tstart) / dble(rate), 's'
    deallocate(buf)
    close(10)

!.........................................................
! Parallel writes (2): sprintf vs write for each element

    allocate( character( llen ) :: words( n ) )

    open(unit=10, file="dat_par2.txt", access="stream", form="unformatted")
    call system_clock(tstart, rate);

    !$omp parallel do private( i, word ) shared( a, words )
    do i = 1, n
        ! write( word, "(ES21.15, A1)" ) a( i ), char(10)  !! slow (*1)
        call conv( word, a( i ) )  !! sprintf (*2)
        words( i ) = word( 1 : llen )
    enddo
    !$omp end parallel do

    write( 10 ) words

    call system_clock(tstop, rate);
    print *, 'Parallel   (2):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

end program

conv.c:

#include <stdio.h>

void conv_( char *buf, double *val )
{
    sprintf( buf, "%21.15E\n", *val );
}

Wow, thanks. This is a very thorough analysis. Hopefully I can retain the speedups when I convert it back to the ordered loop (as this version uses too much extra memory). I was aware that the conversion one at time is slower than when doing more at once, however it never occurred to me that this would be worse for internal files. Also why doing the former in parallel does not scale is a mystery. This answer solves my issues, however I would still be grateful for explanation of the remaining questions. — Pavel Ondračka, Mar 09 '19 at 19:19
Though just a speculation, I guess the "write" to an external file may be using some internal buffer (for gfortran but not for ifort?), which may make the write operation more efficient (even for element-by-element case). But for internal writes, the library needs to finish the conversion one at a time (to store the result into a character string), which may make the overhead more explicit. And, as for the reason why formatted write for each element does not scale, it is really a mystery.. (there might be some parallel locking thing or resource competition in the internal library?? no idea...) — roygvib, Mar 09 '19 at 19:40
You do not define the interface to `conv` anywhere. Shouldn't the Fortran side therefore assume that a hidden character length is passed with a possible stack corruption happening? — Vladimir F Героям слава, Mar 11 '19 at 14:33
Indeed, why I try to print the value of such a hidden parameter, I get 23. The program is invalid. — Vladimir F Героям слава, Mar 11 '19 at 15:12
[@VladimirF Hi, thanks very much for this point, and I will try to test it with the same computer and update the answer. I think "raw" C always needs much caution...] — roygvib, Mar 27 '19 at 06:07

Parallelizing formated writes in Fortran with OpenMP

1 Answers1