I'm trying to parallelize a Fortran code which at one moment writes a tons of numbers to formated output. Some simple profiling showed that most CPU time is spent in format conversion, so I had the idea to do the formatting in parallel to character buffers and later write the unformatted buffers to the file.
My proof of concept looks like this:
program parawrite
implicit none
integer (kind = 4) :: i, j, tstart, tstop, rate
integer (kind = 4), parameter :: bufsize = 100000, n = 10000000, llen = 22
character (kind=1, len=:), allocatable :: buf
real (kind=8), dimension(n) :: a
! some input
do i = 1, n
a(i) = dble(i) * dble(i)
enddo
! formated writes for reference
open(unit=10, file="out1.txt", form="formatted")
call system_clock(tstart, rate);
do i = 1, n
write(10,"(E21.15)") a(i)
end do
call system_clock(tstop, rate);
print *, 'Formated write: ', dble(tstop - tstart) / dble(rate), 's'
close(10)
! parallel stuff
open(unit=10, file="out2.txt", access="stream", form="unformatted")
call system_clock(tstart, rate);
!$omp parallel private(buf, j)
allocate(character(bufsize * llen) :: buf)
j = 0;
!$omp do ordered schedule(dynamic,bufsize)
do i = 1, n
write (buf(j*llen+1:(j+1)*llen),"(E21.15,A1)") a(i), char(10)
j = j + 1
if (mod(i, bufsize) == 0) then
!$omp ordered
write (10) buf
!$omp end ordered
j = 0
end if
end do
deallocate(buf)
!$omp end parallel
close(10)
call system_clock(tstop, rate);
print *, 'Parallel write: ', dble(tstop - tstart) / dble(rate), 's'
end program parawrite
When I run it, however, not only is the parallel version much slower when at single thread, it also doesn't scale too much...
$ gfortran -O2 -fopenmp writetest.f90
$ OMP_NUM_THREADS=1 ./a.out
Formated write: 11.330000000000000 s
Parallel write: 15.625999999999999 s
$ OMP_NUM_THREADS=6 ./a.out
Formated write: 11.331000000000000 s
Parallel write: 6.1799999999999997 s
My first question would be how to make it the same speed at single thread? The time spent writing the buffer to the file is negligible, so why are the writes to the buffer slower than when writing directly to file?
My second question is about why the scaling is so bad? I have an equivalent C code which uses sprintf and fwrite and there I can get almost perfect linear scaling (I can post the code if needed), however with Fortran I can only reduce runtime to around 40% at 6 threads (with C I can reduce it to 18% at the same number of threads). It is still faster than the serial version, but I hope this could be improved.