I realize this question has been asked before, but not in the context of IO. Is there any reason to believe that:

!compiler can tell that it should write the whole array at once?
!but perhaps compiler allocates/frees temporary array?
write(UNIT) (/( arr(i), i=1,N )/)

would be any more efficient than:

!compiler does lots of IO here?
do i=1,N
   write(UNIT) arr(i)
enddo

for a file which is opened as:

open(unit=UNIT,access='STREAM',file=fname,status='UNKNOWN')

There is a possibility that this will be used with compiler options to turn off buffered writing as well ...

mgilson
  • Whaddya mean by 'more efficient' ? If you mean anything more than 'faster' please specify. And if you do mean just 'faster' what have your own measurements told you so far ? – High Performance Mark Sep 24 '12 at 16:04
  • @HighPerformanceMark -- I do mean just `faster`. As far as my own measurements -- I'm not really sure if there is a great way to test that reliably. I can make an attempt to instrument with `mpi_wtime` (is there a better way to benchmark using Fortran?). I was mostly wondering if there's a rule of thumb here ... – mgilson Sep 24 '12 at 16:08
  • @VladimirF -- It needs to work on any standard compliant compiler. I typically use `gfortran`, but it looks like we'll be getting the `intel` compilers for our new cluster. Our old cluster has portland group. ... – mgilson Sep 24 '12 at 16:09
  • As I often comment - this is a question amenable to experimental investigation and not to argumentation. Test on your platform and form your own conclusions. If anything on your platform (compiler, compiler version, hardware, o/s, ...) changes and if it is important to you re-test and re-measure. No, there are no rules of thumb worth a damn when hard data climbs into the arena. – High Performance Mark Sep 24 '12 at 16:19
  • @HighPerformanceMark -- In a simple benchmark I set up using gfortran, the first form ran approximately 2x as fast, but I have to admit, I'm a little worried my benchmark isn't any good. – mgilson Sep 24 '12 at 16:19
  • Beware, implied do loops with I/O might cause serious memory leaks in Intel Fortran. I encountered this last year, but the problem may very well have been fixed since then. I no longer use implied do loops just to be sure. – bdforbes Sep 24 '12 at 23:18

2 Answers

As suggested by @HighPerformanceMark, here's a simple benchmark I set up:

Using gfortran:

  program main
  implicit none
  include 'mpif.h'
  integer, parameter :: N = 1000000 
  integer :: unit = 22
  integer i
  real*8 arr(N)
  real*8 t1
  integer repeat
  external test1
  external test2
  external test3

  repeat=15

  call MPI_INIT(i)

  arr = 0
  call timeit(test1,repeat,arr,N,t1)
  print*,t1/repeat

  call timeit(test2,repeat,arr,N,t1)
  print*,t1/repeat

  call timeit(test3,repeat,arr,N,t1)
  print*,t1/repeat

  call MPI_Finalize(i)

  end

  subroutine timeit(sub,repeat,arr,size,time)
  include 'mpif.h'
  external sub
  integer repeat
  integer size
  real*8 time,t1
  real*8 arr(size)
  integer i
  time = 0
  do i=1,repeat
     open(unit=10,access='STREAM',file='test1',status='UNKNOWN')
     t1 = mpi_wtime()
     call sub(10,arr,size)
     time = time + (mpi_wtime()-t1)
     close(10)
  enddo

  return
  end

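  ! test1: explicit loop, one write statement per element
  subroutine test1(ou,a,N)
  integer N
  real*8 a(N)
  integer ou
  integer i
  do i=1,N
     write(ou) a(i)
  enddo
  return
  end

  ! test2: a single write with an implied do loop
  subroutine test2(ou,a,N)
  integer N
  real*8 a(N)
  integer ou
  integer i
  write(ou) (a(i),i=1,N)
  return
  end

  ! test3: a single write of an array section
  subroutine test3(ou,a,N)
  integer N
  real*8 a(N)
  integer ou
  write(ou) a(1:N)
  return
  end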

My results are (unbuffered):

temp $ GFORTRAN_UNBUFFERED_ALL=1 mpirun -np 1 ./test
   6.2392100652058922     
   3.3046503861745200     
   9.76902325948079409E-002

(buffered):

temp $ GFORTRAN_UNBUFFERED_ALL=0 mpirun -np 1 ./test
  2.7789104779561362     
  0.15584923426310221     
  9.82964992523193415E-002
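
On the assumed-size question raised in the comments: a whole assumed-size array cannot appear in an I/O list, but a section with an explicit upper bound such as a(1:n) can, so the test3 form should still be usable with a dummy declared as a(*). The following is a minimal sketch, not part of the benchmark; the program name, the file name assumed_size_demo.bin and the helper write_section are made up for illustration.

```fortran
program assumed_size_write
  implicit none
  integer, parameter :: n = 5
  double precision :: arr(n), back(n)
  integer :: ou, k

  arr = [(dble(k), k = 1, n)]

  ! Write the data through a routine that only sees an assumed-size dummy.
  open(newunit=ou, file='assumed_size_demo.bin', access='stream', &
       form='unformatted', status='replace')
  call write_section(ou, arr, n)
  close(ou)

  ! Read it back and compare, then remove the scratch file.
  open(newunit=ou, file='assumed_size_demo.bin', access='stream', &
       form='unformatted', status='old')
  read(ou) back
  close(ou, status='delete')

  if (any(back /= arr)) error stop 'round trip failed'
  print *, 'round trip ok'

contains

  ! Hypothetical helper: the dummy is assumed-size, so its extent is unknown
  ! here and the section must carry an explicit upper bound.
  subroutine write_section(ou, a, n)
    integer, intent(in) :: ou, n
    double precision, intent(in) :: a(*)
    write(ou) a(1:n)
  end subroutine write_section

end program assumed_size_write
```

Whether a given compiler writes this section as fast as a whole-array write is something to measure rather than assume.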
mgilson
  • If others, with different compilers want to add their benchmarks, I would love to see if it makes a difference with other compilers. – mgilson Sep 24 '12 at 17:00
  • Looks ok, but I would not rely on a single execution of a particular code block as an indication of the time taken in that block. I would split your two code blocks into separate programs (this eliminates any concerns that the order of the code blocks matter) and then I would run each program several, or better yet hundreds, of times and take an average of the execution times for each program as the measurement for that code block. Finally I would take the opening of the file outside of your timed region (all you care about is the writing speed, no point having additional code in there). – Chris Sep 24 '12 at 17:50
  • Also, you don't close the files once you are done with them. I would do this, again outside of the time region, for both code blocks. – Chris Sep 24 '12 at 17:50
  • @Chris -- retested moving the file opening out of the initial blocks. Also closed the units when I finished (outside of the timing region). I suppose I could run it hundreds of times for each test individually, but that seems like I could start to run into problems where my laptop is working on something else in the meantime... – mgilson Sep 24 '12 at 17:56
  • Interesting, I should probably review my code and see if I can get rid of some do loops. I use some implied ones, but probably not enough of them. Also, why not just `write(unit) arr(1:N)`? – Vladimir F Героям слава Sep 24 '12 at 18:10
  • @VladimirF -- Because I didn't think of it. My arrays are actually declared in a subroutine as `real*8 :: arr(*)`. I'm not sure if I can slice old style arrays like that. They're part of a module, but I need to have `arr(*)` as I'm passing arrays with different dimensions ... Maybe there's a cleaner way? – mgilson Sep 24 '12 at 18:14
  • @mgilson *"I could start to run into problems where my laptop is working on something else in the meantime"* This is *exactly* why you run the test multiple times. If your laptop is busy doing other things during one of your timing blocks but finishes before timing the second block then your timings are meaningless. You should always use the average of a large number of timing tests in order to *estimate* the execution time of a particular piece of code. I can't emphasise this last sentence enough. – Chris Sep 24 '12 at 18:31
  • @mgilson Can you update your test to include @VladimirF's suggestion, I would guess that `write(unit) (/ (arr(i), i=1,N ) /)` simply gets optimised by the compiler to `write(unit) arr(1:N)` (or equivalently just `write(unit) arr`) --- although I am just guessing. – Chris Sep 24 '12 at 18:33
  • @Chris -- Sorry, I wasn't clear. It's not the part where you run the test many times that I take issue with. It's that you're implying that the order of the testing might matter. **IF** the order of the testing matters, then it seems to me that something is wrong with the program. Otherwise, your tests are meaningless anyway since you don't know what will happen before or after you run the test in your real program. – mgilson Sep 24 '12 at 18:34
  • I have no idea here, but does your compiled program have some setup costs the first time it performs a given task, which is subsequently cheaper? For example, is it more costly to open the first file than the second file you use in your program? Again, I have no idea if this is even a reasonable suggestion. However, one way to eliminate these concerns is to separate your code blocks into separate minimal programs which are as identical as possible. Anyway, I am only speculating here. Your tests are ok, just repeat them many times and take an average, then I will be happy to upvote this answer. – Chris Sep 24 '12 at 18:38
  • @Chris -- I've made it so that the tests are each run `repeat` times (here 15). It doesn't seem to change the outcome really ... but the updated code is posted above. – mgilson Sep 24 '12 at 19:03

I compiled and ran the above benchmark code using both gfortran (4.7.2 20120921) and ifort (13.0.0.079 Build 20120731). My results are as follows:

gfortran

          UNBUFFERED                BUFFERED
test1:    1.2614487171173097        0.20308602650960286     
test2:    1.0525423844655355        3.4633986155192059E-002
test3:    5.9630711873372398E-003   6.0543696085611975E-003

ifort

          UNBUFFERED                BUFFERED
test1:    1.33864809672038          0.171342913309733
test2:    6.001885732014974E-003    6.095488866170247E-003
test3:    5.962880452473959E-003    6.007925669352213E-003

It would appear that the explicit loop in test1 is by far the most disadvantageous in both cases (without any optimisation flags set). Furthermore, with the Intel compiler there is no significant difference in execution time whether you run write(ou), (a(i), i=1, N) (case 2) or write(ou), a(1:N) (case 3, identical to simply write(ou), a in this case).

By the way, for this single-threaded process you can also just use the Fortran 95 intrinsic subroutine cpu_time, which sums over all threads and returns a time in seconds. Bear in mind that it reports processor time rather than wall-clock time, so time spent waiting on I/O is typically not counted. Otherwise there is also the Fortran 90 intrinsic system_clock, which returns the current clock count and the count rate (ticks per second) as integers, possibly to higher precision.
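
As a rough illustration of the system_clock approach, here is a minimal sketch that times the whole-array write without MPI. It is not the code used for the numbers above: the program name and the scratch file timing_demo.bin are made up, and the 64-bit integer arguments are an assumption meant to get a finer clock where the compiler provides one.

```fortran
program time_write
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 1000000
  integer, parameter :: repeat = 15   ! same repeat count as the benchmark above
  real(real64) :: arr(n)
  real(real64) :: total
  integer(int64) :: t0, t1, rate
  integer :: ou, k

  arr = 0.0_real64
  total = 0.0_real64
  call system_clock(count_rate=rate)  ! clock ticks per second

  do k = 1, repeat
     ! Opening and closing the file stay outside the timed region.
     open(newunit=ou, file='timing_demo.bin', access='stream', &
          form='unformatted', status='replace')
     call system_clock(t0)
     write(ou) arr                    ! whole-array write (the test3 form)
     call system_clock(t1)
     close(ou, status='delete')
     total = total + real(t1 - t0, real64) / real(rate, real64)
  end do

  print *, 'mean seconds per write:', total / repeat
end program time_write
```

Because the close is outside the timed region, any data still sitting in the runtime's buffer when the write returns is not counted, which matches how the benchmark above is set up.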

sigma
  • You're right about the intrinsics, but everything that I've read states that they are not typically implemented to very high resolution and are therefore not well suited for profiling. Thanks for adding the output of `ifort` though. It's good to know. – mgilson Oct 22 '12 at 17:06