
I am using a relatively simple code parallelized with OpenMP to familiarize myself with gprof.

My code mainly consists of gathering data from input files, performing some array manipulations, and writing the new data to different output files. I placed some calls to the intrinsic subroutine CPU_TIME to see whether gprof was being accurate:

PROGRAM main
    USE global_variables
    USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
    USE change_vars
    IMPLICIT NONE
    REAL(dp) :: t0, t1

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL allocate_data
    CALL CPU_TIME(t1)
    PRINT*, "Allocate data =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL build_grid
    CALL CPU_TIME(t1)
    PRINT*, "Build grid    =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL read_old_restart
    CALL CPU_TIME(t1)
    PRINT*, "Read restart  =", t1 - t0


    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL regroup_all
    CALL CPU_TIME(t1)
    PRINT*, "Regroup all   =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL redistribute_all
    CALL CPU_TIME(t1)
    PRINT*, "Redistribute  =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL write_new_restart
    CALL CPU_TIME(t1)
    PRINT*, "Write restart =", t1 - t0
END PROGRAM main

Here is the output:

 Allocate data =  1.000000000000000E-003
 Build grid    =  0.000000000000000E+000
 Read restart  =   10.7963590000000
 Regroup all   =   6.65998700000000
 Redistribute  =   14.3518180000000
 Write restart =   53.5218640000000

Therefore, the write_new_restart subroutine is the most time consuming and takes about 62% of the total run time (53.5 s out of roughly 85.3 s measured). However, according to gprof, the subroutine redistribute_vars, which is called multiple times by redistribute_all, is the most time consuming, with about 74% of the total time. Here is the output from gprof:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.40      8.95     8.95       61     0.15     0.15  change_vars_mp_redistribute_vars_
 19.12     11.25     2.30       60     0.04     0.04  change_vars_mp_regroup_vars_
  6.23     12.00     0.75       63     0.01     0.01  change_vars_mp_fill_last_blocks_
  0.08     12.01     0.01        1     0.01     2.31  change_vars_mp_regroup_all_
  0.08     12.02     0.01                             __intel_ssse3_rep_memcpy
  0.08     12.03     0.01                             for_open
  0.00     12.03     0.00        1     0.00    12.01  MAIN__
  0.00     12.03     0.00        1     0.00     0.00  change_vars_mp_build_grid_
  0.00     12.03     0.00        1     0.00     9.70  change_vars_mp_redistribute_all_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_read_old_restart_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_write_new_restart_
  0.00     12.03     0.00        1     0.00     0.00  global_variables_mp_allocate_data_


index % time    self  children    called     name
                0.00   12.01       1/1           main [2]
[1]     99.8    0.00   12.01       1         MAIN__ [1]
                0.00    9.70       1/1           change_vars_mp_redistribute_all_ [3]
                0.01    2.30       1/1           change_vars_mp_regroup_all_ [5]
                0.00    0.00       1/1           global_variables_mp_allocate_data_ [13]
                0.00    0.00       1/1           change_vars_mp_build_grid_ [10]
                0.00    0.00       1/1           fileio_mp_read_old_restart_ [11]
                0.00    0.00       1/1           fileio_mp_write_new_restart_ [12]
-----------------------------------------------
                                                 <spontaneous>
[2]     99.8    0.00   12.01                 main [2]
                0.00   12.01       1/1           MAIN__ [1]
-----------------------------------------------
                0.00    9.70       1/1           MAIN__ [1]
[3]     80.6    0.00    9.70       1         change_vars_mp_redistribute_all_ [3]
                8.95    0.00      61/61          change_vars_mp_redistribute_vars_ [4]
                0.75    0.00      63/63          change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
                8.95    0.00      61/61          change_vars_mp_redistribute_all_ [3]
[4]     74.4    8.95    0.00      61         change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
                0.01    2.30       1/1           MAIN__ [1]
[5]     19.2    0.01    2.30       1         change_vars_mp_regroup_all_ [5]
                2.30    0.00      60/60          change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
                2.30    0.00      60/60          change_vars_mp_regroup_all_ [5]
[6]     19.1    2.30    0.00      60         change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
                0.75    0.00      63/63          change_vars_mp_redistribute_all_ [3]
[7]      6.2    0.75    0.00      63         change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.1    0.01    0.00                 for_open [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.1    0.01    0.00                 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[10]     0.0    0.00    0.00       1         change_vars_mp_build_grid_ [10]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[11]     0.0    0.00    0.00       1         fileio_mp_read_old_restart_ [11]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[12]     0.0    0.00    0.00       1         fileio_mp_write_new_restart_ [12]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[13]     0.0    0.00    0.00       1         global_variables_mp_allocate_data_ [13]
-----------------------------------------------

For your information, regroup_all calls regroup_vars multiple times and redistribute_all calls redistribute_vars and fill_last_blocks multiple times.

I am compiling my code with ifort using the -openmp -O2 -pg options.

QUESTION:

Why does gprof not see the time my file I/O subroutines (read_old_restart, write_new_restart) take?

solalito
  • I'm not sure why the I/O disappears in gprof's output, but from my experience, this usually happens for time that is spent in dynamic libraries. Try to re-link your code statically and see what the profiler gives you... – Gilles Jan 26 '16 at 11:12
  • I'm not using any libraries with this code. – solalito Jan 26 '16 at 11:18
  • You are at least using libc and the Fortran run-time library, and that's where the actual I/O time is spent. Try adding `-static` to your linking command line, or at least, if you have issues with the plain `-static` option, try `-static-intel`. – Gilles Jan 26 '16 at 11:28

1 Answer


gprof specifically does not include I/O time. It only tries to measure CPU time.

That's because it only does two things: 1) it samples the program counter on a 1/100-second clock, and the program counter is meaningless during I/O, and 2) it counts the number of times any function B is called by any function A.

From the call counts, it tries to guess how much of each function's CPU time can be attributed to each caller. That's its whole advance over pre-existing profilers.

When you use gprof, you should understand what it does and what it doesn't do.
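If you want numbers for the I/O routines themselves, a wall-clock timer around the calls will capture the elapsed time that the samples miss. A minimal sketch using the standard SYSTEM_CLOCK intrinsic, reusing the fileio module from your question (the wrapper subroutine itself is just an illustration, not part of your program):

SUBROUTINE time_write_restart
    USE fileio, ONLY: write_new_restart
    IMPLICIT NONE
    INTEGER(KIND=8) :: c0, c1, rate

    ! SYSTEM_CLOCK counts elapsed (wall-clock) ticks, so the result
    ! includes time spent blocked in the OS and in the Fortran
    ! run-time library doing I/O -- time that never shows up against
    ! write_new_restart in gprof's flat profile.
    CALL SYSTEM_CLOCK(c0, rate)
    CALL write_new_restart
    CALL SYSTEM_CLOCK(c1)
    PRINT*, "Write restart (wall) =", REAL(c1 - c0) / REAL(rate)
END SUBROUTINE time_write_restart

Since you already build with -openmp, omp_get_wtime from omp_lib would give the same kind of elapsed-time measurement.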

Mike Dunlavey
  • Thanks, I was in the middle of editing my question after Gilles' recommendation. I started to use gprof with the intent to optimize some of my parallel codes, but I guess it is not well suited for that. Is there any specific tool you would recommend? – solalito Jan 26 '16 at 14:42
  • @solalito: First, turn off the compiler's optimizer and tune it in single-thread mode. You can either single-step it or use [*this method*](http://stackoverflow.com/a/378024/23771), which is not only not a tool, but it works better than tools because it finds more. [*Here's why.*](http://scicomp.stackexchange.com/a/2719/1262) After you cannot tune it any further, then turn on -O3 and the parallelism, and measure the results. – Mike Dunlavey Jan 26 '16 at 15:13