
My MPI Program to measure broadcast time:

MPI_Barrier(MPI_COMM_WORLD); 
total_mpi_bcast_time -= MPI_Wtime(); 
MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD); 
MPI_Barrier(MPI_COMM_WORLD); 
total_mpi_bcast_time += MPI_Wtime(); 

We need MPI_Barrier to wait until all processes have completed their work (synchronization). But MPI_Barrier is itself a collective communication (all processes report in before any may continue), so our measured time will be Barrier_time + Broadcast_time. How can we measure only the broadcast time correctly?
This is result from Scalasca:

Estimated aggregate size of event trace:                   1165 bytes
Estimated requirements for largest trace buffer (max_buf): 292 bytes
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
or reduce requirements using USR regions filters.)

flt     type max_buf[B] visits time[s] time[%] time/visit[us]  region
        ALL     291       32   0.38    100.0       11930.30  ALL
        MPI     267       28   0.38    100.0       13630.27  MPI
        COM     24        4    0.00     0.0          30.54  COM

        MPI     114       8    0.00     0.1          33.08  MPI_Barrier
        MPI     57        4    0.00     0.0          26.53  MPI_Bcast
        MPI     24        4    0.00     0.2         148.50  MPI_Finalize
        MPI     24        4    0.00     0.0           0.57  MPI_Comm_size
        MPI     24        4    0.00     0.0           1.61  MPI_Comm_rank
        MPI     24        4    0.38    99.7       95168.50  MPI_Init
        COM     24        4    0.00     0.0          30.54  main

But I don't know how Scalasca measures it. Even when I run it on a single machine, does MPI_Bcast really cost 0%?

voxter
  • Did you mean that our measured time = 2 * Broadcast? But from some papers I read, the MPI_Barrier and MPI_Bcast implementations are not the same algorithm, so MPI_Bcast time != MPI_Barrier time. – voxter Mar 17 '16 at 15:46
  • I would highly recommend to use a proper performance analysis tool if you are interested in the performance of your parallel communication. Here is [an overview](http://stackoverflow.com/a/10608276/620382), also consider the comments there. – Zulan Mar 17 '16 at 16:49
  • 1
    @HighPerformanceMark, altough `MPI_Bcast` is a collective call. the only _always synchronising_ call is `MPI_Barrier`. The standard allows for the ranks to exit other collective calls as soon as their job is finished. Without a second barrier, the time for `MPI_Bcast` as measured in different ranks will vary depending on the algorithm. – Hristo Iliev Mar 17 '16 at 19:10
  • I need to measure it directly, so performance analysis tools aren't the optimal way. Any other ideas for measuring it? – voxter Mar 18 '16 at 14:38
  • From all the papers I have read, none report using a performance analysis tool to measure communication time. – voxter Mar 18 '16 at 15:11
  • @Zulan, please check the result from the performance analysis tool in my edit. I am not sure the measured time is right. – voxter Mar 19 '16 at 12:45
  • That output tells you the average time of each MPI_Bcast call is 26.5 us. The percentage is 0 because the overwhelming amount of time in this program is spent in MPI_Init. This is not at all surprising in a program that only does a single Bcast. – dabo42 Mar 19 '16 at 22:44
  • You might want to look at the .cubex output file with Scalasca's Cube GUI, which shows you the exact time of the Bcast on each rank. – dabo42 Mar 19 '16 at 22:46

1 Answer


From your example it seems that what you want to know is "the time from the first process entering the Bcast call until the time of the last process leaving the Bcast call". Note that not all of that time is actually spent inside MPI_Bcast. In fact, it is perfectly possible that some processes have left the Bcast call before others have even entered.

Anyway, probably the best way to go is to measure the time between the first Barrier and the Bcast exit on each process, and use a Reduction to find the maximum:

MPI_Barrier(MPI_COMM_WORLD);
local_mpi_bcast_time -= MPI_Wtime();

MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD); 
local_mpi_bcast_time += MPI_Wtime();

MPI_Reduce(&local_mpi_bcast_time, &total_mpi_bcast_time, 1, MPI_DOUBLE,            
           MPI_MAX, 0, MPI_COMM_WORLD);

This is still not 100% accurate, because processes may leave the barrier at slightly different times, but it is about the best you can get with MPI means.
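Put together, a minimal self-contained sketch might look like the following (the buffer size and variable names are illustrative assumptions, not taken from the question's program):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int num_elements = 1 << 20;   /* illustrative buffer size */
    int *data = malloc(num_elements * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < num_elements; i++)
            data[i] = i;

    /* Align all ranks as closely as possible before starting the clock. */
    MPI_Barrier(MPI_COMM_WORLD);
    double local_time = -MPI_Wtime();

    MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
    local_time += MPI_Wtime();          /* per-rank time in the broadcast */

    /* The slowest rank defines the "total" broadcast time as defined above. */
    double total_time;
    MPI_Reduce(&local_time, &total_time, 1, MPI_DOUBLE,
               MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI_Bcast took %.6f s\n", total_time);

    free(data);
    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc` and run with, e.g., `mpirun -n 4 ./bcast_time`; since MPI_MAX is used in the reduction, rank 0 reports the time of whichever rank finished the broadcast last.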

I also suggest you take a look at the performance analysis tools suggested by Zulan, because they take care of all the mutli-process communication idiosyncrasies.

dabo42
  • Without an MPI_Barrier after MPI_Bcast, processes will leave MPI_Bcast immediately after sending or receiving their data. We need to wait from the first process to the last process that completes the broadcast, so we need a synchronizing call (Barrier) after the broadcast. – voxter Mar 19 '16 at 08:58
  • I think MPI_Reduce is not a synchronizing call, which means the measured time will vary from process to process, because each process completes the broadcast at a different time – voxter Mar 19 '16 at 09:01
  • The reduction will give you the maximum time any process spent between leaving the barrier and the end of the bcast - which is the "total" time in the bcast as you defined it. No additional synchronization needed. – dabo42 Mar 19 '16 at 21:42