Below is a very basic MPI program:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char * argv[]) {
    int rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello from %d\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Goodbye from %d\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello 2 from %d\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Goodbye 2 from %d\n", rank);
    MPI_Finalize();
    return 0;
}

You would expect the output to be (for 2 processes):

Hello from 0
Hello from 1
Goodbye from 1
Goodbye from 0
Hello 2 from 1
Hello 2 from 0
Goodbye 2 from 0
Goodbye 2 from 1

Or something similar (the hellos and goodbyes should be grouped, but process order is not guaranteed).

Here is my actual output:

Hello from 0
Goodbye from 0
Hello 2 from 0
Goodbye 2 from 0
Hello from 1
Goodbye from 1
Hello 2 from 1
Goodbye 2 from 1

Am I fundamentally misunderstanding what MPI_Barrier is supposed to do? From what I can tell, if I use it only once, then it gives me expected results, but any more than that and it seems to do absolutely nothing.

I realize many similar questions have been asked before, but the askers in the questions I viewed misunderstood the function of MPI_Barrier.


1 Answer

the hellos and goodbyes should be grouped

They need not be grouped: there is additional buffering inside the printf function (which happens asynchronously to MPI), and another layer of buffering where the stdout of multiple MPI processes is gathered to a single user terminal.

printf just writes into an in-memory buffer of libc (glibc), which is flushed to the real file descriptor (stdout) only from time to time; use fflush to flush the buffer explicitly. fprintf(stderr, ...) usually has less buffering than stdout.
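
As an illustration of the fflush point above (a minimal sketch, not code from the question): you can either switch stdout to unbuffered mode with setvbuf, or keep buffering and flush explicitly after every message. That removes the libc-level reordering, although mpirun/ssh can still interleave lines from different ranks.

#include <stdio.h>

int main(void) {
    /* Option 1: make stdout unbuffered for the rest of the run. */
    setvbuf(stdout, NULL, _IONBF, 0);
    printf("reaches the file descriptor immediately\n");

    /* Option 2: keep the default buffering but flush after each message. */
    printf("flushed explicitly right below\n");
    fflush(stdout);

    return 0;
}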

Remote tasks are started by mpirun/mpiexec, usually over an ssh remote shell, which forwards stdout/stderr. ssh (and TCP as well) will buffer data, and reordering can occur when the data from ssh is displayed at your terminal by mpirun/mpiexec or another entity (several data streams are multiplexed into one).

What you get is that the 4 strings from the first process are buffered and flushed at its exit (all of them were printed to stdout, which usually has a buffer of several kilobytes), and the 4 strings from the second process are buffered until its exit as well. Each group of 4 strings is sent by ssh (or whatever launch method is used) to your console as a single "packet", and your console simply shows the two packets of 4 lines each in some order, either "4_lines_packet_from_id_0; 4_lines_packet_from_id_1" or "4_lines_packet_from_id_1; 4_lines_packet_from_id_0".

Am I fundamentally misunderstanding what MPI_Barrier is supposed to do?

MPI_Barrier synchronizes parts of your code, but it cannot disable any buffering in the libc/glibc printing and file I/O functions, nor in ssh or another remote shell.

If all of your processes run on machines with synchronized system clocks (they will be if everything runs on a single machine, and they should be if ntpd runs on your cluster), you can add a timestamp field to every printf to check that the real order respects your barriers (gettimeofday reads the current time and adds no extra buffering). You can then sort the timestamped output even if printf and ssh reorder the messages.

#include <mpi.h>
#include <sys/time.h>
#include <stdio.h>

/* Prints the message prefixed with the current wall-clock time and the MPI rank,
   so the collected output can be sorted afterwards. */
void print_timestamped_message(int mpi_rank, const char *s)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    printf("[%ld.%06ld](%d): %s\n", (long)now.tv_sec, (long)now.tv_usec, mpi_rank, s);
}

int main(int argc, char * argv[]) {
    int rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    print_timestamped_message(rank, "First Hello");
    MPI_Barrier(MPI_COMM_WORLD);
    print_timestamped_message(rank, "First Goodbye");
    MPI_Barrier(MPI_COMM_WORLD);
    print_timestamped_message(rank, "Second Hello");
    MPI_Barrier(MPI_COMM_WORLD);
    print_timestamped_message(rank, "Second Goodbye");
    MPI_Finalize();
    return 0;
}
osgx
  • I don't think that assumption about `gettimeofday` would hold on many distributed systems. NTP is by no means accurate enough. If you must make some ordering, just increment a counter on each barrier. – Zulan Mar 12 '17 at 08:36
  • Zulan, how fast is Barrier and how accurate is NTP in typical clusters? – osgx Mar 12 '17 at 11:26
  • 1
    Barrier is typically 10-100 us (http://ieeexplore.ieee.org/document/5493494/), NTP has around 1ms under ideal conditions (https://www.eecis.udel.edu/~mills/exec.html). – Zulan Mar 12 '17 at 15:36
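
Following up on Zulan's counter suggestion, here is a minimal sketch (not part of the original answer): each process increments a phase counter at every barrier and prefixes its messages with it, so the captured output can be sorted by that field without relying on synchronized clocks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int phase = 0;   /* incremented once per barrier, identical on every rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD); phase++;
    printf("[phase %d](%d): Hello\n", phase, rank);
    MPI_Barrier(MPI_COMM_WORLD); phase++;
    printf("[phase %d](%d): Goodbye\n", phase, rank);

    MPI_Finalize();
    return 0;
}

Sorting the collected output by the phase field then reproduces the barrier ordering no matter how printf or ssh interleaved the lines.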