
I'm using the MPI.NET library, and I've recently moved my application to a bigger cluster (more COMPUTE-NODES). I've started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job will complete; the rest of the time it hangs. I've seen it happen with Scatter, Broadcast, and Barrier.

I've put an MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application, and created trace logs (using the MPIEXEC.exe /trace switch).

C# code snippet:

// NOTE: the file has "using MPI;" at the top; Logger is our own logging wrapper.
static void Main(string[] args)
{
    var hostName = System.Environment.MachineName;
    Logger.Trace($"Program.Main entered on {hostName}");
    string[] mpiArgs = null;
    MPI.Environment myEnvironment = null;
    try
    {
        Logger.Trace($"Trying to instantiated on MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
        myEnvironment = new MPI.Environment(ref mpiArgs);
        Logger.Trace($"Is currently initialized?{MPI.Environment.Initialized}. {hostName} is waiting at Barrier... ");
        Communicator.world.Barrier(); // CODE HANGS HERE!
        Logger.Trace($"{hostName} is past Barrier");
    }
    catch (Exception envEx)
    {
        Logger.Error(envEx, "Could not instantiate MPI.Environment object");
    }

    // rest of implementation here...

}
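
For reference, the MPI.NET tutorials wrap initialization in a using block so the environment is always disposed (and MPI finalized) on exit. A minimal, self-contained equivalent of the snippet above would be the sketch below (class name and Console output are just placeholders for my Logger calls):

using System;
using MPI;

static class BarrierRepro
{
    static void Main(string[] args)
    {
        // Disposing the environment at the end of the using block finalizes MPI.
        using (new MPI.Environment(ref args))
        {
            var comm = Communicator.world;
            Console.WriteLine($"Rank {comm.Rank} of {comm.Size} on {System.Environment.MachineName} is waiting at Barrier...");
            comm.Barrier(); // hangs here intermittently on multi-node runs
            Console.WriteLine($"Rank {comm.Rank} is past Barrier");
        }
    }
}

Functionally this only differs from my snippet in that disposal is guaranteed by the using block.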

I can see msmpi.dll's MPI_Barrier function being called in the log, and I can see messages being sent and received thereafter, for both a passing and a failing example. In the passing example, messages are sent/received and then the MPI_Barrier Leave event is logged.

In the failing example it looks like one (or more) of the sent messages is lost - it is never received by the target. Am I correct in thinking that messages lost within the MPI_Barrier call mean the processes never synchronize, and therefore all get stuck at the Communicator.world.Barrier() call?
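
To illustrate what I mean, the sketch below is a deliberately naive barrier built from point-to-point messages (this is not what msmpi.dll actually does internally, just an illustration of the reasoning): if a single "arrived" message is lost, rank 0 blocks in its Receive loop, the release messages are never sent, and every rank stays stuck.

using MPI;

static class BarrierSketch
{
    // Naive gather-then-release "barrier", for illustration only.
    // One lost "arrived" message leaves rank 0 blocked in Receive,
    // so the releases are never sent and every rank hangs.
    public static void NaiveBarrier(Intracommunicator comm)
    {
        const int tag = 99;
        if (comm.Rank == 0)
        {
            for (int source = 1; source < comm.Size; source++)
                comm.Receive<int>(source, tag);   // wait for each rank to arrive
            for (int dest = 1; dest < comm.Size; dest++)
                comm.Send(0, dest, tag);          // release everyone
        }
        else
        {
            comm.Send(0, 0, tag);                 // tell rank 0 we've arrived
            comm.Receive<int>(0, tag);            // block until rank 0 releases us
        }
    }
}

The real implementation uses smarter message patterns, but the point is the same: the collective only completes if every internal message is delivered.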

What could be causing this to happen intermittently? Could poor network performance between the COMPUTE-NODES be a cause?
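
One way I can try to narrow this down is a plain point-to-point check immediately before the Barrier call, something like the sketch below (assuming it sits in the same class as Main and reuses my Logger wrapper): every non-zero rank sends its host name to rank 0, which logs each arrival.

private static void PingRootCheck(Intracommunicator comm)
{
    const int pingTag = 7;
    var hostName = System.Environment.MachineName;
    if (comm.Rank == 0)
    {
        // If this loop stalls at a particular rank, that rank/node pair is losing traffic.
        for (int source = 1; source < comm.Size; source++)
        {
            var remoteHost = comm.Receive<string>(source, pingTag);
            Logger.Trace($"Rank 0 ({hostName}) received ping from rank {source} ({remoteHost})");
        }
    }
    else
    {
        Logger.Trace($"Rank {comm.Rank} ({hostName}) sending ping to rank 0");
        comm.Send(hostName, 0, pingTag);
    }
}

If this hangs too, then plain point-to-point delivery between specific nodes is failing, not the Barrier implementation itself.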

I'm running MS HPC Pack 2008 R2, so the version of MS-MPI is pretty old, v2.0.

EDIT - Additional information: If I keep a task within a single node, this issue does not happen. For example, if I run a task using 8 cores on one node it's fine, but if I use 9 cores across two nodes I'll see this issue ~50% of the time.

Also, we have two clusters in use and this only happens on one of them. They are both virtualized environments, but appear to be set up identically.

mattyB
  • Do you have some `Send` calls before the barrier whose matching `Receive` calls are made after the barrier? If so, the send call could block waiting for the receiver rather than using an 'eager' protocol that doesn't wait, and hence the sender never reaches the barrier call at all. – Phil Miller Mar 21 '17 at 16:10
  • More generally, have you confirmed from your logs that *every* process is making the barrier call? – Phil Miller Mar 21 '17 at 16:11
  • @Novelocrat, I've added a snippet of my code (it's a mess, debugging everywhere); as you can see, no Send/Receive/collective functions are called before the Barrier call. I've looked at the logs for all my processes, and they all end with the log message "Is currently initialized?True. myHostName is waiting at Barrier... " – mattyB Mar 21 '17 at 16:22
  • I've got nothing then. Sorry. – Phil Miller Mar 21 '17 at 16:30
  • @Novelocrat No worries, :-) – mattyB Mar 21 '17 at 16:31

0 Answers