1

If a node in the host file goes down, how can I keep working with the remaining nodes using MPI?

  • 2
    This was asked [here](http://stackoverflow.com/questions/5386630/fault-tolerance-in-mpich-openmpi); current MPI implementations are all about performance, and as a result have no fault tolerance to speak of (although individual projects have tried to add error recovery to MPI in different ways). But as clusters get larger, fault tolerance becomes a more important issue; there is a [working group on fault tolerance](https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage), and it's widely expected that the next major MPI release will have something along these lines. – Jonathan Dursi Apr 07 '13 at 15:18

2 Answers

1

A draft proposal for "fault tolerance" was put forward for the MPI 3 Standard. The proposal was not adopted, but the working group continues to make progress. The expectation is that the proposal will be adopted into some future version of the standard.

I am not aware of any open source MPI implementation that offers support for the draft proposal. I am aware of one commercial MPI that does fully implement the draft proposal on fault tolerance (as a disclaimer, that MPI happens to be the one I work on).

Even with the draft proposal, a node-level failure will remain VERY difficult to recover from. The current "cookbook" approach for node-level failures is to use checkpoint/restart together with a job scheduler that restarts the job automatically. If a node fails, the job is re-scheduled onto a different set of nodes and resumes from the last successful checkpoint.

This cookbook approach requires a robust checkpoint/restart infrastructure, a fault-tolerant shared file system, and the active participation of both the application and the MPI implementation in the checkpoint/restart process. Also, not every combination of MPI implementation and application can be restarted on a different set of nodes, so this approach may require recovering the failed node before the job is restarted.
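To make the application side of checkpoint/restart concrete, here is a minimal single-process sketch in Python. It is not MPI code and not tied to any particular scheduler or checkpoint library: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `fail_at` parameter used to simulate a node failure are all hypothetical. In a real MPI job, each rank would typically save its own piece of state to the shared file system, and the scheduler would perform the restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write keeps the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, resuming from the checkpoint if present.

    fail_at is a hypothetical knob that simulates a node failure at that step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first call to `run` dies partway through, calling `run` again with the same checkpoint path picks up from the last saved step instead of starting over. Work done after the last checkpoint is lost and repeated, which is the usual trade-off when choosing the checkpoint interval.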

Stan Graves
1

As the previous posters have said, there isn't any "Standard" way of handling this, but the draft coming out of the Fault Tolerance working group of the MPI Forum is becoming fairly mature. If you'd like to try it out, there is a reference implementation currently available, based on a branch of Open MPI. Other implementations that include the draft are expected soon, but for now your only open source option is available at http://www.fault-tolerance.org. You can download the implementation there, along with a version of the draft standard and a few examples to get started. There's a mailing list there as well if you have questions.

Wesley Bland