
More recently, I have been developing some asynchronous algorithms in my research. While doing some parallel performance studies, I began to suspect that I am not properly understanding some details of the various non-blocking MPI functions.

I've seen some insightful posts on here related to this topic.

There are a few things I am uncertain about, or just want to clarify, related to working with non-blocking functionality that I think will help me potentially increase the performance of my current software.

From the Nonblocking Communication part of the MPI 3.0 standard:

A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed.

...

If the send mode is standard then the send-complete call may return before a matching receive is posted, if the message is buffered. On the other hand, the send-complete may not complete until a matching receive is posted, and the message was copied into the receive buffer.

So, as a first set of questions about MPI_Isend (and similarly MPI_Irecv): to ensure that a non-blocking send finishes, I need some mechanism to check that it is complete, because in the worst case there may be no suitable hardware to transfer the data concurrently, right? So if I never call something like MPI_Test or MPI_Wait after the non-blocking send, the MPI_Isend may never actually get its message out, right?

This question applies to some of my work because I send messages via MPI_Isend and do not actually test for completion until I get the expected response message, since I want to avoid the overhead of MPI_Test calls. While this approach has been working, it now seems faulty based on my reading.

Further, the second paragraph appears to say that the standard non-blocking send, MPI_Isend, may not even begin to send any of its data until the destination process has posted a matching receive. Given the availability of MPI_Probe/MPI_Iprobe, does this mean an MPI_Isend call will at least send out some preliminary metadata of the message, such as size, source, and tag, so that the probe functions on the destination process can know a message wants to be sent there, and so the destination process can actually post a corresponding receive?

Related is a question about the probe. In the Probe and Cancel section, the standard says that

MPI_IPROBE(source, tag, comm, flag, status) returns flag = true if there is a message that can be received and that matches the pattern specified by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program, and returns in status the same value that would have been returned by MPI_RECV(). Otherwise, the call returns flag = false, and leaves status undefined.

Going off of the above passage, it is clear that probing will tell you whether there is an available message you can receive matching the specified source, tag, and comm. My question is: when a probe succeeds, should you assume that the data from the corresponding send has not actually been transferred yet?

After reading the standard, it now seems reasonable to me that a message the probe is aware of need not be a message the local process has actually fully received. Given the previous details about the standard non-blocking send, it seems you need to post a receive after probing to ensure the source's non-blocking standard send will complete, because at times the source may be sending a large message that MPI does not want to copy into some internal buffer, right? Either way, posting the receive after a successful probe seems to be how you ensure you actually get the full data from the corresponding send. Is this correct?

This latter question relates to one place in my code where I make an MPI_Iprobe call and, if it succeeds, perform an MPI_Recv call to get the message. However, I now think this could be problematic, because I had assumed that a successful probe meant the whole message had already arrived. That implied to me that the MPI_Recv would run quickly, since the full message would already be in local memory somewhere. I now suspect this assumption was incorrect, and some clarification would be helpful.

spektr

1 Answer


The MPI standard does not mandate a progress thread. That means MPI_Isend() might do nothing at all until communications are progressed. Progress happens under the hood in most MPI subroutines; MPI_Test(), MPI_Wait(), and MPI_Probe() are the most obvious ones.

I am afraid you are mixing up progress and synchronous send (e.g. MPI_Ssend()).

MPI_Probe() is a local operation: it will not contact the sender and ask whether something was sent, nor progress that send. Probing is not the same as posting a receive.

Performance-wise, you should avoid unexpected messages as much as possible: a receive should be posted on one end before the message is sent by the other end.

There is a trade-off between performance and portability here:

  • if you want to write portable code, then you cannot assume there is an MPI progress thread
  • if you want to optimize your application on a given system, you should try an MPI library that implements a progress thread on the interconnect you are using

Keep in mind that most MPI implementations (read: this is not mandated by the MPI standard, and you should not rely on it) send small messages in eager mode. This means MPI_Send() will likely return immediately if the message is small enough (and "small enough" depends, among other things, on your MPI implementation, how it is tuned, and which interconnect is used).

Gilles Gouaillardet
  • Thank you very much for your response, I think you've given me some useful responses to my questions. Can you explain what you meant when you said I was mixing up the concepts of progress with synchronous send? And probing is largely done to handle the situation when you know you may get unexpected messages, right? Otherwise, you would generally know to just post an appropriate set of receives without the need to probe, right? – spektr Jul 01 '19 at 01:52
  • 1
    if you do not know the max size of a message to be received, you can use `MPI_Probe()`, allocate a buffer with the right size and then `MPI_Recv()`. Otherwise, you might consider reviewing your algorithm. `MPI_Send()` returns when the send buffer can be reused, and there is no guarantee a matching receive was posted, nor that the message left the host. `MPI_Ssend()` returns when the send buffer can be reused **and** the message send was started because of a matching receive. – Gilles Gouaillardet Jul 01 '19 at 02:01
  • Thank you for all this information, it has been quite useful. An example of my situation is that I have built a distributed container analogous to an array. This container distributes the data across a communicator of processes and allows, for example, to request data within this container at some index. Most often, the index will result in a request of data from another process. As I see it, each process needs to probe just to know if some other process is requesting access to some data, since it cannot predict this in advance. Do you see any alternative to this approach? – spektr Jul 01 '19 at 02:25
  • 1
    `MPI` is a good fit for send/recv or collective operations. You would typically `MPI_Scatter()` the data and `MPI_Gather()` the results. Per your description, you might consider onesided operations (`MPI_Put()` and `MPI_Get()`) or non MPI alternatives such as PGAS (OpenSHMEM, Fortran co-arrays, ...). MPI is generally not a fit to act upon events. – Gilles Gouaillardet Jul 01 '19 at 02:41
  • I may have to take a look at the one-sided operations to see if I could make something like that work in my situation, but your last statement certainly gets me wondering if MPI will be good enough from a performance standpoint. Thank you very much for your insights and thoughts! – spektr Jul 01 '19 at 02:46
  • 2
    The answer is correct and good, but I feel some points are a bit unclear. While MPI doesn't mandate a progress *thread*, it does *guarantee progress* (*eventually*). Of course, you don't want a program that makes progress only after a long delay because no MPI call is performed. But its good to know that you do not need to complete the asynchronous send to make progress on that send. For the second part, the progress guarantee (for sending) does in fact depend on *"a matching receive has been started"*. Thus, since`MPI_Probe` does not start a receive, it does not guarantee progress. – Zulan Jul 01 '19 at 11:03
  • @Zulan thank you for your further comments, especially your first point about progress! – spektr Jul 01 '19 at 16:36