
I'm new to MPI and I'm trying to understand how MPI (and specifically Open MPI) works in order to reason about the performance of my system.

I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.

Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start sending its own?
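
For reference, a stripped-down version of what I am running looks roughly like this (sizes and variable names are just placeholders):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes one small chunk of data. */
    const int chunk = 4;
    int sendbuf[4] = { rank, rank, rank, rank };

    int *recvbuf = NULL;
    if (rank == 0)  /* only the root needs a receive buffer */
        recvbuf = malloc((size_t)size * chunk * sizeof(int));

    /* Rank 0 gathers one chunk from every rank, including itself. */
    MPI_Gather(sendbuf, chunk, MPI_INT, recvbuf, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size * chunk; i++)
            printf("%d ", recvbuf[i]);
        printf("\n");
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}
```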

Thanks!

Joshua Gevirtz

2 Answers


There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.

What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:

  • linear with synchronisation - used when messages are mid-size to large
  • binomial - used when the number of processes is large or the message size is small
  • basic linear - used in all other cases

The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.

The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
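
Schematically, and only as a sketch of the idea rather than the actual Open MPI source (the function name and tag are made up), the basic linear algorithm boils down to:

```c
#include <mpi.h>
#include <string.h>

/* Sketch of a basic linear gather built on point-to-point operations:
 * the root posts blocking receives in strictly increasing rank order,
 * so rank 2's chunk is only received after rank 1's. */
static void gather_basic_linear(const void *sendbuf, void *recvbuf, int count,
                                MPI_Datatype dtype, int root, MPI_Comm comm)
{
    int rank, nprocs, tag = 12345;
    MPI_Aint lb, extent;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    MPI_Type_get_extent(dtype, &lb, &extent);

    if (rank != root) {
        MPI_Send(sendbuf, count, dtype, root, tag, comm);
        return;
    }

    for (int r = 0; r < nprocs; r++) {
        char *dst = (char *)recvbuf + (MPI_Aint)r * count * extent;
        if (r == root)
            memcpy(dst, sendbuf, count * extent);  /* root copies its own chunk */
        else
            MPI_Recv(dst, count, dtype, r, tag, comm, MPI_STATUS_IGNORE);
    }
}
```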

The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
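
Again purely as an illustration of the description above, not of the real implementation (it assumes contiguous int chunks and made-up tags), the pattern looks roughly like this:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of a segmented gather: every rank's buffer of `count` elements is
 * split into two halves. The second halves are pre-posted as non-blocking
 * receives (so they may arrive in any order), while the first halves are
 * received one rank at a time, as in the basic linear algorithm. */
static void gather_linear_sync_sketch(const int *sendbuf, int *recvbuf,
                                      int count, int root, MPI_Comm comm)
{
    int rank, nprocs, half = count / 2;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    if (rank != root) {
        MPI_Send(sendbuf, half, MPI_INT, root, 1, comm);                /* first piece  */
        MPI_Send(sendbuf + half, count - half, MPI_INT, root, 2, comm); /* second piece */
        return;
    }

    MPI_Request *reqs = malloc((size_t)nprocs * sizeof(MPI_Request));
    for (int r = 0; r < nprocs; r++) {
        reqs[r] = MPI_REQUEST_NULL;
        if (r != root)  /* non-blocking receives for the second pieces */
            MPI_Irecv(recvbuf + r * count + half, count - half, MPI_INT,
                      r, 2, comm, &reqs[r]);
    }
    for (int r = 0; r < nprocs; r++) {  /* first pieces, in rank order */
        if (r == root) {
            for (int i = 0; i < count; i++)
                recvbuf[r * count + i] = sendbuf[i];
        } else {
            MPI_Recv(recvbuf + r * count, half, MPI_INT, r, 1, comm,
                     MPI_STATUS_IGNORE);
        }
    }
    MPI_Waitall(nprocs, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```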

The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
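
A rough sketch of the shape of such a tree gather, assuming root 0 and contiguous int chunks (again not the actual Open MPI code), could look like:

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative binomial-tree gather with root 0. In every round a process
 * either receives the aggregated chunks of one child subtree or forwards
 * everything it has collected so far to its parent and is done. */
static void gather_binomial_sketch(const int *sendbuf, int *recvbuf,
                                   int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Non-root processes need a temporary buffer for their subtree. */
    int *tmp = (rank == 0) ? recvbuf
                           : malloc((size_t)nprocs * count * sizeof(int));
    memcpy(tmp, sendbuf, count * sizeof(int));  /* own chunk at offset 0 */
    int have = 1;                               /* chunks collected so far */

    for (int mask = 1; mask < nprocs; mask <<= 1) {
        if (rank & mask) {
            /* Forward the aggregated chunks to the parent and stop. */
            MPI_Send(tmp, have * count, MPI_INT, rank - mask, 0, comm);
            break;
        }
        if (rank + mask < nprocs) {
            /* Receive the chunks aggregated by the child subtree. */
            int nchunks = (rank + 2 * mask <= nprocs) ? mask
                                                      : nprocs - rank - mask;
            MPI_Recv(tmp + have * count, nchunks * count, MPI_INT,
                     rank + mask, 0, comm, MPI_STATUS_IGNORE);
            have += nchunks;
        }
    }
    if (rank != 0)
        free(tmp);
}
```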

You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.

Hristo Iliev

Hristo's answer is of course excellent, but I would like to offer a different point of view.

Contrary to your expectation, the question is not simple. It isn't even possible to answer it definitively without knowing more about the specific system, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.

First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirement at the root, grows linearly with the number of processes in the communicator. This naturally limits scalability.
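
For example, with P ranks each contributing m bytes, the root receives (P - 1) * m bytes over the network and needs about P * m bytes of receive buffer; at P = 1000 and m = 1 MiB that is already roughly 1 GiB converging on a single node.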

Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.

Now when you have your program, you should measure and see where time is spent - or wasted. For that you should use an MPI performance analysis tool. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it. To do so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing the communication altogether or by using a clever reduction instead. If you still need to stick with the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.
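
If you just want a rough first number before reaching for a full analysis tool, a minimal sketch along these lines (buffer sizes are placeholders) already tells you how long the call takes:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;  /* placeholder message size */
    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = (rank == 0) ? malloc((size_t)size * count * sizeof(int)) : NULL;
    for (int i = 0; i < count; i++)
        sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);  /* align the start times */
    double t0 = MPI_Wtime();
    MPI_Gather(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, 0, MPI_COMM_WORLD);
    double local = MPI_Wtime() - t0;

    double max;  /* the slowest rank dominates */
    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Gather took %.6f s (max over %d ranks)\n", max, size);

    MPI_Finalize();
    return 0;
}
```

Such a micro-benchmark only shows the cost of the call in isolation; a real performance analysis tool will also show where the gather sits relative to the rest of your application.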

Zulan