There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned
component of the coll
framework as that is what Open MPI uses by default. tuned
implements all collectives using point-to-point operations and provides several algorithms for gather:
- linear with synchronisation - used when messages are large to mid-size
- binomial - used when the number of processes is large or the message size is small
- basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
You can find the source code of the tuned
module in the ompi/mca/coll/tuned
folder of the Open MPI source tree. In the development branch, part of the tuned
component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base
instead.