Suppose I have 3 processes, and each of them wants to send arrays a0, a1, a2 to processes 0, 1, and 2 respectively.
Given the MPI_Alltoallv interface, that is,
int MPI_Alltoallv(const void *sendbuf, const int *sendcounts, const int *sdispls, MPI_Datatype sendtype,
                  void *recvbuf, const int *recvcounts, const int *rdispls, MPI_Datatype recvtype,
                  MPI_Comm comm)
The interface forces me to concatenate the contents of a0, a1, a2 into a single sendbuf array (possibly in a non-contiguous way, since sdispls allows gaps). But is this really necessary for the implementation of MPI_Alltoallv? I would think that sendbuf eventually has to be re-split into a0, a1, a2 anyway, because in the end the receiving processes are distinct. (The same applies to recvbuf.)
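For concreteness, here is what the current interface forces me to write on each rank (a fragment; n0, n1, n2 and the receive-side arrays are assumed to be set up analogously):

// Packing a0, a1, a2 into one contiguous sendbuf just to satisfy MPI_Alltoallv.
int scounts[3] = {n0, n1, n2};        // number of ints destined to ranks 0, 1, 2
int sdispls[3] = {0, n0, n0 + n1};    // offset of each block inside sendbuf
int *sendbuf = malloc((n0 + n1 + n2) * sizeof(int));
memcpy(sendbuf + sdispls[0], a0, n0 * sizeof(int));   // the extra copies the question is about
memcpy(sendbuf + sdispls[1], a1, n1 * sizeof(int));
memcpy(sendbuf + sdispls[2], a2, n2 * sizeof(int));
// ... same story for recvbuf/rcounts/rdispls, then:
MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_INT,
              recvbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);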
Why isn't the interface the following:
int MPI_Alltoallv(const void **sendbuf, const int *sendcounts, MPI_Datatype sendtype,
                  void **recvbuf, const int *recvcounts, MPI_Datatype recvtype,
                  MPI_Comm comm)
where I would pass, e.g., int* sendbuf[3] = {a0,a1,a2} (and the same for the receiving arrays)?
Is the MPI interface the way it is because:
- Performance? (See just below)
- Fortran compatibility?
- Something else?
- Do these reasons also apply to MPI_Neighbor_alltoallv?
Am I the only one who is annoyed by this "copy into a concatenated array, then MPI will split it anyway, then you will receive another concatenated array that you may have to split again immediately"?
=======================================
Performance: after some research
The "advice to implementors" of the reference manual says "implementations may use a tree communication pattern. Messages can be forwarded by intermediate nodes where they are split (for scatter) or concatenated (for gather), if this is more efficient". This may be why this interface was chosen. On the other hand, I think it would be possible only if the array is contiguous, and here, it is not garanteed, because of
sdispls
.This article mentions the 3 main possible implementations of the alltoall algorithms. Two of them simply are only refinement of the naive send-to-all/recv-from-all algorithm. So, there is actually no need to ask for contiguous data for these two. Only the third one, "Bruck's algorithm", uses a divide-and-conquer approach (which, to my knowledge, needs contiguous data). Depending on the number of procs and the messages size, the 3 perform differently.
Looking at some open-source implementations:
- MVAPICH implements the 3 approaches
- MPICH implements only the first two (from my understanding)
- I don't understand what the OpenMPI implementation is doing ^^'
But it seems that all of them use heuristics to choose between the algorithms, so the "all data in one array" interface should not be mandatory for performance reasons.
=======================================
Side note: a tentative implementation of the non-contiguous interface as a wrapper around MPI_Alltoallv (based on @Gilles Gouaillardet's remarks):
int my_Alltoallv(const void **sendbuf, const int *scounts, MPI_Datatype stype,
                 void **recvbuf, const int *rcounts, MPI_Datatype rtype,
                 MPI_Comm comm) {
  int n_rank;
  MPI_Comm_size(comm, &n_rank);
  int* sdispls = (int*)malloc(n_rank * sizeof(int));
  int* rdispls = (int*)malloc(n_rank * sizeof(int));
  for (int i = 0; i < n_rank; ++i) {
    MPI_Aint sdispls_i;
    MPI_Aint rdispls_i;
    MPI_Get_address(sendbuf[i], &sdispls_i);
    MPI_Get_address(recvbuf[i], &rdispls_i);
    sdispls[i] = sdispls_i; // Warning: narrowing from MPI_Aint to int
    rdispls[i] = rdispls_i; // Warning: narrowing from MPI_Aint to int
  }
  int err = MPI_Alltoallv(MPI_BOTTOM, scounts, sdispls, stype, // is MPI_BOTTOM+sdispls[i] == sendbuf[i] regarding C and MPI?
                          MPI_BOTTOM, rcounts, rdispls, rtype, // is MPI_BOTTOM+rdispls[i] == recvbuf[i] regarding C and MPI?
                          comm);
  free(sdispls);
  free(rdispls);
  return err;
}
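For reference, on the 3-process example from the top, the wrapper would be called like this on each rank (array names and counts below are illustrative):

// Illustrative call of the wrapper above (a0..a2, b0..b2 and the counts are made up).
int* send_arrays[3] = {a0, a1, a2};   // one array per destination rank
int* recv_arrays[3] = {b0, b1, b2};   // one array per source rank
int scounts[3] = {n0, n1, n2};
int rcounts[3] = {m0, m1, m2};
my_Alltoallv((const void**)send_arrays, scounts, MPI_INT,
             (void**)recv_arrays, rcounts, MPI_INT, MPI_COMM_WORLD);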
I think that at least the narrowing from MPI_Aint to int makes it wrong. Plus, I am not sure what is "reachable" from MPI_BOTTOM (because I think C only allows pointer arithmetic within the same array [otherwise it is undefined behavior], and here there is no reason for MPI_BOTTOM and sendbuf[i] to belong to the same array [unless MPI requires that the memory of each process can be viewed as one big array starting at MPI_BOTTOM]). But I'd like to have a confirmation.
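One possible way around the narrowing (an untested sketch, not a confirmed answer): MPI_Alltoallw takes one datatype per rank, so each block's absolute address (as returned by MPI_Get_address) can be stored inside a derived datatype built with MPI_Type_create_hindexed, whose displacements are MPI_Aint, and the collective can then be called on MPI_BOTTOM with all integer displacements set to zero. The name my_Alltoallv_w is of course made up:

// Sketch (untested): same non-contiguous interface, but built on MPI_Alltoallw.
// The absolute address of each block lives inside a derived datatype, so the
// MPI_Aint addresses never get narrowed to int.
int my_Alltoallv_w(const void **sendbuf, const int *scounts, MPI_Datatype stype,
                   void **recvbuf, const int *rcounts, MPI_Datatype rtype,
                   MPI_Comm comm) {
  int n_rank;
  MPI_Comm_size(comm, &n_rank);
  int* zeros = (int*)calloc(n_rank, sizeof(int));   // byte displacements: all 0
  int* ones  = (int*)malloc(n_rank * sizeof(int));  // one derived-type element per rank
  MPI_Datatype* stypes = (MPI_Datatype*)malloc(n_rank * sizeof(MPI_Datatype));
  MPI_Datatype* rtypes = (MPI_Datatype*)malloc(n_rank * sizeof(MPI_Datatype));
  for (int i = 0; i < n_rank; ++i) {
    ones[i] = 1;
    MPI_Aint saddr, raddr;
    MPI_Get_address(sendbuf[i], &saddr);
    MPI_Get_address(recvbuf[i], &raddr);
    // scounts[i] elements of stype, located at the absolute address of sendbuf[i]
    MPI_Type_create_hindexed(1, &scounts[i], &saddr, stype, &stypes[i]);
    MPI_Type_create_hindexed(1, &rcounts[i], &raddr, rtype, &rtypes[i]);
    MPI_Type_commit(&stypes[i]);
    MPI_Type_commit(&rtypes[i]);
  }
  int err = MPI_Alltoallw(MPI_BOTTOM, ones, zeros, stypes,
                          MPI_BOTTOM, ones, zeros, rtypes, comm);
  for (int i = 0; i < n_rank; ++i) {
    MPI_Type_free(&stypes[i]);
    MPI_Type_free(&rtypes[i]);
  }
  free(zeros); free(ones); free(stypes); free(rtypes);
  return err;
}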