Suppose I have 3 processes, and each of them wants to send arrays a0, a1, a2 to processes 0, 1, and 2 respectively.
Given the MPI_Alltoallv interface, that is,
int MPI_Alltoallv(const void *sendbuf, const int *sendcounts, const int *sdispls, MPI_Datatype sendtype,
                  void *recvbuf, const int *recvcounts, const int *rdispls, MPI_Datatype recvtype,
                  MPI_Comm comm)
The interface forces me to concatenate the contents of a0, a1, a2 into a single sendbuf array (possibly in a non-contiguous way, since sdispls allows gaps). But is this really necessary for the implementation of MPI_Alltoallv? I would think that sendbuf eventually has to be re-split into a0, a1, a2 anyway, because in the end the receiving processes are distinct. (The same applies to recvbuf.)
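For concreteness, here is what the current interface forces me to write on each rank (a fragment; n0, n1, n2 and the receive-side arrays are assumed to be set up analogously):

// Packing a0, a1, a2 into one contiguous sendbuf just to satisfy MPI_Alltoallv.
int scounts[3] = {n0, n1, n2};        // number of ints destined to ranks 0, 1, 2
int sdispls[3] = {0, n0, n0 + n1};    // offset of each block inside sendbuf
int *sendbuf = malloc((n0 + n1 + n2) * sizeof(int));
memcpy(sendbuf + sdispls[0], a0, n0 * sizeof(int));   // the extra copies the question is about
memcpy(sendbuf + sdispls[1], a1, n1 * sizeof(int));
memcpy(sendbuf + sdispls[2], a2, n2 * sizeof(int));
// ... same story for recvbuf/rcounts/rdispls, then:
MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_INT,
              recvbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);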
Why isn't the interface the following:
int MPI_Alltoallv(const void **sendbuf, const int *sendcounts, MPI_Datatype sendtype,
                  void **recvbuf, const int *recvcounts, MPI_Datatype recvtype,
                  MPI_Comm comm)
where I would pass, e.g., int* sendbuf[3] = {a0,a1,a2} (and the same for the receiving arrays)?
Is the MPI interface the way it is because:
- Performance? (See just below)
- Fortran compatibility?
- Something else?
- Do these reasons also apply to MPI_Neighbor_alltoallv?
Am I the only one who is annoyed by this "copy into a concatenated array, then MPI will split it anyway, then you will receive another concatenated array that you may have to split again immediately"?
=======================================
Performance: after some research
The "advice to implementors" of the reference manual says "implementations may use a tree communication pattern. Messages can be forwarded by intermediate nodes where they are split (for scatter) or concatenated (for gather), if this is more efficient". This may be why this interface was chosen. On the other hand, I think it would be possible only if the array is contiguous, and here, it is not garanteed, because of
sdispls
.This article mentions the 3 main possible implementations of the alltoall algorithms. Two of them simply are only refinement of the naive send-to-all/recv-from-all algorithm. So, there is actually no need to ask for contiguous data for these two. Only the third one, "Bruck's algorithm", uses a divide-and-conquer approach (which, to my knowledge, needs contiguous data). Depending on the number of procs and the messages size, the 3 perform differently.
Looking at some open-source implementations:
- MVAPICH implements the 3 approaches
- MPICH implements only the first two (from my understanding)
- I don't understand what the OpenMPI implementation is doing ^^'
But it seems that all of them use heuristics to choose between the algorithms, so the "all data in one array" interface should not be mandatory for performance reasons.
=======================================
Side note: a tentative implementation of the non-contiguous interface as a wrapper around MPI_Alltoallv (based on @Gilles Gouaillardet's remarks):
int my_Alltoallv(const void **sendbuf, const int *scounts, MPI_Datatype stype,
                 void **recvbuf, const int *rcounts, MPI_Datatype rtype,
                 MPI_Comm comm) {
  int n_rank;
  MPI_Comm_size(comm, &n_rank);
  int* sdispls = (int*)malloc(n_rank * sizeof(int));
  int* rdispls = (int*)malloc(n_rank * sizeof(int));
  for (int i = 0; i < n_rank; ++i) {
    MPI_Aint sdispls_i;
    MPI_Aint rdispls_i;
    MPI_Get_address(sendbuf[i], &sdispls_i);
    MPI_Get_address(recvbuf[i], &rdispls_i);
    sdispls[i] = sdispls_i; // Warning: narrowing from MPI_Aint to int
    rdispls[i] = rdispls_i; // Warning: narrowing from MPI_Aint to int
  }
  int err = MPI_Alltoallv(MPI_BOTTOM, scounts, sdispls, stype, // is MPI_BOTTOM+sdispls[i] == sendbuf[i] regarding C and MPI?
                          MPI_BOTTOM, rcounts, rdispls, rtype, // is MPI_BOTTOM+rdispls[i] == recvbuf[i] regarding C and MPI?
                          comm);
  free(sdispls);
  free(rdispls);
  return err;
}
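For reference, on the 3-process example from the top, the wrapper would be called like this on each rank (array names and counts below are illustrative):

// Illustrative call of the wrapper above (a0..a2, b0..b2 and the counts are made up).
int* send_arrays[3] = {a0, a1, a2};   // one array per destination rank
int* recv_arrays[3] = {b0, b1, b2};   // one array per source rank
int scounts[3] = {n0, n1, n2};
int rcounts[3] = {m0, m1, m2};
my_Alltoallv((const void**)send_arrays, scounts, MPI_INT,
             (void**)recv_arrays, rcounts, MPI_INT, MPI_COMM_WORLD);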
I think that at least the narrowing from MPI_Aint to int makes it wrong. Plus, I am not sure what is "reachable" from MPI_BOTTOM (because I think C only allows pointer arithmetic within the same array [otherwise it is undefined behavior], and here there is no reason for MPI_BOTTOM and sendbuf[i] to belong to the same array [unless MPI requires that the memory of each process can be viewed as one big array starting at MPI_BOTTOM]). But I'd like to have a confirmation.
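One possible way around the narrowing (an untested sketch, not a confirmed answer): MPI_Alltoallw takes one datatype per rank, so each block's absolute address (as returned by MPI_Get_address) can be stored inside a derived datatype built with MPI_Type_create_hindexed, whose displacements are MPI_Aint, and the collective can then be called on MPI_BOTTOM with all integer displacements set to zero. The name my_Alltoallv_w is of course made up:

// Sketch (untested): same non-contiguous interface, but built on MPI_Alltoallw.
// The absolute address of each block lives inside a derived datatype, so the
// MPI_Aint addresses never get narrowed to int.
int my_Alltoallv_w(const void **sendbuf, const int *scounts, MPI_Datatype stype,
                   void **recvbuf, const int *rcounts, MPI_Datatype rtype,
                   MPI_Comm comm) {
  int n_rank;
  MPI_Comm_size(comm, &n_rank);
  int* zeros = (int*)calloc(n_rank, sizeof(int));   // byte displacements: all 0
  int* ones  = (int*)malloc(n_rank * sizeof(int));  // one derived-type element per rank
  MPI_Datatype* stypes = (MPI_Datatype*)malloc(n_rank * sizeof(MPI_Datatype));
  MPI_Datatype* rtypes = (MPI_Datatype*)malloc(n_rank * sizeof(MPI_Datatype));
  for (int i = 0; i < n_rank; ++i) {
    ones[i] = 1;
    MPI_Aint saddr, raddr;
    MPI_Get_address(sendbuf[i], &saddr);
    MPI_Get_address(recvbuf[i], &raddr);
    // scounts[i] elements of stype, located at the absolute address of sendbuf[i]
    MPI_Type_create_hindexed(1, &scounts[i], &saddr, stype, &stypes[i]);
    MPI_Type_create_hindexed(1, &rcounts[i], &raddr, rtype, &rtypes[i]);
    MPI_Type_commit(&stypes[i]);
    MPI_Type_commit(&rtypes[i]);
  }
  int err = MPI_Alltoallw(MPI_BOTTOM, ones, zeros, stypes,
                          MPI_BOTTOM, ones, zeros, rtypes, comm);
  for (int i = 0; i < n_rank; ++i) {
    MPI_Type_free(&stypes[i]);
    MPI_Type_free(&rtypes[i]);
  }
  free(zeros); free(ones); free(stypes); free(rtypes);
  return err;
}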