This is a topic that pops up at least once per week here. I'd normally close your question as a duplicate, but since you are using std::vector
and not raw pointers, I guess you deserve a more elaborate answer that touches on a less known but very powerful feature of MPI, namely its datatype system.
First of all, std::vector<std::vector<T>>
is not a contiguous type. Think about how it is laid out in memory. std::vector<T>
itself is usually implemented as a structure that contains a pointer to an array allocated on the heap and a bunch of bookkeeping information, such as the array capacity and its current size. A vector of vectors is then a structure that contains a pointer to a heap array of structures, each containing a pointer to yet another heap array:
p1  [ data ] --->  p1[0] [ data ] ---> [ p1[0][0] | p1[0][1] | ... ]
    [ size ]             [ size ]
    [ cap. ]             [ cap. ]
                   p1[1] [ data ] ---> [ p1[1][0] | p1[1][1] | ... ]
                         [ size ]
                         [ cap. ]
                   ...
That's two levels of pointer indirection needed just to reach the data for a given row. When you write p1[i][j]
, the compiler expands the code of std::vector<T>::operator[]()
twice and in the end produces the pointer arithmetic and dereference expression that gives the address of that particular matrix element.
MPI is not a compiler extension. It is also not some sort of esoteric template library. It knows nothing of the internal structure of C++ container objects. It is simply a communications library providing just a thin level of abstraction that works in both C and Fortran. At the time MPI was conceived, Fortran didn't even have mainstream support for user-defined aggregate types (structures in C/C++), so the MPI API is heavily array-centric. That said, it doesn't mean MPI can only send arrays. On the contrary, it has a very complex type system that allows you to send arbitrarily shaped objects, if you are willing to invest the additional time and code to do so. Let's examine the different approaches possible.
MPI will happily send data from or receive it into contiguous memory areas. And it is not hard to convert your non-contiguous memory layout into a contiguous one. Instead of an NxN-shaped std::vector<std::vector<T>>
, you create a flat std::vector<T>
of size N² and then build an auxiliary array of pointers into the flat one:
vector<int> mat_data;    // flat storage for all N*N elements
vector<int *> mat;       // pointers to the beginning of each row
mat_data.resize(N*N);
for (int i = 0; i < N; i++)
    mat.push_back(&mat_data[0] + i*N);
You may wish to encapsulate this in a new Matrix2D
class. With that arrangement, you can still use mat[i][j]
to refer to matrix elements, but now all rows are neatly packed one after another in memory. If you wish to send such an object, you simply call:
MPI_Send(mat[0], N*N, MPI_INT, ...);
If you have already allocated an NxN matrix on the receiver side, simply do:
MPI_Recv(mat[0], N*N, MPI_INT, ...);
If you haven't allocated a matrix and you would like to be able to receive arbitrarily sized square matrices, do:
MPI_Status status;
// Probe for a message
MPI_Probe(..., &status);
// Get message size in number of integers
int nelems;
MPI_Get_count(&status, MPI_INT, &nelems);
N = static_cast<int>(sqrt(nelems));
// Allocate an NxN matrix mat as shown above
// Receive the message
MPI_Recv(mat[0], N*N, MPI_INT, status.MPI_SOURCE, status.MPI_TAG, ...);
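Going back to the Matrix2D suggestion above, a minimal sketch of such a wrapper could look like the following. Matrix2D is just an illustrative name and only the parts needed for the MPI calls are shown:
#include <vector>

class Matrix2D {
public:
    explicit Matrix2D(int n) : n_(n), data_(n * n) {}

    // mat[i] returns a pointer to row i, so mat[i][j] still works
    int *operator[](int i) { return data_.data() + i * n_; }
    const int *operator[](int i) const { return data_.data() + i * n_; }

    // contiguous storage to hand to MPI_Send/MPI_Recv
    int *data() { return data_.data(); }
    int size() const { return n_; }

private:
    int n_;
    std::vector<int> data_;
};
With this, mat[i][j] still works, and mat[0] or mat.data() is the contiguous buffer to pass to MPI_Send and MPI_Recv.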
Unfortunately, it is not always possible to simply swap vector<vector<T>>
for the flat array type, especially if you call into external libraries that you have no control over. In that case, you have two further options.
When the matrices are small, it is quite feasible to manually pack and unpack their data for the purpose of communicating it:
std::vector<int> p1_flat;
p1_flat.reserve(p1.size() * p1.size());   // square matrix: reserve room for N*N elements
for (auto const &row : p1)
    std::copy(row.begin(), row.end(), std::back_inserter(p1_flat));
MPI_Send(&p1_flat[0], ...);
On the receiver side, you do the opposite.
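For instance, a sketch of that unpacking, assuming p1 on the receiving rank has already been resized to N rows of N elements each:
// Receive into a flat buffer first
std::vector<int> p1_flat(N * N);
MPI_Recv(&p1_flat[0], N * N, MPI_INT, ...);

// Unpack the flat buffer row by row into the vector of vectors
for (int i = 0; i < N; i++)
    std::copy(p1_flat.begin() + i * N, p1_flat.begin() + (i + 1) * N, p1[i].begin());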
When the matrices are large, packing and unpacking becomes a time- and memory-consuming activity. Fortunately, MPI has provisions that allow you to skip that part and let it do the packing for you. Since, as mentioned earlier, MPI is just a simple communications library, it doesn't automatically understand language types; instead it uses hints in the form of MPI datatypes in order to handle the underlying language types properly. An MPI datatype is something like a recipe that tells MPI where and how to access data in memory. It is a collection of tuples in the form (offset, primitive type)
:
- offset tells MPI where the respective piece of data is located relative to the address given to functions such as MPI_Send()
- primitive type tells MPI what the primitive language datatype at that particular offset is
The simplest MPI datatypes are the predefined ones that correspond to the language scalar types. For example, MPI_INT
is under the hood the tuple (0, int)
, which tells MPI to treat the memory located directly at the address provided as an instance of int
. When you tell MPI that you are actually sending a whole array of MPI_INT
, it knows that it needs to take one element from the buffer location, then step forward in memory the size of int
, take another one, and so on. This is not very much unlike the way data serialisation libraries for C++ work. And just like you can build aggregate types from simpler ones in C++, MPI allows you to construct complex datatypes from simpler ones. For example, the [(0, int), (16, float)]
datatype tells MPI to take an int
from the buffer address and a float
from 16 bytes past the buffer address.
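For concreteness, here is roughly how that made-up [(0, int), (16, float)] datatype could be expressed with the struct-type constructor covered below:
int          blocklens[2] = {1, 1};
MPI_Aint     displs[2]    = {0, 16};               // offsets in bytes
MPI_Datatype types[2]     = {MPI_INT, MPI_FLOAT};
MPI_Datatype example_type;

MPI_Type_create_struct(2, blocklens, displs, types, &example_type);
MPI_Type_commit(&example_type);
// example_type now describes an int at offset 0 and a float at offset 16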
There are two ways to construct datatypes. You can either create an array of simpler types, repeating a certain access pattern (which allows you to also specify uniform gaps in that pattern), or you can create a structure of arbitrary simpler datatypes. What you need is the latter. You need to be able to tell MPI the following: "Listen. I have those N
arrays I want to send/receive, but they are scattered unpredictably all over the heap. Here are their addresses. Do something to concatenate them and send/receive them as a single message." And you tell it this by constructing a struct datatype using MPI_Type_create_struct
.
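As an aside, the former kind of constructor corresponds to routines such as MPI_Type_vector. A tiny sketch that reuses the flat NxN matrix mat from the first approach and describes a single matrix column (column_type is an illustrative name):
MPI_Datatype column_type;
MPI_Type_vector(N, 1, N, MPI_INT, &column_type);  // N blocks of 1 int, spaced N ints apart
MPI_Type_commit(&column_type);

MPI_Send(mat[0], 1, column_type, ...);            // sends the first column only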
The struct datatype constructor takes four input arguments:
- int count - the number of blocks (components) in the new datatype, which in your case is p.size() (p is one of those vector<vector<int>>s);
- int array_of_blocklengths[] - again, due to the array nature of MPI, structured datatypes are actually structures of arrays (blocks) of simpler datatypes; here you have to specify an array with elements set to the size of the corresponding row;
- MPI_Aint array_of_displacements[] - for each block, MPI needs to know its location relative to the address of the data buffer; the displacements can be positive or negative and it is easiest to simply pass the addresses of all the row arrays here;
- MPI_Datatype array_of_types[] - the datatype in each block of the structure; you need to pass an array with all elements set to MPI_INT.
In code:
// Block lengths
vector<int> block_lengths;
// Block displacements
vector<MPI_Aint> block_displacements;
// Block datatypes
vector<MPI_Datatype> block_dtypes(p.size(), MPI_INT);

for (auto const &row : p) {
    block_lengths.push_back(row.size());
    // Use the absolute address of each row's data as its displacement
    MPI_Aint row_addr;
    MPI_Get_address(&row[0], &row_addr);
    block_displacements.push_back(row_addr);
}

// Create the datatype
MPI_Datatype my_matrix_type;
MPI_Type_create_struct(p.size(), block_lengths.data(), block_displacements.data(),
                       block_dtypes.data(), &my_matrix_type);
// Commit the datatype to make it usable for communication
MPI_Type_commit(&my_matrix_type);
The last step tells MPI that the newly created datatype will be used for communication. If it is only an intermediate step in building an even more complex datatype, the commit step can be omitted.
We can now use my_matrix_type
to send the data in p
:
MPI_Send(MPI_BOTTOM, 1, my_matrix_type, ...);
What the heck is MPI_BOTTOM
? That is the bottom of the address space, basically 0
on many platforms. On most systems it is the same as NULL
or nullptr
, but free from the semantics of being a pointer to nowhere. We use MPI_BOTTOM
here since in the previous step we used the address of each array as the offset of the corresponding block. We can instead subtract the address of the first row:
MPI_Aint base_addr, row_addr;
MPI_Get_address(&p[0][0], &base_addr);
for (auto const &row : p) {
    block_lengths.push_back(row.size());
    MPI_Get_address(&row[0], &row_addr);
    // Byte displacement of this row's data relative to the first row
    block_displacements.push_back(row_addr - base_addr);
}
We then send p
using the following:
MPI_Send(&p[0][0], 1, my_matrix_type, ...);
Note that you can only use this datatype to send the content of p
and of no other vector<vector<int>>
instance, since the offsets will be different. This holds regardless of whether you create my_matrix_type
from absolute addresses or from offsets relative to the first row. Therefore, the lifetime of that MPI datatype should be the same as the lifetime of p
itself.
When no longer needed, my_matrix_type
should be freed:
MPI_Type_free(&my_matrix_type);
Exactly the same applies to receiving data in vector<vector<T>>
. You first need to resize the outer vector, then resize the inner vectors to prepare the memory. Then build the MPI datatype and use it to receive the data. Then free the MPI datatype if you won't be reusing the same buffer again.
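For instance, a minimal receive-side sketch, assuming N is already known (e.g., via MPI_Probe and MPI_Get_count as shown earlier), with q, q_type, source, and tag as illustrative names:
std::vector<std::vector<int>> q(N, std::vector<int>(N));  // N rows of N ints

std::vector<int> lens(N, N);
std::vector<MPI_Aint> displs;
std::vector<MPI_Datatype> types(N, MPI_INT);
for (auto &row : q) {
    MPI_Aint addr;
    MPI_Get_address(&row[0], &addr);   // absolute address of this row's storage
    displs.push_back(addr);
}

MPI_Datatype q_type;
MPI_Type_create_struct(N, lens.data(), displs.data(), types.data(), &q_type);
MPI_Type_commit(&q_type);

MPI_Recv(MPI_BOTTOM, 1, q_type, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Type_free(&q_type);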
You can neatly pack all of the above steps into an MPI-aware 2D matrix class that frees the MPI datatype in its destructor. That also ensures that a separate MPI datatype is built for each matrix.
How fast is this compared to the previous methods? It is a bit slower than simply using a flat array, and it may be slower or faster than packing and unpacking. It is definitely faster than sending each row as a separate message. Also, some network adapters support gathered reads and scattered writes, which means that the MPI library simply needs to pass the offsets from the MPI datatype directly to the hardware, and the latter will do the heavy lifting of assembling the scattered arrays into a single message. And that can be done very efficiently on both sides of the communication channel.
Note that you do not have to do the same on both the sender and the receiver side. It is perfectly fine to use a user-defined MPI datatype on the sender side and a simple flat array on the receiver side, or vice versa. MPI doesn't care as long as the sender sends N² MPI_INTs in total and the receiver expects at least N² MPI_INTs, i.e., as long as the send and receive types are congruent.
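For illustration, assuming my_matrix_type was built for p as above and N is known on both sides (dest, source, and tag are placeholders), a structured send can be matched by a plain contiguous receive:
// Sender: one message described by the struct datatype (absolute addresses, hence MPI_BOTTOM)
MPI_Send(MPI_BOTTOM, 1, my_matrix_type, dest, tag, MPI_COMM_WORLD);

// Receiver: a plain contiguous buffer of N*N ints
std::vector<int> flat(N * N);
MPI_Recv(flat.data(), N * N, MPI_INT, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);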
A word of caution: MPI datatypes are fairly portable, work on many platforms, and even allow for communication in heterogeneous environments. But their construction can be trickier than it seems. For example, the block offset is of type MPI_Aint
, which is a signed integer of pointer size. That means it can reliably address the whole memory given a base centred in the middle of the address space, but it cannot represent the difference between addresses that are more than half the memory size apart. This is not a problem on most operating systems that do a 1:1 split of the virtual address space into user and kernel parts, which includes 32-bit Linux on x86, 32-bit Windows on x86 without the 4-gigabyte tuning, and the 64-bit versions of Linux, Windows, and macOS on x86 and ARM, as well as most other 32-bit and 64-bit architectures. But there are systems that either completely separate the user and kernel address spaces, a notable example being 32-bit macOS, or that can do splits other than 1:1, an example being 32-bit Windows with the 4-gigabyte tuning, which does a 3:1 split. On such systems one should not use MPI_BOTTOM
with absolute addresses for the block offsets, nor relative offsets from the first row. Instead, one should derive a pointer to the middle of the address space and compute the offsets from it, then also use that pointer as the buffer address in the MPI communication primitives.
Disclaimer: This is a long post and it is possible that there are errors that have slipped under my radar. Expect edits. Also, I claim zero ability to write idiomatic C++.