
This is a follow-up question to MPI_Gather 2D array. Here is the situation:

id = 0 has this submatrix

|16.000000| |11.000000| |12.000000| |15.000000|
|6.000000| |1.000000| |2.000000| |5.000000|
|8.000000| |3.000000| |4.000000| |7.000000|
|14.000000| |9.000000| |10.000000| |13.000000|
-----------------------

id = 1 has this submatrix

|12.000000| |15.000000| |16.000000| |11.000000|
|2.000000| |5.000000| |6.000000| |1.000000|
|4.000000| |7.000000| |8.000000| |3.000000|
|10.000000| |13.000000| |14.000000| |9.000000|
-----------------------

id = 2 has this submatrix

|8.000000| |3.000000| |4.000000| |7.000000|
|14.000000| |9.000000| |10.000000| |13.000000|
|16.000000| |11.000000| |12.000000| |15.000000|
|6.000000| |1.000000| |2.000000| |5.000000|
-----------------------

id = 3 has this submatrix

|4.000000| |7.000000| |8.000000| |3.000000|
|10.000000| |13.000000| |14.000000| |9.000000|
|12.000000| |15.000000| |16.000000| |11.000000|
|2.000000| |5.000000| |6.000000| |1.000000|
-----------------------

This is the global matrix I currently get:

|1.000000| |2.000000| |5.000000| |6.000000|
|3.000000| |4.000000| |7.000000| |8.000000|
|11.000000| |12.000000| |15.000000| |16.000000|
|-3.000000| |-3.000000| |-3.000000| |-3.000000|

What I am trying to do is gather only the central elements (the ones not on the borders) into the global grid, so the global grid should look like this:

 |1.000000| |2.000000| |5.000000| |6.000000|
 |3.000000| |4.000000| |7.000000| |8.000000|
 |9.000000| |10.000000| |13.000000| |14.000000|
 |11.000000| |12.000000| |15.000000| |16.000000|

and not like the one I am getting. This is the code I have:

float **gridPtr;
float **global_grid;
lengthSubN = N/pSqrt; // N is the dimension of the global grid and pSqrt the square root of the number of processes
MPI_Type_contiguous(lengthSubN, MPI_FLOAT, &rowType);
MPI_Type_commit(&rowType);
if(id == 0) {
    MPI_Gather(&gridPtr[1][1], 1, rowType, global_grid[0], 1, rowType, 0, MPI_COMM_WORLD);
    MPI_Gather(&gridPtr[2][1], 1, rowType, global_grid[1], 1, rowType, 0, MPI_COMM_WORLD);
} else {
    MPI_Gather(&gridPtr[1][1], 1, rowType, NULL, 0, rowType, 0, MPI_COMM_WORLD);
    MPI_Gather(&gridPtr[2][1], 1, rowType, NULL, 0, rowType, 0, MPI_COMM_WORLD);
}
...
float** allocate2D(float** A, const int N, const int M) {
    int i;
    float *t0;

    A = malloc(M * sizeof (float*)); /* Allocating pointers */
    if(A == NULL)
        printf("MALLOC FAILED in A\n");
    t0 = malloc(N * M * sizeof (float)); /* Allocating data */
    if(t0 == NULL)
        printf("MALLOC FAILED in t0\n");
    for (i = 0; i < M; i++)
        A[i] = t0 + i * (N);

    return A;
}
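
One workaround I am considering (a rough, untested sketch follows; `packed` and `staging` are just placeholder names I made up, and the rest reuses the variables above): have every rank copy its central elements into a small contiguous buffer, gather those buffers on the root, and let the root copy each rank's block into the right place of `global_grid`, assuming the ranks form a row-major pSqrt x pSqrt grid:

    /* Untested sketch: pack the central (non-border) elements, gather the flat
     * buffers, and let the root place each rank's block explicitly.  Assumes
     * local grids are (lengthSubN+2) x (lengthSubN+2) with a one-cell border. */
    float *packed = malloc(lengthSubN * lengthSubN * sizeof(float));
    float *staging = NULL;
    if (id == 0)
        staging = malloc(p * lengthSubN * lengthSubN * sizeof(float));

    for (int i = 0; i < lengthSubN; i++)          /* skip the one-cell border */
        for (int j = 0; j < lengthSubN; j++)
            packed[i * lengthSubN + j] = gridPtr[i + 1][j + 1];

    MPI_Gather(packed, lengthSubN * lengthSubN, MPI_FLOAT,
               staging, lengthSubN * lengthSubN, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (id == 0) {
        for (int r = 0; r < p; r++) {
            int row0 = (r / pSqrt) * lengthSubN;   /* block row of rank r    */
            int col0 = (r % pSqrt) * lengthSubN;   /* block column of rank r */
            for (int i = 0; i < lengthSubN; i++)
                for (int j = 0; j < lengthSubN; j++)
                    global_grid[row0 + i][col0 + j] =
                        staging[r * lengthSubN * lengthSubN + i * lengthSubN + j];
        }
    }
    free(packed);
    free(staging);   /* free(NULL) is a no-op on non-root ranks */

This would avoid the placement problem entirely, because the root decides where each block lands, but it costs an extra copy on every rank.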

EDIT:

Here is my attempt without MPI_Gather(), using a subarray type instead:

    MPI_Datatype mysubarray;

    int starts[2] = {1, 1};
    int subsizes[2]  = {lengthSubN, lengthSubN};
    int bigsizes[2]  = {N_glob, M_glob};
    MPI_Type_create_subarray(2, bigsizes, subsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &mysubarray);
    MPI_Type_commit(&mysubarray);
    MPI_Isend(&(gridPtr[0][0]), 1, mysubarray, 0, 3, MPI_COMM_WORLD, &req[0]);
    MPI_Type_free(&mysubarray);
    MPI_Barrier(MPI_COMM_WORLD);
    if(id == 0) {
      for(i = 0; i < p; ++i) {
        MPI_Irecv(&(global_grid[i][0]), lengthSubN * lengthSubN, MPI_FLOAT, i, 3, MPI_COMM_WORLD, &req[0]);
      }
    }
    if(id == 0)
            print(global_grid, N_glob, N_glob);

but the result is:

|1.000000| |2.000000| |3.000000| |4.000000|
|5.000000| |6.000000| |7.000000| |8.000000|
|9.000000| |10.000000| |11.000000| |12.000000|
|13.000000| |14.000000| |15.000000| |16.000000|

which is not exactly what I want. I have to find a way to tell the receive where to place the data. For example, if I do:

MPI_Irecv(&(global_grid[0][0]), 1, mysubarray, 0, 3, MPI_COMM_WORLD, &req[0]);

then I would get:

|-3.000000| |-3.000000| |-3.000000| |-3.000000|
|-3.000000| |1.000000| |2.000000| |-3.000000|
|-3.000000| |3.000000| |4.000000| |-3.000000|
|-3.000000| |-3.000000| |-3.000000| |-3.000000|
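
So presumably I need one receive-side subarray per rank, each with its own `starts` offset into the global grid. A rough, untested sketch of what I have in mind (the senders keep the `MPI_Isend` of `mysubarray` from above; `gsizes`, `gsub`, `gstarts` and `recvblock` are just my placeholder names, and I assume `req` has at least `p` entries and a row-major rank layout):

    if (id == 0) {
        for (i = 0; i < p; ++i) {
            MPI_Datatype recvblock;
            int gsizes[2]  = {N_glob, N_glob};
            int gsub[2]    = {lengthSubN, lengthSubN};
            /* place rank i's block at its position in the pSqrt x pSqrt layout */
            int gstarts[2] = {(i / pSqrt) * lengthSubN, (i % pSqrt) * lengthSubN};
            MPI_Type_create_subarray(2, gsizes, gsub, gstarts,
                                     MPI_ORDER_C, MPI_FLOAT, &recvblock);
            MPI_Type_commit(&recvblock);
            MPI_Irecv(&(global_grid[0][0]), 1, recvblock, i, 3, MPI_COMM_WORLD, &req[i]);
            MPI_Type_free(&recvblock);   /* actually freed once the receive completes */
        }
        MPI_Waitall(p, req, MPI_STATUSES_IGNORE);
        /* the MPI_Isend request from above should eventually be completed with
         * MPI_Wait as well */
    }

Waiting on all `p` requests should also avoid reusing the same `req[0]` handle for every receive.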
  • If you didn't insist on using arrays of pointers to represent the 2D arrays, this would be a trivial exercise. Why not use pitched linear memory instead? – talonmies Dec 31 '15 at 11:08
  • @talonmies I am trying to augment a code that was written back in 2012, so it isn't my choice, really. However, I could flatten the 2D arrays just before the gather operation, that wouldn't cost that much I guess. So if you have a proposition on that approach, please feel free to post an answer. – gsamaras Dec 31 '15 at 11:10
  • For example @talonmies I could have every process create a 1D array that would contain only the central elements, for example the 4th process would have `{13, 14, 15, 16}`. However, it's still not clear for me how to proceed. – gsamaras Dec 31 '15 at 11:36
  • Possible duplicate of [MPI\_Type\_create\_subarray and MPI\_Send](http://stackoverflow.com/questions/5583540/mpi-type-create-subarray-and-mpi-send) – talonmies Dec 31 '15 at 12:09
  • I can not understand the question @talonmies, I am terribly bad in Italian. If you have any insight please let me know. – gsamaras Dec 31 '15 at 12:16
  • I found this http://stackoverflow.com/questions/13345147/how-to-use-mpi-type-create-subarray which is actually readable. Working on it.. – gsamaras Dec 31 '15 at 12:32
  • @talonmies I updated my post with a relevant attempt. Any idea? – gsamaras Dec 31 '15 at 13:37
  • Please add the code of the `allocate2D` function because without it, no one will understand how your 2D-like "array" works. Also specify the actual value of `lengthSubN`, which should be 2. – Martin Zabel Dec 31 '15 at 14:01
  • Updated @MartinZabel, thanks for your comment. – gsamaras Dec 31 '15 at 14:04
  • Your non-blocking calls are all flawed since you are storing the request handles into the same variable `req[0]`. Thus, you are not able to wait on any of them except the last one, and MPI allows implementations to postpone actual communication until the wait/test calls (if nothing else progresses them, that is). – Hristo Iliev Jan 03 '16 at 11:04
  • @HristoIliev I was suspicious about that too, thank you! – gsamaras Jan 03 '16 at 16:17

1 Answer


I cannot give a full solution, but I will explain why your original example using MPI_Gather does not work as expected.

With lengthSubN = 2, you defined a new datatype of 2 floats that are stored adjacently in memory at this line:

MPI_Type_contiguous(lengthSubN, MPI_FLOAT, &rowType);

Now, let's take a look at the first MPI_Gather call which is:

if(id == 0) {
    MPI_Gather(&gridPtr[1][1], 1, rowType, global_grid[0], 1, rowType, 0, MPI_COMM_WORLD);
} else {
    MPI_Gather(&gridPtr[1][1], 1, rowType, NULL, 0, rowType, 0, MPI_COMM_WORLD);
}

It takes 1 rowType, i.e. 2 adjacent floats starting at element gridPtr[1][1], from each rank. These are the values:

id 0:  1.0   2.0
id 1:  5.0   6.0
id 2:  9.0  10.0
id 3: 13.0  14.0

and places them one after another in the receive buffer pointed to by global_grid[0]. This pointer actually points to the start of the first row, so the memory is filled with:

 1.0   2.0   5.0   6.0   9.0  10.0  13.0  14.0

But global_grid has only 4 columns per row, so the last 4 values wrap into the second row, pointed to by global_grid[1] (*). This may even be undefined behaviour. Thus, after this MPI_Gather the contents of global_grid are:

 1.0   2.0   5.0   6.0 
 9.0  10.0  13.0  14.0
-3.0  -3.0  -3.0  -3.0
-3.0  -3.0  -3.0  -3.0

The second MPI_Gather works the same way and starts writing at the second row of global_grid:

 3.0   4.0   7.0   8.0  11.0  12.0  15.0  16.0

It thus overwrites some of the values written above, and the result is the one you observed:

 1.0   2.0   5.0   6.0 
 3.0   4.0   7.0   8.0
11.0  12.0  15.0  16.0
-3.0  -3.0  -3.0  -3.0

(*) allocate2D actually allocates contiguous memory for the two-dimensional data buffer.
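
To see this concretely, here is a small check you could add on rank 0 (just a sketch, assuming global_grid was allocated with allocate2D and N is the global dimension from the question):

    #include <assert.h>
    ...
    /* allocate2D backs all rows with one contiguous allocation, so consecutive
     * row pointers differ by exactly N floats.  Writing 8 floats starting at
     * global_grid[0] therefore runs straight on into the row that
     * global_grid[1] points to. */
    assert(global_grid[1] == global_grid[0] + N);
    assert(&global_grid[0][N] == &global_grid[1][0]);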

Martin Zabel