
So, I am trying to set up a communication routine that transfers some 2D arrays from a number of processors back to root using `MPI_Gatherv`.

I have been trying to do this using `MPI_Scatterv` and `MPI_Gatherv`, starting with a simple example: a 4x4 array that I partition into four 2x2 arrays and then scatter.

I am using `MPI_Scatterv` to partition my 4x4 array across 4 processors. I have reached a point where the root processor manages to print out its 2x2 array, but then I get a segfault once the next processor tries to print out its local data. If I don't print the local array, I don't get any errors. Here is my code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main (int argc, char** argv) {

    int rank, numprocs;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndim = 4;
    int **ga = NULL;

    // Create global array. size is 4x4, Only for Root.
    if (rank == 0) {

            ga = malloc(ndim*sizeof(int*));

            for (int i=0; i < ndim; i++)
                    ga[i] = malloc(ndim*sizeof(int));

            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            if (i > 0) {
                                    ga[i][j] = i*ndim + j;

                            } else {
                                    ga[i][j] =i+j;

                            }
                    }
            }
    }

    //print global array.
    if (rank == 0) {

            printf("Send array:\n");
            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            printf(" %d ", ga[i][j]);
                    }

                    printf("\n");
            }
    }

    //Create local arrays on all procs.
    int **la = NULL;
    //local array size is 2x2.
    int ndim_loc = ((ndim*ndim)/numprocs)/2;

    la = (int**)malloc(ndim_loc*sizeof(int*));
    for (int i=0; i< ndim_loc; i++)
            la[i] = (int*)malloc(ndim_loc*sizeof(int));

     if (rank == 0) {

            printf("recieve array:\n");

            for (int i=0; i <ndim_loc; i++) {

                    for(int j=0; j<ndim_loc; j++) {

                            la[i][j] = 0;

                            printf(" %d ", la[i][j]);
                    }

                    printf("\n");
            }
    }

    // global size  
    int sizes[2] = {ndim, ndim};

    //local size, 4 procs, ndim = 4. each proc has a 2x2.
    int subsizes[2] = {ndim_loc, ndim_loc};
    int starts[2] = {0, 0};               //Set starting point of subarray in global array.

    if (rank == 0) {

            printf("Global arr dims = [%d,%d]\n", sizes[0], sizes[1]);
            printf("Sub arr dims = [%d,%d]\n", subsizes[0], subsizes[1]);
            printf("start point in global = [%d,%d]\n", starts[0], starts[1]);
    }

    //Preparing MPI send types.
    MPI_Datatype sub_arr, type;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &sub_arr);
    MPI_Type_create_resized(sub_arr, 0, 2*sizeof(int), &type); 
    //Re-sizing block extent from one int to two ints, i.e. 1 block extent is one row of the sub array       
    MPI_Type_commit(&type);
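    // With an extent of 2 ints, a displacement of d places a block d*2 ints past
    // ga[0][0]: displ 1 -> row 0, col 2; displ 4 -> row 2, col 0; displ 5 -> row 2, col 2.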

    //Setting up arrays for sendcounts (each processor receives 1 sub array).
    int scounts[numprocs];

    //Displacements relative to global array[0][0].
    // .___.___.___.___.
    // |[0]|   |[1]|   |  [i] marks the starting position of the sub array in the global one for processor i.
    // |___|___|___|___|  So, the displacements (in units of the new block extent) are: {0,1,4,5}
    // |   |   |   |   |
    // |___|___|___|___|
    // |[2]|   |[3]|   |
    // |___|___|___|___|
    // |   |   |   |   |
    // |___|___|___|___|

    int displs[numprocs];

    for (int i=0; i<numprocs; i++) {

            scounts[i] = 1;

            if (i > 0 && i%2 == 0) {

                    displs[i] = displs[i-1] + 3;

            } else if (i == 0) {

                    displs[i] = 0;

            } else {

                    displs[i] = displs[i-1] + 1;

            }
    }

    MPI_Barrier(MPI_COMM_WORLD);

    printf("I AM RANK %d, displ = %d, scount = %d\n", rank, displs[rank], scounts[rank]);

    //Sending uses the newly defined MPI_TYPE, receiving side is 4 MPI_INTs.
    MPI_Scatterv(&ga, scounts, displs, type, &la, (ndim_loc*ndim_loc), MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);

    //print local array.    
    printf("RANK = %d, local data:\n", rank);

    for (int i=0; i<ndim_loc; i++) {

            for (int j=0; j<ndim_loc; j++) {

                    printf("  %d  ", la[i][j]);
            }

            printf("\n");
    }

    MPI_Finalize();

    return 0;
    }

I have found a few similar questions answered (such as: sending blocks of 2D array in C using MPI, and MPI C - Gather 2d Array Segments into One Global Array), which helped me greatly in understanding what is happening in the actual memory layout. But I can't seem to get this `MPI_Scatterv` call to work, and I am not sure what I am doing wrong.

In one of these answers, the solution is to allocate the receive buffer on each processor as one contiguous block and point the row pointers into it, i.e.:

    int **la = NULL;
    int *la_pre = NULL;
    int ndim_loc = ((ndim*ndim)/numprocs)/2;

    la_pre = malloc((ndim_loc*ndim_loc)*sizeof(int));
    la = malloc(ndim_loc*sizeof(int*));

    for (int i=0; i<ndim_loc; i++)
            la[i] = &(la_pre[i*ndim_loc]);

Unfortunately, this doesn't seem to work; I get the same output as before:

    mpirun -np 4 ./a.out 
    Send array:
    0  1  2  3 
    4  5  6  7 
    8  9  10  11 
    12  13  14  15 
    recieve array:
    0  0 
    0  0 
    Global arr dims = [4,4]
    Sub arr dims = [2,2]
    start point in global = [0,0]
    I AM RANK 0, displ = 0, scount = 1
    I AM RANK 1, displ = 1, scount = 1
    I AM RANK 2, displ = 4, scount = 1
    I AM RANK 3, displ = 5, scount = 1
    RANK = 0, local data:  
     0   1  
     4   5 
    RANK = 1, local data:
    RANK = 2, local data:
    RANK = 3, local data:

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 928 RUNNING AT login03
    =   EXIT CODE: 11
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
       Intel(R) MPI Library troubleshooting guide:
          https://software.intel.com/node/561764
    ===================================================================================

Any help with this would be really appreciated!

EDIT: A user suggested allocating my memory contiguously rather than 'jaggedly'. Here are the memory allocations. Sender:

    int ndim = 4;
    int **ga = NULL;
    int *ga_pre = NULL;

    // Create global array. size is 4x4
    if (rank == 0) {

            //Contiguous allocation.
            ga_pre = malloc((ndim*ndim)*sizeof(int));
            ga = malloc(ndim*sizeof(int*));

            for (int i=0; i<ndim; i++)
                    ga[i] = &(ga_pre[ndim*i]);

Receiver:

     //Create local arrays on all procs.
    int **la = NULL;
    int *la_pre = NULL;
    int ndim_loc = ((ndim*ndim)/numprocs)/2;

    //Contiguous allocation
    la_pre = malloc((ndim_loc*ndim_loc)*sizeof(int));
    la = malloc(ndim_loc*sizeof(int*));

    for (int i=0; i<ndim_loc; i++)
            la[i] = &(la_pre[i*ndim_loc]);

    if (rank == 0) {

            printf("recieve array:\n");

            for (int i=0; i <ndim_loc; i++) {

                    for(int j=0; j<ndim_loc; j++) {

                            la[i][j] = 0;

                            printf(" %d ", la[i][j]);
                    }

                    printf("\n");
            }
    }

And the Scatterv call:

    MPI_Scatterv(&ga[0][0], scounts, displs, type, &la[0][0], (ndim_loc*ndim_loc), MPI_INT, 0, MPI_COMM_WORLD);

My new output:

    mpirun -np 4 ./a.out 
    Send array:
     0  1  2  3 
     4  5  6  7 
     8  9  10  11 
     12  13  14  15 
    recieve array:
     0  0 
     0  0 
    start point in global = [0,0]
    Global arr dims = [4,4]
    Sub arr dims = [2,2]
    I AM RANK 0, displ = 0, scount = 1
    I AM RANK 1, displ = 1, scount = 1
    I AM RANK 2, displ = 4, scount = 1
    I AM RANK 3, displ = 5, scount = 1

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 23484 RUNNING AT login04
    =   EXIT CODE: 11
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

UPDATE: I have tried using the contiguous memory allocation functions from another user, @ryyker:

    int ndim = 4;
    int **ga = NULL;

    // Create global array. size is 4x4
    if (rank == 0) {

            //contiguous allocation
            ga = Create2D(ndim, ndim);

            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            if (i > 0) {
                                    ga[i][j] = i*ndim + j;

                            } else {
                                    ga[i][j] =i+j;

                            }
                    }
            }
    }

    //print global array.
    if (rank == 0) {

            printf("Send array:\n");
            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            printf(" %d ", ga[i][j]);
                    }

                    printf("\n");
            }
    }

    //Create local arrays on all procs.
    int **la = NULL;
    int ndim_loc = ((ndim*ndim)/numprocs)/2;

    //Contiguous allocation
    la = Create2D(ndim_loc, ndim_loc);

    if (rank == 0) {

            printf("recieve array:\n");

            for (int i=0; i <ndim_loc; i++) {

                    for(int j=0; j<ndim_loc; j++) {

                            la[i][j] = 0;

                            printf(" %d ", la[i][j]);
                    }

                    printf("\n");
            }
    }

Code Output:

    mpirun -np 4 ./a.out 
    Send array:
     0  1  2  3 
     4  5  6  7 
     8  9  10  11 
     12  13  14  15 
    recieve array:
     0  0 
     0  0 
    start point in global = [0,0]
    Global arr dims = [4,4]
    Sub arr dims = [2,2]
    I AM RANK 0, displ = 0, scount = 1
    I AM RANK 2, displ = 4, scount = 1
    I AM RANK 1, displ = 1, scount = 1
    I AM RANK 3, displ = 5, scount = 1

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 27303 RUNNING AT login04
    =   EXIT CODE: 11
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
       Intel(R) MPI Library troubleshooting guide:
          https://software.intel.com/node/561764
    ===================================================================================

UPDATE: Thanks for all the suggestions. I am going to see if I can do this task with static arrays first, at which point I will start to play around with different memory allocation routines. I will update the post once I figure out what is going wrong.
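
To be concrete, the static-array version I have in mind is roughly the following, reusing the same subarray type and displacements as above (a sketch only; I have not run this yet):

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char** argv) {

    int rank, numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numprocs != 4) MPI_Abort(MPI_COMM_WORLD, 1);   //sketch assumes exactly 4 ranks

    int ga[4][4];   //global array, only filled on root
    int la[2][2];   //one 2x2 block per rank

    if (rank == 0)
            for (int i=0; i<4; i++)
                    for (int j=0; j<4; j++)
                            ga[i][j] = i*4 + j;

    //Same 2x2-out-of-4x4 subarray type as before, extent resized to 2 ints.
    int sizes[2] = {4, 4}, subsizes[2] = {2, 2}, starts[2] = {0, 0};
    MPI_Datatype sub_arr, type;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &sub_arr);
    MPI_Type_create_resized(sub_arr, 0, 2*sizeof(int), &type);
    MPI_Type_commit(&type);

    int scounts[4] = {1, 1, 1, 1};
    int displs[4] = {0, 1, 4, 5};   //block origins, in units of the resized extent

    //For a static 2D array, &ga[0][0] really is the start of 16 contiguous ints.
    MPI_Scatterv(&ga[0][0], scounts, displs, type, &la[0][0], 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("RANK %d: %d %d / %d %d\n", rank, la[0][0], la[0][1], la[1][0], la[1][1]);

    MPI_Type_free(&type);
    MPI_Finalize();

    return 0;
    }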

UPDATE (5/12/2019):

OK, so the problem was with my memory allocation, as expected. It turns out that the type of memory allocation I used earlier (in the code snippets above) does not create truly contiguous memory blocks; the arrays are still 'jagged'. This is why the call to `MPI_Scatterv` caused no errors (4 ints were sent and 4 ints were received on each processor), but printing the array caused a segfault: the data was placed in memory, but my print statement addressed it as if it were a jagged array rather than a contiguous block. This is explained on the webpage https://www.cs.swarthmore.edu/~newhall/unixhelp/C_arrays.html, which @ryyker pointed me to.
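
To make the difference concrete, here is a minimal sketch of the two allocation patterns (not taken from my program; `rows` and `cols` are placeholder names):

    //'Jagged': the row pointers sit together, but every row is a separate
    //malloc'd block, so the data is not one rows*cols block of ints.
    int **a = malloc(rows * sizeof(int*));
    for (int i = 0; i < rows; i++)
            a[i] = malloc(cols * sizeof(int));

    //Truly contiguous: one flat block; element (i,j) lives at b[i*cols + j],
    //where cols is the 'stride'.
    int *b = malloc(rows * cols * sizeof(int));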

So the way to allocate and reference the memory contiguously is to allocate a 1D array and reference its elements in print/initialise statements using the row and column indices (i,j) and the 'stride' of the array, which is the number of elements in one row. The code follows:

    int main (int argc, char** argv) {

    int rank, numprocs;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndim = 4;
    int *ga = NULL;

    // Create global array. size is 4x4
    if (rank == 0) {

            //contiguous allocation
            ga = malloc((ndim*ndim)*sizeof(int));

            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            //ndim is the 'stride'
                            if (i > 0) {
                                    ga[i*ndim +j] = i*ndim +j;

                            } else {
                                    ga[i*ndim +j] =i+j;

                            }
                    }
            }
    }

    //print global array.
    if (rank == 0) {

            printf("Send array:\n");
            for (int i=0; i < ndim; i++) {

                    for (int j=0; j < ndim; j++) {

                            printf(" %d ", ga[i*ndim +j]);
                    }

                    printf("\n");
            }
    }

    //Create local arrays on all procs.
    int *la = NULL;

    int ndim_loc = ((ndim*ndim)/numprocs)/2;

    //Contiguous allocation
    la = malloc((ndim_loc*ndim_loc)*sizeof(int));

    if (rank == 0) {

            printf("recieve array:\n");

            for (int i=0; i <ndim_loc; i++) {

                    //ndim_loc is the 'stride' of the sub-array
                    for(int j=0; j<ndim_loc; j++) {

                            la[i*ndim_loc + j] = 0;

                            printf(" %d ", la[i*ndim_loc +j]);
                    }

                    printf("\n");
            }
    }

    // global size  
    int sizes[2] = {ndim, ndim};

    //local size, 4 procs, ndim = 4. each proc has a 2x2.
    int subsizes[2] = {ndim_loc, ndim_loc};
    int starts[2] = {0, 0};

    int scounts[numprocs];
    int displs[numprocs];

    for (int i=0; i<numprocs; i++) {

            scounts[i] = 1;

            if (i > 0 && i%2 == 0) {

                    displs[i] = displs[i-1] + 3;

            } else if (i == 0) {

                    displs[i] = 0;

            } else {

                    displs[i] = displs[i-1] + 1;

            }
    }

    if (rank == 0) {

            printf("start point in global = [%d,%d]\n", starts[0], starts[1]);
            printf("Global arr dims = [%d,%d]\n", sizes[0], sizes[1]);
            printf("Sub arr dims = [%d,%d]\n", subsizes[0], subsizes[1]);
    }

    MPI_Datatype sub_arr, type;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &sub_arr);
    MPI_Type_create_resized(sub_arr, 0, 2*sizeof(int), &type);
    MPI_Type_commit(&type);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("I AM RANK %d, displ = %d, scount = %d\n", rank, displs[rank], scounts[rank]);

    MPI_Scatterv(ga, scounts, displs, type, la, (ndim_loc*ndim_loc), MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);

    //print local array.    
    printf("RANK = %d, local data:\n", rank);
    for (int i=0; i<ndim_loc; i++) {

            for (int j=0; j<ndim_loc; j++) {

                    printf("  %d  ", la[i*ndim_loc +j]);
            }

            printf("\n");
    }

    MPI_Finalize();

    return 0;
    }

With the Output:

    mpirun -np 4 ./a.out 
    Send array:
     0  1  2  3 
     4  5  6  7 
     8  9  10  11 
     12  13  14  15 
     recieve array:  
     0  0 
     0  0 
    start point in global = [0,0]
    Global arr dims = [4,4]
    Sub arr dims = [2,2]
    I AM RANK 0, displ = 0, scount = 1
    I AM RANK 1, displ = 1, scount = 1
    I AM RANK 2, displ = 4, scount = 1
    I AM RANK 3, displ = 5, scount = 1
    RANK = 0, local data:
      0    1  
      4    5  
    RANK = 1, local data:
      2    3  
      6    7  
    RANK = 2, local data:
     8    9  
     12    13     
    RANK = 3, local data:
     10    11  
     14    15  
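
For completeness, the step I actually need next is the reverse transfer back to root with `MPI_Gatherv`. Reusing the same `type`, `scounts` and `displs`, I expect the mirrored call to look like this (a sketch I have not run yet):

    //Each rank sends its ndim_loc*ndim_loc ints; root places one resized
    //subarray block per rank back into ga at the same displacements.
    MPI_Gatherv(la, (ndim_loc*ndim_loc), MPI_INT,
                ga, scounts, displs, type, 0, MPI_COMM_WORLD);
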
  • You have to allocate your 2d arrays in contiguous memory. And pass the receive buffer correctly. Please update your code and see how things go. – Gilles Gouaillardet Dec 03 '19 at 13:37
  • @GillesGouaillardet I have just updated my code, where the global and local arrays are allocated contiguously (just as in my code snippet); I have also tried contiguous allocation for the local array and jagged for the global array. They are still giving me the exact same output as before. Do you think it could be an issue with the variable 'starts[2]'; should these reference the unique start points for each processor, or is it only necessary to have {0,0} referenced? Edit: I have also tried &ga[0][0]/&la[0][0] as arguments to scatterv, to no avail. – user123 Dec 03 '19 at 14:13
  • Please append the updated code to the question, so I can check your changes. – Gilles Gouaillardet Dec 03 '19 at 14:18
  • Where in your code do you print "recieve array:" ( with a typo...) ? If you expect my help, please post the full code you compiled and ran. And if I cannot spot the error, I will use a debugger to track it down. – Gilles Gouaillardet Dec 03 '19 at 14:45
  • I have deleted my answer, as it was not really an answer for your specific needs, i.e. a _[truly contiguous block of memory locations](https://www.cs.swarthmore.edu/~newhall/unixhelp/C_arrays.html)_. I was mistaken in saying my method resulted in a truly contiguous block. However, I hope the link helps you do that. (It specifically calls out methods that will yield contiguous blocks, and methods that will not.) – ryyker Dec 03 '19 at 15:32
  • @ryyker, ok I appreciate the help, thanks. – user123 Dec 03 '19 at 16:31
  • I would try to get this working using static arrays first. Something like `int ga[4][4] = {0};` – ptb Dec 03 '19 at 21:35
  • `&ga[0][0]` is generally **not** a valid `sendbuf` – Gilles Gouaillardet Dec 04 '19 at 00:12
