
I implemented MPI_Scatterv and MPI_Gatherv routines for a parallel matrix-matrix multiplication. Everything works fine for matrix sizes up to N = 180; if I exceed this size, e.g. N = 184, MPI throws errors during MPI_Scatterv.

For the 2D scatter I use a construction based on MPI_Type_create_subarray and MPI_Type_create_resized. An explanation of this construction can be found in this question.

The minimal example below fills a matrix A with some values, scatters it to the local processes, and writes the rank number of each process into its local copy of the scattered A. After that, the local copies are gathered back to the master process.

#include "mpi.h"

#define N 184 // grid size
#define procN 2  // size of process grid

int main(int argc, char **argv) {
    double* gA = nullptr; // pointer to array
    int rank, size;       // rank of current process and no. of processes

    // mpi initialization
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // force to use correct number of processes
    if (size != procN * procN) {
        if (rank == 0) fprintf(stderr,"%s: Only works with np = %d.\n", argv[0], procN *  procN);
        MPI_Abort(MPI_COMM_WORLD,1);
    }

    // allocate and print global A at master process
    if (rank == 0) {
        gA = new double[N * N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                gA[j * N + i] = j * N + i;
            }
        }

        printf("A is:\n");
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                printf("%f ", gA[j * N + i]);
            }
            printf("\n");
        }
    }

    // create local A on every process which we'll process
    double* lA = new double[N / procN * N / procN];

    // create a datatype to describe the subarrays of the gA array
    int sizes[2]    = {N, N}; // gA size
    int subsizes[2] = {N / procN, N / procN}; // lA size
    int starts[2]   = {0,0}; // where this one starts
    MPI_Datatype type, subarrtype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_DOUBLE, &type);
    MPI_Type_create_resized(type, 0, N / procN * sizeof(double), &subarrtype);
    MPI_Type_commit(&subarrtype);

    // compute number of send blocks
    // compute distance between the send blocks
    int sendcounts[procN * procN];
    int displs[procN * procN];

    if (rank == 0) {
        for (int i = 0; i < procN * procN; i++) {
            sendcounts[i] = 1;
        }
        int disp = 0;
        for (int i = 0; i < procN; i++) {
            for (int j = 0; j < procN; j++) {
                displs[i * procN + j] = disp;
                disp += 1;
            }
            disp += ((N / procN) - 1) * procN;
        }
    }

    // scatter global A to all processes
    MPI_Scatterv(gA, sendcounts, displs, subarrtype, lA,
                 N*N/(procN*procN), MPI_DOUBLE,
                 0, MPI_COMM_WORLD);

    // print local A's on every process
    for (int p = 0; p < size; p++) {
        if (rank == p) {
            printf("la on rank %d:\n", rank);
            for (int i = 0; i < N / procN; i++) {
                for (int j = 0; j < N / procN; j++) {
                    printf("%f ", lA[j * N / procN + i]);
                }
                printf("\n");
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // write new values in local A's
    for (int i = 0; i < N / procN; i++) {
        for (int j = 0; j < N / procN; j++) {
            lA[j * N / procN + i] = rank;
        }
    }

    // gather all back to master process
    MPI_Gatherv(lA, N*N/(procN*procN), MPI_DOUBLE,
                gA, sendcounts, displs, subarrtype,
                0, MPI_COMM_WORLD);

    // print processed global A of process 0
    if (rank == 0) {
        printf("Processed gA is:\n");
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                printf("%f ", gA[j * N + i]);
            }
            printf("\n");
        }
    }

    MPI_Type_free(&subarrtype);

    if (rank == 0) {
delete[] gA;
    }

delete[] lA;

    MPI_Finalize();

    return 0;
}

It can be compiled and run using

mpicxx -std=c++11 -o test test.cpp
mpirun -np 4 ./test

For small N = 4, ..., 180 everything works fine (output shown here for N = 6):

A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000 
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000 
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000 
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000 
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000 
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000 
la on rank 0:
0.000000 6.000000 12.000000 
1.000000 7.000000 13.000000 
2.000000 8.000000 14.000000 
la on rank 1:
3.000000 9.000000 15.000000 
4.000000 10.000000 16.000000 
5.000000 11.000000 17.000000 
la on rank 2:
18.000000 24.000000 30.000000 
19.000000 25.000000 31.000000 
20.000000 26.000000 32.000000 
la on rank 3:
21.000000 27.000000 33.000000 
22.000000 28.000000 34.000000 
23.000000 29.000000 35.000000 
Processed gA is:
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 

Here are the errors I get when I use N = 184:

Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffee066bad0, displs=0x7ffee066bae0, dtype=USER<resized>, rbuf=0xe9e590, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(205).........: fail failed
I_MPIR_Scatterv_intra(265)......: Failure during collective
I_MPIR_Scatterv_intra(259)......: fail failed
MPIR_Scatterv(141)..............: fail failed
MPIC_Recv(418)..................: fail failed
MPIC_Wait(269)..................: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error!  cannot read from remote process
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffef0de9b50, displs=0x7ffef0de9b60, dtype=USER<resized>, rbuf=0x21a7610, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(205).........: fail failed
I_MPIR_Scatterv_intra(265)......: Failure during collective
I_MPIR_Scatterv_intra(259)......: fail failed
MPIR_Scatterv(141)..............: fail failed
MPIC_Recv(418)..................: fail failed
MPIC_Wait(269)..................: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error!  cannot read from remote process

My guess is that something goes wrong with the subarrays, but then why does it work for N = 4, ..., 180? Another possibility is that my array data is no longer contiguous in memory for large sizes, so the scatter cannot work anymore. Could cache size be a problem? I can't believe that MPI is unable to scatter 2D arrays with N > 180...

I hope somebody can help me. Thanks a lot!
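
One way to narrow this down is to check how much data each scattered block actually carries and to let MPI return an error code instead of aborting. Below is a minimal diagnostic sketch around the MPI_Scatterv call in the code above; it only uses standard MPI calls (MPI_Type_size, MPI_Comm_set_errhandler, MPI_Error_string) and reuses the variables already defined in the program.

// Minimal diagnostic sketch around the MPI_Scatterv call above (same
// variables as in the program: subarrtype, gA, lA, sendcounts, displs).
int typeSize = 0;
MPI_Type_size(subarrtype, &typeSize); // bytes of data described by one block
if (rank == 0)
    printf("each scattered block carries %d bytes\n", typeSize); // 92*92*8 = 67712 for N = 184

// ask MPI to return error codes instead of aborting immediately
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

int err = MPI_Scatterv(gA, sendcounts, displs, subarrtype, lA,
                       N * N / (procN * procN), MPI_DOUBLE,
                       0, MPI_COMM_WORLD);
if (err != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(err, msg, &len);
    fprintf(stderr, "rank %d: MPI_Scatterv failed: %s\n", rank, msg);
}

Whether the implementation honours MPI_ERRORS_RETURN for this particular internal failure is not guaranteed, but the reported block size at least makes it easy to see how large each per-process message is.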

  • Does [this](http://stackoverflow.com/questions/39548353/mpi-send-and-receive-dont-work-with-more-then-8182-double) help? – Walter May 17 '17 at 09:40
  • Oh, I didn't recognize your answer at first. I think the problem has been found: I use the Intel implementation of MPI, and there is a known issue with MPI_Bcast for large user-defined datatypes [see](https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large-user-defined-datatypes). But I didn't find any known issue about Scatterv for large user-defined datatypes... – BeiHerta May 17 '17 at 11:44
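
Following up on that known-issue comment: if the problem really is the large user-defined datatype inside the collective, one possible workaround is to keep the derived type out of the collective entirely. The sketch below is a drop-in replacement for the MPI_Scatterv call in the code above (the packed buffer and the blk/blockElems names are hypothetical, not from the original code): the root copies each (N/procN) x (N/procN) block into a contiguous staging buffer, ordered by destination rank, and then scatters plain MPI_DOUBLEs. Since all sendcounts are equal here, plain MPI_Scatter suffices.

// Hypothetical workaround sketch: scatter without any user-defined datatype.
// Root packs each block into a contiguous buffer, ordered by destination rank,
// then a plain MPI_Scatter of doubles distributes the blocks.
const int blk = N / procN;          // edge length of one block
const int blockElems = blk * blk;   // doubles per process
double* packed = nullptr;

if (rank == 0) {
    packed = new double[N * N];
    for (int pi = 0; pi < procN; pi++) {       // block row in the process grid
        for (int pj = 0; pj < procN; pj++) {   // block column
            double* dst = packed + (pi * procN + pj) * blockElems;
            const double* src = gA + pi * blk * N + pj * blk;
            for (int row = 0; row < blk; row++)
                for (int col = 0; col < blk; col++)
                    dst[row * blk + col] = src[row * N + col];
        }
    }
}

// each rank receives its block as one contiguous run of doubles
MPI_Scatter(packed, blockElems, MPI_DOUBLE,
            lA, blockElems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

if (rank == 0) delete[] packed;

The gather can be handled symmetrically: MPI_Gather into the staging buffer and unpack on the root. Whether this actually sidesteps the Intel MPI issue is an assumption to verify; it only guarantees that no resized subarray type ever reaches the collective.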

1 Answer


First, your code does not work for small N. If I set N=6 and initialise the matrix so that all entries are unique, i.e.

    gA[j * N + i] = j*N+i;

then you can see there is an error:

mpiexec -n 4 ./gathervorig
A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000 
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000 
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000 
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000 
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000 
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000 
la on rank 0:
0.000000 2.000000 7.000000 
1.000000 6.000000 8.000000 
2.000000 7.000000 12.000000 
la on rank 1:
3.000000 5.000000 10.000000 
4.000000 9.000000 11.000000 
5.000000 10.000000 15.000000 
la on rank 2:
18.000000 20.000000 25.000000 
19.000000 24.000000 26.000000 
20.000000 25.000000 30.000000 
la on rank 3:
21.000000 23.000000 28.000000 
22.000000 27.000000 29.000000 
23.000000 28.000000 33.000000 

The error here is not in the communication code but in the print statement:

printf("%f ", lA[j * procN + i]);

should be

printf("%f ", lA[j * N/procN + i]);

This now gives the correct answer for the scatter at least:

mpiexec -n 4 ./gathervorig
A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000 
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000 
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000 
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000 
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000 
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000 
la on rank 0:
0.000000 6.000000 12.000000 
1.000000 7.000000 13.000000 
2.000000 8.000000 14.000000 
la on rank 1:
3.000000 9.000000 15.000000 
4.000000 10.000000 16.000000 
5.000000 11.000000 17.000000 
la on rank 2:
18.000000 24.000000 30.000000 
19.000000 25.000000 31.000000 
20.000000 26.000000 32.000000 
la on rank 3:
21.000000 27.000000 33.000000 
22.000000 28.000000 34.000000 
23.000000 29.000000 35.000000 

The gather fails for a similar reason - the local initialisation:

  lA[j * procN + i] = rank;

should be

  lA[j * N/procN + i] = rank;

After this change the gather works too:

Processed gA is:
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000 

I think the lesson here is to always use unique test data: initialising to i*j makes it hard to spot the indexing error even in small systems.

Actually, the real issue was that you set N=4 so that procN = N/procN = 2. I always try to use sizes that lead to odd/unusual numbers, e.g. N=6 gives N/procN = 3, so there is no confusion with procN = 2.

  • Thank you for your help. All you said is true. I changed the things you mentioned, but my code still fails for N = 184. Can you run the updated code with N = 184? – BeiHerta May 17 '17 at 11:25
  • This sounds suspiciously like some internal or configuration error in MPI. At N=180, each individual message in the scatter is just under 64KB, but for N=184 it is just above 64KB. Can you run N=184 with more processes, e.g. a 4x4 grid? – David Henty May 19 '17 at 10:00
  • Yes, I can run N=360 using a 4x4 process grid. So what can I do to handle this? – BeiHerta May 20 '17 at 15:26
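
The 64 KiB observation checks out: at N = 180 each block is (180/2)^2 * 8 = 64,800 bytes, just under 65,536, while at N = 184 it is (184/2)^2 * 8 = 67,712 bytes, just over it, which suggests a message-size threshold inside the library (e.g. the eager/rendezvous switch) rather than a problem with the datatype construction itself. One thing to try, purely as a sketch, is to replace the single collective with point-to-point transfers so that a different code path inside the library is exercised. It reuses subarrtype, displs, gA and lA from the code above; only the root touches displs (which is only filled on rank 0 anyway).

// Hypothetical fallback sketch: same data movement as the MPI_Scatterv call,
// but done with point-to-point messages. Root posts one non-blocking send per
// rank (including itself); every rank receives its block as plain doubles.
const int blockElems = N / procN * N / procN;
MPI_Request reqs[procN * procN];

if (rank == 0) {
    for (int r = 0; r < procN * procN; r++) {
        // displs[] is in units of the resized extent (N/procN doubles),
        // exactly as set up for MPI_Scatterv, and is only needed on the root
        MPI_Isend(gA + displs[r] * (N / procN), 1, subarrtype,
                  r, 0, MPI_COMM_WORLD, &reqs[r]);
    }
}

MPI_Recv(lA, blockElems, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

if (rank == 0) {
    MPI_Waitall(procN * procN, reqs, MPI_STATUSES_IGNORE);
}

Whether this avoids the failing path is not guaranteed, since large derived-datatype messages may hit the same transfer machinery; combining it with the packing approach sketched under the question removes the derived type from the transfer altogether.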