I am trying to parallelize a for loop that is sandwiched between two other for loops. After the data (a 3D array) is calculated on each processor, I want to collect it back on the root node for further processing. I tried using the MPI_Gather function to bring the data back to the root node. With this function, the data computed on the root processor is collected, but the data from the other processors is not.
#include <iostream>
#include <mpi.h>
using namespace std;

void Allocate_3D_R(long double***& m, int d1, int d2, int d3);

int main(int argc, char * argv[]) {
    int i, k, l;
    int Np = 7, Nz = 7, Nr = 4;
    int mynode, totalnodes;
    long double ***k_p, ***k_p1;
    int startvalp, endvalp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    // Allocation of memory
    Allocate_3D_R(k_p, (Nz+1), (Np+1), (Nr+1));
    Allocate_3D_R(k_p1, (Nz+1), (Np+1), (Nr+1));

    // startvalp is the local starting value of k for each processor,
    // endvalp the local ending value
    startvalp = (Np+1)*mynode/totalnodes;
    endvalp = startvalp + ((Np+1)/totalnodes) - 1;

    for (l = 0; l <= 1; l++) {
        // k loop parallelized between the processors
        // original loop: for (k = 0; k <= Np; k++)
        for (k = startvalp; k <= endvalp; k++) {
            for (i = 0; i <= 1; i++) {
                k_p[i][k][l] = l + k + i;
            }
        }
    }

    // For Np = 7 and two processors:
    // k = 0 - 3 is calculated on processor 0;
    // k = 4 - 7 is calculated on processor 1.
    // Now I need to collect the values of k_p from processor 1
    // back to the root processor.
    // The MPI_Gather function is used.
    for (l = 0; l <= 1; l++) {
        for (k = startvalp; k <= endvalp; k++) {
            for (i = 0; i <= 1; i++) {
                MPI_Gather(&(k_p[i][k][l]), 1, MPI_LONG_DOUBLE,
                           &(k_p1[i][k][l]), 1, MPI_LONG_DOUBLE,
                           0, MPI_COMM_WORLD);
            }
        }
    }

    // With this, k_p is collected from the root processor and stored
    // in the k_p1 variable, but from the other processors it is not
    // collected back to the root processor.
    if (mynode == 0) {
        for (l = 0; l <= 1; l++) {
            for (k = 0; k <= Np; k++) {
                for (i = 0; i <= 1; i++) {
                    cout << "Processor " << mynode;
                    cout << ": k_p[" << i << "][" << k << "][" << l << "] = "
                         << k_p1[i][k][l] << endl;
                }
            }
        }
    }

    MPI_Finalize();
} // end of main

void Allocate_3D_R(long double***& m, int d1, int d2, int d3) {
    m = new long double**[d1];
    for (int i = 0; i < d1; ++i) {
        m[i] = new long double*[d2];
        for (int j = 0; j < d2; ++j) {
            m[i][j] = new long double[d3];
            for (int k = 0; k < d3; ++k) {
                m[i][j][k] = 0.0;
            }
        }
    }
}
Here is the output:
Processor 0: k_p[0][0][0] = 0
Processor 0: k_p[1][0][0] = 1
Processor 0: k_p[0][1][0] = 1
Processor 0: k_p[1][1][0] = 2
Processor 0: k_p[0][2][0] = 2
Processor 0: k_p[1][2][0] = 3
Processor 0: k_p[0][3][0] = 3
Processor 0: k_p[1][3][0] = 4
Processor 0: k_p[0][4][0] = 0
Processor 0: k_p[1][4][0] = 0
Processor 0: k_p[0][5][0] = 0
Processor 0: k_p[1][5][0] = 0
Processor 0: k_p[0][6][0] = 0
Processor 0: k_p[1][6][0] = 0
Processor 0: k_p[0][7][0] = 0
Processor 0: k_p[1][7][0] = 0
Processor 0: k_p[0][0][1] = 1
Processor 0: k_p[1][0][1] = 2
Processor 0: k_p[0][1][1] = 2
Processor 0: k_p[1][1][1] = 3
Processor 0: k_p[0][2][1] = 3
Processor 0: k_p[1][2][1] = 4
Processor 0: k_p[0][3][1] = 4
Processor 0: k_p[1][3][1] = 5
Processor 0: k_p[0][4][1] = 0
Processor 0: k_p[1][4][1] = 0
Processor 0: k_p[0][5][1] = 0
Processor 0: k_p[1][5][1] = 0
Processor 0: k_p[0][6][1] = 0
Processor 0: k_p[1][6][1] = 0
Processor 0: k_p[0][7][1] = 0
Processor 0: k_p[1][7][1] = 0
The data from the root processor is transferred, but not from the other processor.
I tried using the MPI_Send and MPI_Recv functions instead and did not encounter this problem, but for large loop sizes it takes much more time.
Can anyone provide a solution to the above problem?