
I'm implementing a distributed greyscale image convolution using MPI. My existing pattern is to read the image as a 1D flattened array at the root process, scatter it to all processes (row decomposition), convolve locally, gather the results back at the root with MPI_Gather, and write the image out again as a 1D flattened array. Obviously, this doesn't give the expected results, since image convolution gets tricky at the row boundaries between processes.
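A simplified sketch of that baseline (assuming the image height divides evenly among the ranks; convolve_rows is a placeholder for my actual convolution routine):

std::vector<float> full_image;            // filled only at the root
int rows_per_rank = total_height / size;
std::vector<float> local(width * rows_per_rank);

MPI_Scatter(full_image.data(), width * rows_per_rank, MPI_FLOAT,
            local.data(),      width * rows_per_rank, MPI_FLOAT,
            0, MPI_COMM_WORLD);

convolve_rows(local, width, rows_per_rank);   // wrong at the first/last local rows

MPI_Gather(local.data(),      width * rows_per_rank, MPI_FLOAT,
           full_image.data(), width * rows_per_rank, MPI_FLOAT,
           0, MPI_COMM_WORLD);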

So, to improve upon the aforementioned pattern, I want to implement the so-called ghost cell exchange pattern, wherein neighbouring processes exchange their boundary rows into each other's ghost rows. In pseudocode:

if (rank == 0) {
    src  = MPI_PROC_NULL;   // no neighbour above; the receive becomes a no-op
    dest = rank + 1;
} else if (rank == size - 1) {
    src  = rank - 1;
    dest = MPI_PROC_NULL;   // no neighbour below; the send becomes a no-op
} else {
    src  = rank - 1;
    dest = rank + 1;
}

MPI_Sendrecv(&sendbuf[offset], slen, MPI_FLOAT, dest, tag,
             &recvbuf[offset], rlen, MPI_FLOAT, src, tag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

How do I allocate memory for the "ghost rows" on each process? Should I pre-allocate the memory and then scatter? I don't want to go for a custom-datatype solution, since it's overkill for the scope of the problem I'm considering.


1 Answer


Ideally, the ghost cells should be part of the same memory block as your normal cells. That way, you can keep the addressing scheme simple. In that scheme, the image is distributed in multiples of complete rows, using MPI_Scatter and MPI_Gather. On a non-border rank you allocate enough memory for two additional ghost rows:

height = total_height / ranks;                      // rows owned by this rank
std::vector<float> data(width * (height + 2));      // owned rows + 2 ghost rows
float* image       = &data[width];                  // first owned row
float* ghost_north = &data[0];                      // ghost row above the owned block
float* ghost_south = &data[width * (height + 1)];   // ghost row below the owned block
float* inner_north = image;                         // first owned row, sent to the north neighbour
float* inner_south = &image[width * (height - 1)];  // last owned row, sent to the south neighbour
MPI_Scatter(root_image, width * height, MPI_FLOAT,
            image,      width * height, MPI_FLOAT, ...);
...
iterations {
    MPI_Sendrecv(inner_north, width, MPI_FLOAT, north, tag,
                 ghost_north, width, MPI_FLOAT, north, tag, ...);
    MPI_Sendrecv(inner_south, width, MPI_FLOAT, south, tag,
                 ghost_south, width, MPI_FLOAT, south, tag, ...);
    ... compute ...
}
MPI_Gather(image,      width * height, MPI_FLOAT,
           root_image, width * height, MPI_FLOAT, ...);

This pseudocode does not consider special border cases.
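One way to handle the borders without special-casing the communication is to use MPI_PROC_NULL as the missing neighbour of the first and last rank; a send or receive involving MPI_PROC_NULL is simply a no-op, so both exchanges can stay unconditional. A minimal sketch, reusing the pointer names from above (tag 0 and MPI_STATUS_IGNORE are just example choices):

int north = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
int south = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

// Valid on every rank: where the neighbour is MPI_PROC_NULL, nothing is
// sent or received and the corresponding ghost row keeps its current contents.
MPI_Sendrecv(inner_north, width, MPI_FLOAT, north, 0,
             ghost_north, width, MPI_FLOAT, north, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(inner_south, width, MPI_FLOAT, south, 0,
             ghost_south, width, MPI_FLOAT, south, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Since std::vector value-initializes its elements, the untouched ghost rows on the border ranks then act as zero padding.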

The issue with the simple one-dimensional splitting is that the communication cost and the amount of additional halo data are non-optimal, especially for smaller images and a larger number of participating ranks.
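As a rough, hypothetical back-of-the-envelope illustration (the image size and process grid below are made up): with a 1D split every interior rank exchanges two full-width rows per iteration, while a 2D split only exchanges the four edges of its tile.

// Halo cells exchanged per interior rank and per iteration, just to
// illustrate the scaling argument; border ranks exchange less.
#include <cstdio>

int main() {
    const int width = 1024, height = 1024;
    const int px = 4, py = 4;                                  // assumed 4 x 4 process grid (16 ranks)

    const int halo_1d = 2 * width;                             // north + south row
    const int halo_2d = 2 * (width / px) + 2 * (height / py);  // four tile edges

    std::printf("1D: %d halo cells, 2D: %d halo cells\n", halo_1d, halo_2d);
    return 0;
}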

Here is an excellent example by Rolf Rabenseifner regarding data decomposition and halo communication methods with MPI. He also explains how the communication methods can be improved. For a 2D decomposition, you will need derived MPI datatypes both for the initial communication and for the vertical boundaries.
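For illustration only (this is not taken from the linked material; tile, ghost_west, local_width, local_height and west are hypothetical names): a column of a row-major tile can be described with MPI_Type_vector, which is the kind of derived datatype the vertical boundaries need.

// One column of a row-major local_height x local_width tile,
// so a vertical boundary can be exchanged in a single call.
MPI_Datatype column_type;
MPI_Type_vector(local_height,       // one block per row
                1,                  // one float per block
                local_width,        // stride: one full row between elements
                MPI_FLOAT, &column_type);
MPI_Type_commit(&column_type);

// Example: exchange the leftmost column with the western neighbour,
// receiving into a contiguous ghost buffer of local_height floats.
MPI_Sendrecv(&tile[0],   1,            column_type, west, 0,
             ghost_west, local_height, MPI_FLOAT,   west, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Type_free(&column_type);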

Zulan
  • Thanks for a complete, well-defined answer. You mentioned something about the communication cost. If I were to test this model on increasing image sizes, my performance should then degrade, right? – Ankit Kulshrestha Mar 17 '17 at 13:13
  • Basically, the communication cost of the 1D decomposition will be *worse than that of a 2D decomposition*. The communication overhead depends on the balance of image size (larger = better), number of ranks (fewer = better), and computation per iteration (more = better). – Zulan Mar 17 '17 at 14:07
  • I agree with that part; 1D communication will involve lots of exchanges. Thanks for all your help. I'll go ahead and accept the answer soon. – Ankit Kulshrestha Mar 17 '17 at 17:53