How to share work roughly evenly between processes in MPI despite the array_size not being cleanly divisible by the number of processes?

Question

Hi all, I have an array of length N, and I'd like to divide it as best as possible between 'size' processors. N/size has a remainder, e.g. 1000 array elements divided by 7 processes, or 14 processes by 3 processes.

I'm aware of at least a couple of ways of work sharing in MPI, such as:

for (i=rank; i<N;i+=size){ a[i] = DO_SOME_WORK }

However, this does not divide the array into contiguous chunks, which I'd like to do as I believe is faster for IO reasons.

Another one I'm aware of is:

int count = N / size;
int start = rank * count;
int stop = start + count;

// now perform the loop
int nloops = 0;

for (int i=start; i<stop; ++i)
{
    a[i] = DO_SOME_WORK;
}

However, with this method, for my first example we get 1000/7 = 142 = count. And so the last rank starts at 852 and ends at 994. The last 6 lines are ignored.

Would be best solution to append something like this to the previous code?

int remainder = N%size;
int start = N-remainder; 
if (rank == 0){
     for (i=start;i<N;i++){
         a[i] = DO_SOME_WORK;
     }

This seems messy, and if its the best solution I'm surprised I haven't seen it elsewhere.

Thanks for any help!

Alexander Pozdneev · Answer 1 · 2016-12-04T15:30:34.287

If I had N tasks (e.g., array elements) and size workers (e.g., MPI ranks), I would go as follows:

int count = N / size;
int remainder = N % size;
int start, stop;

if (rank < remainder) {
    // The first 'remainder' ranks get 'count + 1' tasks each
    start = rank * (count + 1);
    stop = start + count;
} else {
    // The remaining 'size - remainder' ranks get 'count' task each
    start = rank * count + remainder;
    stop = start + (count - 1);
}

for (int i = start; i <= stop; ++i) { a[i] = DO_SOME_WORK(); }

That is how it works:

/*
  # ranks:                    remainder                     size - remainder
            /------------------------------------\ /-----------------------------\
     rank:      0         1             remainder-1                         size-1
           +---------+---------+-......-+---------+-------+-------+-.....-+-------+
    tasks: | count+1 | count+1 | ...... | count+1 | count | count | ..... | count |
           +---------+---------+-......-+---------+-------+-------+-.....-+-------+
                      ^       ^                            ^     ^
                      |       |                            |     |
   task #:  rank * (count+1)  |        rank * count + remainder  |
                              |                                  |
   task #:  rank * (count+1) + count   rank * count + remainder + count - 1

            \------------------------------------/ 
  # tasks:       remainder * count + remainder
*/

score 5 · Answer 2 · answered Feb 25 '17 at 09:16

5

Here's a closed-form solution.

Let N = array length and P = number of processors.

From j = 0 to P-1,

Starting point of array on processor j = floor(N * j / P)

Length of array on processor j = floor(N * (j + 1) / P) – floor(N * j / P)

answered Feb 25 '17 at 09:16

uohzxela

613
4
14
28

Rob Latham · Answer 3 · 2015-04-21T19:03:25.197

Consider your "1000 steps and 7 processes" example.

simple division won't work because integer division (in C) gives you the floor, and you are left with some remainder: i.e. 1000 / 7 is 142, and there will be 6 doodads hanging out
ceiling division has the opposite problem: ceil(1000/7) is 143, but then the last processor overruns the array, or ends up with less to do than the others.

You are asking for a scheme to evenly distribute the remainder over processors. Some processes should have 142, others 143. There must be a more formal approach but considering the attention this question's gotten in the last six months maybe not.

Here's my approach. Every process needs to do this algorithm, and just pick out the answer it needs for itself.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char ** argv)
{
#define NR_ITEMS 1000
    int i, rank, nprocs;;
    int *bins;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bins = calloc(nprocs, sizeof(int));

    int nr_alloced = 0;
    for (i=0; i<nprocs; i++) {
        remainder = NR_ITEMS - nr_alloced;
        buckets = (nprocs - i);
        /* if you want the "big" buckets up front, do ceiling division */
        bins[i] = remainder / buckets;
        nr_alloced += bins[i];
    }

    if (rank == 0)
        for (i=0; i<nprocs; i++) printf("%d ", bins[i]);

    MPI_Finalize();
    return 0;
}

The complexity is O(nprocs) in memory and time. Be careful, as that might hurt memory and performance for very large number of processes. — Alexander Pozdneev, Dec 04 '16 at 19:21

score 1 · Answer 4 · answered Apr 20 '16 at 15:04

I know this is long sense gone but a simple way to do this is to give each process the floor of the (number of items) / (number of processes) + (1 if process_num < num_items mod num_procs). In python, an array with work counts:

# Number of items
NI=128
# Number of processes
NP=20

# Items per process
[NI/NP + (1 if P < NI%NP else 0)for P in range(0,NP)]

score 1 · Answer 5 · answered Apr 17 '21 at 05:42

Improving off of @Alexander's answer: make use of min to condense the logic.

int count = N / size;
int remainder = N % size;
int start = rank * count + min(rank, remainder);
int stop = (rank + 1) * count + min(rank + 1, remainder);

for (int i = start; i < stop; ++i) { a[i] = DO_SOME_WORK(); }

score 0 · Answer 6 · answered Mar 27 '13 at 17:03

I think that the best solution is to write yourself a little function for splitting work across processes evenly enough. Here's some pseudo-code, I'm sure you can write C (is that C in your question ?) better than I can.

function split_evenly_enough(num_steps, num_processes)
    return = repmat(0, num_processes)  ! pseudo-Matlab for an array of num_processes 0s
    steps_per_process = ceiling(num_steps/num_processes)
    return = steps_per_process - 1 ! set all elements of the return vector to this number
    return(1:mod(num_steps, num_processes)) = steps_per_process  ! some processes have 1 more step
end

user76284 · Answer 7 · 2015-09-12T20:28:36.973

How about this?

int* distribute(int total, int processes) {
    int* distribution = new int[processes];
    int last = processes - 1;        

    int remaining = total;
    int process = 0;

    while (remaining != 0) {
        ++distribution[process];
        --remaining;

        if (process != last) {
            ++process;
        }
        else {
            process = 0;
        }
    }

    return distribution;
}

The idea is that you assign an element to the first process, then an element to the second process, then an element to the third process, and so on, jumping back to the first process whenever the last one is reached.

This method works even when the number of processes is greater than the number of elements. It uses only very simple operations and should therefore be very fast.

Armut · Answer 8 · 2018-07-26T06:18:17.470

I had a similar problem, and here is my non optimum solution with Python and mpi4py API. An optimum solution would take into account how the processors are laid out, here extra work is ditributed to lower ranks. The uneven workload only differ by one task, so it should not be a big deal in general.

from mpi4py import MPI
import sys
def get_start_end(comm,N):
    """
    Distribute N consecutive things (rows of a matrix , blocks of a 1D array)
    as evenly as possible over a given communicator.
    Uneven workload (differs by 1 at most) is on the initial ranks.

    Parameters
    ----------
    comm: MPI communicator
    N:  int
    Total number of things to be distributed.

    Returns
    ----------
    rstart: index of first local row
    rend: 1 + index of last row

    Notes
    ----------
    Index is zero based.
    """

    P      = comm.size
    rank   = comm.rank
    rstart = 0
    rend   = N
    if P >= N:
        if rank < N:
            rstart = rank
            rend   = rank + 1
        else:
            rstart = 0
            rend   = 0
    else:
        n = N//P # Integer division PEP-238
        remainder = N%P
        rstart    = n * rank
        rend      = n * (rank+1)
        if remainder:
            if rank >= remainder:
                rstart += remainder
                rend   += remainder
            else:
                rstart += rank
                rend   += rank + 1
    return rstart, rend

if __name__ == '__main__':
    comm = MPI.COMM_WORLD
    n = int(sys.argv[1])
    print(comm.rank,get_start_end(comm,n))

How to share work roughly evenly between processes in MPI despite the array_size not being cleanly divisible by the number of processes?

8 Answers8

Linked