
I'm writing a program that tests whether numbers are prime. At the beginning I calculate how many numbers to assign to each process and send that count to each process. Next, the calculations are performed and the data is sent back to process 0, which saves the results. The code below works, but when I increase the number of processes my program doesn't speed up. It seems to me that my program doesn't work in parallel. What's wrong? This is my first MPI program, so any advice is welcome.

I use MPICH2 and I test my program on an Intel Core i7-950.

main.cpp:

if (rank == 0) {
    int workers = (size-1);
    readFromFile(path);
    int elements_per_proc = (N + (workers-1)) / workers;
    int rest = N % elements_per_proc;

    // tell each worker how many numbers it will receive
    for (int i=1; i <= workers; i++) {
        if((i == workers) && (rest != 0))
            MPI_Send(&rest, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        else
            MPI_Send(&elements_per_proc, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    }

    // send the numbers one by one to the assigned worker
    int it = 1;
    for (int i=0; i < N; i++) {
        if ((i != 0) && ((i % elements_per_proc) == 0))
            it++;
        MPI_Isend(&input[i], 1, MPI_INT, it, 0, MPI_COMM_WORLD, &send_request);
    }
}

if (rank != 0) {
    // learn how many numbers this worker will get
    int count;
    MPI_Recv(&count, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // receive, test and answer one number at a time
    for (int j=0; j < count; j++) {
        MPI_Recv(&number, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = test(number, k);
        send_array[0] = number;
        send_array[1] = result;
        MPI_Send(send_array, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
}   

if (rank == 0) {
    for (int i=0; i < N; i++) {
        MPI_Recv(rec_array, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        //  save results
    }
}
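
(For reference, the `test(number, k)` call above isn't shown; as mentioned in the comments it is a Miller-Rabin primality test. A minimal sketch of what such a function could look like, where the exact implementation and the helper name power_mod are assumptions and not the original code:)

#include <cstdlib>   // rand()

// sketch of a possible test() implementation (assumed, not the poster's code)

// modular exponentiation: (base^exp) % mod, overflow-safe for 32-bit moduli
static unsigned long long power_mod(unsigned long long base, unsigned long long exp,
                                    unsigned long long mod) {
    unsigned long long result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

// Miller-Rabin test with k random rounds: 1 = probably prime, 0 = composite
int test(int n, int k) {
    if (n < 2) return 0;
    if (n < 4) return 1;            // 2 and 3 are prime
    if (n % 2 == 0) return 0;

    int d = n - 1, r = 0;           // write n-1 as d * 2^r with d odd
    while (d % 2 == 0) { d /= 2; r++; }

    for (int i = 0; i < k; i++) {
        int a = 2 + rand() % (n - 3);               // random witness in [2, n-2]
        unsigned long long x = power_mod(a, d, n);
        if (x == 1 || x == (unsigned long long)(n - 1))
            continue;
        bool witness_found = true;
        for (int j = 0; j < r - 1; j++) {
            x = (x * x) % n;
            if (x == (unsigned long long)(n - 1)) { witness_found = false; break; }
        }
        if (witness_found)
            return 0;               // a proves n composite
    }
    return 1;
}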
Bakus123
  • Is your prime-testing algorithm sequential? – Dieter Lücking May 28 '15 at 17:21
  • @Dieter Lücking, I use the Miller-Rabin primality test. Each process tests part of the numbers, and these operations are independent. – Bakus123 May 28 '15 at 17:28

1 Answer


Your implementation probably doesn't scale well to many processes, since you communicate in every step. You currently communicate the numbers and results for each single input, which incurs a large latency overhead. Instead, you should think about communicating the input in bulk (i.e., using a single message per worker).
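For illustration, a rough sketch of what bulk point-to-point communication could look like, keeping the question's names (input, N, workers, elements_per_proc, test and k come from the question; the use of MPI_Probe/MPI_Get_count to let each worker discover its chunk size is my own choice, not code from the question):

// sketch: one message per worker instead of one message per number (assumed variant)
if (rank == 0) {
    for (int i = 1; i <= workers; i++) {
        int offset = (i - 1) * elements_per_proc;
        if (offset > N) offset = N;               // trailing workers may get an empty chunk
        int chunk = (N - offset < elements_per_proc) ? N - offset : elements_per_proc;
        MPI_Send(&input[offset], chunk, MPI_INT, i, 0, MPI_COMM_WORLD);
    }
    // ...then receive one result message per worker instead of N single results
}

if (rank != 0) {
    MPI_Status status;
    int count;
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);      // how many numbers this worker received
    int* numbers = (int*) malloc(sizeof(int) * count);
    int* results = (int*) malloc(sizeof(int) * count);
    MPI_Recv(numbers, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int j = 0; j < count; j++)
        results[j] = test(numbers[j], k);         // purely local work, no communication
    MPI_Send(results, count, MPI_INT, 0, 0, MPI_COMM_WORLD);
    free(numbers); free(results);
}

Since process 0 computed the offsets itself, it can match each worker's returned result array back to the original numbers.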

In addition, using MPI collective operations (MPI_Scatter/MPI_Gather) instead of loops of MPI_Send/MPI_Recv might increase your performance further.

Additionally, you can utilize the master process to work on a chunk of the input as well.

A much more scalable implementation might then look as follows:

// tell everybody how many elements there are in total
MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

// everybody determines how many elements it will work on
// (including the master process)
int num_local_elements = N / size + (rank < N % size ? 1 : 0);
// allocate local size
int* local_input = (int*) malloc(sizeof(int)*num_local_elements);

// distribute the input from master to everybody using MPI_Scatterv
int* counts; int* displs;
if (rank == 0) {
    counts = (int*)malloc(sizeof(int) * size);
    displs = (int*)malloc(sizeof(int) * size);
    for (int i = 0; i < size; i++) {
        counts[i] = N / size + (i < N % size ? 1 : 0);
        displs[i] = (i > 0) ? displs[i-1] + counts[i-1] : 0;
    }
    // scatter from master
    MPI_Scatterv(input, counts, displs, MPI_INT, local_input, num_local_elements, MPI_INT, 0, MPI_COMM_WORLD);
} else {
    // receive scattered numbers
    MPI_Scatterv(NULL, NULL, NULL, MPI_DATATYPE_NULL, local_input, num_local_elements, MPI_INT, 0, MPI_COMM_WORLD);
}

// perform prime testing
int* local_results = (int*) malloc(sizeof(int)*num_local_elements);
for (int i = 0; i < num_local_elements; ++i) {
    local_results[i] = test(local_input[i], k);
}

// gather results back to master process
int* results;
if (rank == 0) {
    results = (int*)malloc(sizeof(int)*N);
    MPI_Gatherv(local_results, num_local_elements, MPI_INT, results, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);
    // TODO: save results on master process
} else {
    MPI_Gatherv(local_results, num_local_elements, MPI_INT, NULL, NULL, NULL, MPI_INT, 0, MPI_COMM_WORLD);
}
Patrick
  • Thank you very much, your implementation is great. I thought about MPI_Scatterv & MPI_Gatherv, but in that case data is sent/received by all processes (rank 0 too). I want rank 0 not to participate in the calculation, only to broadcast the data and save the results. What can I do to improve my implementation? – Bakus123 May 28 '15 at 18:48
  • Two things you can do: 1) send the problem and receive the results in bulk, meaning you have two communication phases and avoid sending every single number/result by itself; 2) edit the scatterv/gatherv so that the count for process `0` is `0` and the problem is distributed among the `(size-1)` worker processes instead of all `size` processes. – Patrick May 28 '15 at 18:51
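
A possible sketch of the second suggestion, adapted from the answer's code (the worker-only split below is one way to implement it, not code from the thread):

// sketch: distribute the N elements over the (size-1) worker processes only,
// so rank 0 gets a count of 0 and only scatters/gathers data (assumed variant)
int workers = size - 1;
if (rank == 0) {
    for (int i = 0; i < size; i++) {
        counts[i] = (i == 0) ? 0 : N / workers + (i - 1 < N % workers ? 1 : 0);
        displs[i] = (i > 0) ? displs[i-1] + counts[i-1] : 0;
    }
}
// every rank computes its own element count consistently with the counts above
int num_local_elements = (rank == 0) ? 0
                       : N / workers + (rank - 1 < N % workers ? 1 : 0);

The MPI_Scatterv/MPI_Gatherv calls from the answer can then stay exactly as they are; rank 0 simply contributes and receives zero elements.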