
I'm not posting all of the code, since I'm trying to keep this part as readable as possible and the rest of the program is not essential. E is an identity matrix and A is an invertible matrix. Solve(...) is called from int main() and transforms both E and A; as a result, E should hold the inverse of A and A should become the identity matrix.

When I run this program with 1 process, it works perfectly. When I add another process, the resulting matrices are far from what they should be; it looks like the operations are performed in the wrong order. I've read (for example, here: OpenMPI MPI_Barrier problems) that MPI interleaves output from different processes, so adding print statements between operations doesn't help me see what's happening. Also, since the program works fine with one process, the algorithm itself should be correct; the whole problem seems to come from MPI, and it shouldn't have appeared if MPI_Barrier(MPI_COMM_WORLD) had done its job properly.
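
For context, the debug output I add between steps is roughly of this form (a simplified sketch, not my actual code); even with the rank prefix and an explicit flush, lines from different processes still arrive interleaved:

#include <mpi.h>
#include <cstdio>

// Rank-tagged, flushed debug print (illustration only).
void DebugPrint(int rank, size_t k, const char *msg)
{
    std::printf("[rank %d] k=%zu: %s\n", rank, k, msg);
    std::fflush(stdout);
}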

Does MPI_Barrier(MPI_COMM_WORLD) actually synchronize the processes here? Why does the program work fine with 1 process but fail with 2?

Thank you.

struct Matrix{
    vector<vector<double>> mat;
    size_t m; //rows
    size_t n; //columns
};

void Solve(Matrix &A, Matrix &E, size_t rank, size_t size)
{
    size_t n=A.m;
    size_t ind;

    for(size_t k=0; k<n; k++)
    {
        if(rank==0) // the initial actions are performed by process 0
        {
            ind=A.MainElement_Row(k,k);
            E.SwapColumns(k,ind); 
            A.SwapColumns(k,ind);
            E.MultColumn(k,1/A.mat[k][k]); 
            A.MultColumn(k,1/A.mat[k][k]);
        }
        MPI_Barrier(MPI_COMM_WORLD); // I don't want the other processes to go further until process 0 has done its work

        for(size_t i=0; i<n; i++)
            MPI_Bcast(&E.mat[i][0], n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for(size_t i=0; i<n; i++)
            MPI_Bcast(&A.mat[i][0], n, MPI_DOUBLE, 0, MPI_COMM_WORLD); // now all processes know how the matrices look

        MPI_Barrier(MPI_COMM_WORLD); // Me making sure they all got it

        // Here the parallel part starts. Each process works on its own share of
        // columns: the process with rank r changes columns r, r+size, r+2*size, etc.,
        // so the processes' sets of columns don't intersect.
        for(size_t i=rank; i<n; i=i+size)
        {
            if(i!=k)
            {
                E.SubtractColumns(i,k,A.mat[k][i]); 
                A.SubtractColumns(i,k,A.mat[k][i]); 
            }
        }

        MPI_Barrier(MPI_COMM_WORLD); // Let's make sure all processes did their job

        for(size_t i=0; i<size; i++)
        {
            for(size_t j=0; j<n; j++)
                MPI_Bcast(&E.mat[j][0], n, MPI_DOUBLE, i, MPI_COMM_WORLD);

            for(size_t j=0; j<n; j++)
                MPI_Bcast(&A.mat[j][0], n, MPI_DOUBLE, i, MPI_COMM_WORLD); // maybe it's not the best choice of function, but now every process has broadcast its changes to the matrices
        }
        MPI_Barrier(MPI_COMM_WORLD); // make sure they all know how the matrices look now
    }
}
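
As Gilles Gouaillardet points out in the comments below, the last for loop over roots is logically equivalent to broadcasting the matrices once with root 0: after the first iteration every rank's buffers already contain rank 0's rows, so the broadcasts with the other roots just resend that same data. A minimal standalone sketch of the effect (illustration only, not part of my program):

#include <mpi.h>
#include <cstdio>

// Broadcasting the same buffer repeatedly with different roots only
// propagates root 0's data: after the first MPI_Bcast every rank holds 1.0.
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double x = rank + 1.0;                 // each rank starts with its own value
    for(int root = 0; root < size; root++)
        MPI_Bcast(&x, 1, MPI_DOUBLE, root, MPI_COMM_WORLD);

    std::printf("rank %d: x = %g\n", rank, x);  // prints 1.0 on every rank
    MPI_Finalize();
    return 0;
}

This is why the improved version below broadcasts each changed column from the rank that actually owns it.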

(Edit: improved version of the code)

// Broadcast column `num` of both E and A (element by element, one double per row) from process `root`.
void SendColumn(Matrix &A, Matrix &E, size_t root, size_t n, size_t num)
{
    for(size_t m=0; m<n; m++)
    {
        MPI_Bcast(&E.mat[m][num], 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
        MPI_Bcast(&A.mat[m][num], 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
    }
}

void Solve(Matrix &A, Matrix &E, size_t rank, size_t size)
{
    size_t n=A.m;
    size_t ind;

    for(size_t k=0; k<n; k++)
    {
        if(rank==0)
        {
            ind=A.MainElement_Row(k,k);
            E.SwapColumns(k,ind);
            A.SwapColumns(k,ind);
            E.MultColumn(k,1/A.mat[k][k]);
            A.MultColumn(k,1/A.mat[k][k]);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        SendColumn(A,E,0,n,k);
        SendColumn(A,E,0,n,ind);

        MPI_Barrier(MPI_COMM_WORLD);

        for(size_t i=rank; i<n; i=i+size)
        {
            if(i!=k) 
            {
                E.SubtractColumns(i,k,A.mat[k][i]);
                A.SubtractColumns(i,k,A.mat[k][i]);
            }
        }

        MPI_Barrier(MPI_COMM_WORLD);

        for(size_t root=0; root<size; root++)
        {
            for(size_t i=root; i<n; i=i+size)
            {
                if(i!=k)
                    SendColumn(A,E,root,n,i);
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
    } 
}
  • "Also, since the program works fine with one process, the algorithm itself should be correct; the whole problem seems to come from MPI, and it shouldn't have appeared if MPI_Barrier(MPI_COMM_WORLD) had done its job properly." This is incorrect on too many levels! Manually run the program with 2 MPI tasks and a 2x2 matrix, and you should find that your (parallel) algorithm is wrong regardless of whether MPI_Barrier() is working or not (I bet it does, even if it is useless here). – Gilles Gouaillardet Aug 15 '20 at 04:57
  • @GillesGouaillardet I just did that on a piece of paper for the matrix {{1,2}, {2,2}} and, imho, it works. Process 0 subtracts the 1st column from the 0th and process 1 subtracts the 0th column from the 1st. In this example, process 1 must do its job before process 0. – Haldot Aug 15 '20 at 11:24
  • @GillesGouaillardet Could you give a 2x2 matrix for which, in your opinion, the algorithm fails? – Haldot Aug 15 '20 at 11:47
  • If your algorithm only works if one task happens to finish before another, then this is a textbook race condition and your algorithm is flawed. Anyway, I strongly doubt you manually ran beyond the column subtractions (e.g. the broadcasts). – Gilles Gouaillardet Aug 15 '20 at 12:23
  • I don't get you. 1) The reason I use MPI_Barrier is to avoid a race condition. Actually, that is what my original question is about: I tried to fix the order of the processes' actions everywhere it could affect the result, yet it seems the processes don't synchronize and the result is affected. 2) Could you point to the exact place that causes your strong doubts? I still see nothing wrong with the column subtractions: I have n columns and I need to subtract one of them from all the others, so I divide this job between the processes. No two processes change the same column. – Haldot Aug 15 '20 at 12:39
  • A barrier is a point of synchronization, it cannot be used to order tasks. Since you did not post the code involved in column subtraction, I am going to assume it is correct, but what you do right after looks fishy. – Gilles Gouaillardet Aug 15 '20 at 13:12
  • You mean the MPI_Bcast() in a for loop? I do that in order to broadcast the changed matrices to all processes, so that after the increment of k all processes start working with the same matrices. Is there a problem with it that can affect the result? I agree that so far this part of the code doesn't look nice, and I will work on it once the program gives the right inverse matrix and I know the approach is conceptually correct. – Haldot Aug 15 '20 at 13:33
  • yep, this `for` loop is logically equivalent to broadcasting the matrices once with `root=0`. – Gilles Gouaillardet Aug 15 '20 at 13:41
  • But I change root in a loop for(int i=0; i – Haldot Aug 15 '20 at 13:46
  • manually run it with a 1x1 matrix and 2 MPI tasks and see what happens... – Gilles Gouaillardet Aug 15 '20 at 13:49
  • Ok, I realized what you were talking about. Thanks! Indeed, after the first iteration of the `for` loop I was broadcasting the same matrices over again. I got rid of that loop and updated the code in the question body. The problem is, now I'm getting a segmentation fault, and it's coming from somewhere in the new `for` loop where I call `SendColumn`. Any ideas what the reason could be? – Haldot Aug 15 '20 at 22:17
  • At first glance, I did not spot an error from the snippets; please trim your code down to a [mcve] if you need my help. I also strongly encourage you to allocate your matrices in contiguous memory, use a derived datatype to communicate a column in one shot, and check whether `MPI_Allgather()` (or `MPI_Allgatherv()`) would be a better fit here (a rough sketch of this idea follows the comments). – Gilles Gouaillardet Aug 16 '20 at 02:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219927/discussion-between-haldot-and-gilles-gouaillardet). – Haldot Aug 16 '20 at 23:11
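
For reference, a rough sketch of the contiguous-storage / derived-datatype idea from the last comment. This is only an illustration under assumed names and layout (FlatMatrix, BcastColumn); it is not code from the question:

#include <mpi.h>
#include <vector>

// Hypothetical contiguous layout: row-major, element (i,j) at data[i*n + j].
struct FlatMatrix
{
    std::vector<double> data;
    size_t n;                                  // square n x n matrix for simplicity
    double &at(size_t i, size_t j) { return data[i*n + j]; }
};

// Broadcast column `col` of M from process `root` with a single MPI_Bcast.
void BcastColumn(FlatMatrix &M, int col, int root)
{
    MPI_Datatype coltype;
    // n elements, one double each, separated by a full row (stride n).
    MPI_Type_vector(static_cast<int>(M.n), 1, static_cast<int>(M.n),
                    MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);
    MPI_Bcast(&M.at(0, col), 1, coltype, root, MPI_COMM_WORLD);
    MPI_Type_free(&coltype);
}

With a contiguous layout like this, the loop of per-column broadcasts might also be replaced by MPI_Allgather() or MPI_Allgatherv(), as suggested above, so that every rank receives all updated columns in one collective call.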

0 Answers