I am working on a problem where I need to do matrix multiplication in MPI by distributing column slices of one of the matrices across processors: A*B = C, where B is the matrix to be sliced.
I do the following:
MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
where A is allocated on all ranks
MPI_Type_vector(n, n/p, n, MPI_DOUBLE, &tmp_type);
MPI_Type_create_resized(tmp_type, 0, n/p*sizeof(double), &col_type);
MPI_Type_commit(&col_type);
where n is the dimension of A and B (both n×n) and p is the number of processors
MPI_Scatter(B, 1, col_type, b, n/p*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
where B is allocated only on the root and b is allocated on all ranks
cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n/p, n, 1.0, A, n, b, n/p, 0.0, c, n/p );
where c is allocated on all ranks (the multiplication on each processor is done by the BLAS routine)
MPI_Gather(c, n/p*n, MPI_DOUBLE, C, 1, col_type, 0, MPI_COMM_WORLD);
where C is allocated only on the root
My code produces correct results for small matrices (sizes < 62), but it fails for matrices bigger than this with the error below:
[csicluster01:12280] *** An error occurred in MPI_Gather
[csicluster01:12280] *** on communicator MPI_COMM_WORLD
[csicluster01:12280] *** MPI_ERR_TRUNCATE: message truncated
[csicluster01:12280] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 12280 on
node csicluster01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Are there any obvious errors here that I have not been able to spot? Or is it possible that the problem is caused by some issue with the processors being used?