
I am working on a problem where I need to do matrix multiplication in MPI by distributing columns of one of the matrices across processes. For A*B = C, B is the matrix to be sliced.

I do the following:

MPI_Bcast(A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD); where A is allocated on all ranks

MPI_Type_vector(n, n/p, n, MPI_DOUBLE, &tmp_type);
MPI_Type_create_resized(tmp_type, 0, n/p*sizeof(double), &col_type);
MPI_Type_commit(&col_type);

where n is the size of A and B, and p is the number of processes.

MPI_Scatter(B, 1, col_type, b, n/p*n, MPI_DOUBLE, 0, MPI_COMM_WORLD); where B is allocated only on the root and b is allocated on all ranks

cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n/p, n, 1.0, A, n, b, n/p, 0.0, c, n/p ); where c is allocated on all ranks (the multiplication on each process is done by a BLAS routine)

MPI_Gather(c, n/p*n, MPI_DOUBLE, C, 1, col_type, 0, MPI_COMM_WORLD); where C is allocated only on root

My code does the required work for small matrices (sizes < 62), but it fails for matrices bigger than this and gives the error below:

[csicluster01:12280] *** An error occurred in MPI_Gather
[csicluster01:12280] *** on communicator MPI_COMM_WORLD
[csicluster01:12280] *** MPI_ERR_TRUNCATE: message truncated
[csicluster01:12280] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 12280 on
node csicluster01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Are there any obvious errors here that I have not been able to spot? Or could the problem be caused by some issue with the processors being used?

akshat
  • I guess you are either in the same class, or are the same person that already asked this: http://stackoverflow.com/q/23974767/681865 – talonmies Jun 01 '14 at 11:16
  • well... he is my classmate, but he couldn't specify the problem clearly; I think I have made the issue clearer – akshat Jun 01 '14 at 11:22
  • possible duplicate of [Sending columns of a matrix using MPI\_Scatter](http://stackoverflow.com/questions/10788180/sending-columns-of-a-matrix-using-mpi-scatter) – Jonathan Dursi Jun 02 '14 at 11:43
  • I've already gone through that post; it's not a duplicate because here I mention a specific error that I get in some particular cases. This error has not been discussed anywhere. Please answer my question if you have any clue as to what MPI may be doing. – akshat Jun 03 '14 at 11:49

1 Answer


The problem may come from

MPI_Type_create_resized(tmp_type, 0, n/p*sizeof(double), &col_type);

You may need to change it to

MPI_Type_create_resized(tmp_type, 0, n/p*n*sizeof(double), &col_type);

This change would seem logical, since you perform calls like

MPI_Scatter(B, 1, col_type, b, n/p*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

But I don't know if this change solves your issue of column partitioning. It might only solve the issue of MPI_Gather() triggering the error.

Bye,

francis