
I am facing the following situation:

given N MPI nodes and a 2D real array of dimensions [N_ROWS, N_COLS],

I have to partition it, giving each node a subsection of the 2D array, in order to speed up the computation by taking advantage of the number of nodes.

Following the Fortran way of storing data in memory (arrays are indexed with the most rapidly changing index first), every [:,i] column of the array is "logically" separated from the others.
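
To make this concrete, here is a tiny self-contained example (not part of my real code, the names are just illustrative) showing that the elements of one column sit next to each other in memory, so each [:,i] column can be treated as one contiguous block:

program column_major_demo
    implicit none
    integer, parameter :: n_rows = 3, n_cols = 2
    integer :: i, j
    real :: a(n_rows, n_cols), flat(n_rows*n_cols)

    ! Encode the (row, column) position in each value.
    do j = 1, n_cols
        do i = 1, n_rows
            a(i, j) = 10.0*i + j
        end do
    end do

    ! Walk the array in its linear memory order.
    flat = reshape(a, [n_rows*n_cols])

    ! Prints the values in the order 11, 21, 31, 12, 22, 32:
    ! column 1 first, then column 2, i.e. each column is contiguous in memory.
    print *, flat
end program column_major_demo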

I have looked around at very illuminating questions like this one: Sending 2D arrays in Fortran with MPI_Gather.

And I have reached the idea of using mpi_scatterv and mpi_gatherv, BUT I'm stuck on the fact that the problem constraints give no guarantee that each MPI node receives the same amount of data, or, in pseudo code:

#Number_of_MPI_nodes != N_ROWS*N_COLS

I was looking to use vector types, since each "column" has its own "independent" series of data; by "independent" I mean that I have to do some manipulation on the data belonging to a given column without affecting the other columns.

Obviously, given the inequality above, some MPI nodes will have a different number of "columns" to analyze.
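
For concreteness, here is a minimal sketch of the split I have in mind (the names n_cols, n_procs and cols_per_rank are only illustrative): divide N_COLS by the number of processes and hand out the remainder one extra column at a time:

program split_demo
    implicit none
    integer, parameter :: n_cols = 100, n_procs = 3
    integer :: cols_per_rank(n_procs)
    integer :: base, rest, p

    base = n_cols / n_procs        ! columns every process gets
    rest = mod(n_cols, n_procs)    ! leftover columns to distribute

    cols_per_rank = base
    do p = 1, rest
        cols_per_rank(p) = cols_per_rank(p) + 1   ! first 'rest' processes get one more
    end do

    ! For n_cols = 100 and n_procs = 3 this prints 34 33 33.
    print *, cols_per_rank
end program split_demo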

After doing some math, I need to gather the data back using mpi_gatherv.

I will update the question with a working example in a few hours!

Thanks a lot to everybody!

CODE:

program main

use mpi

implicit none

integer:: N_COLS=100, N_ROWS=200
integer:: i, j
integer:: ID_mpi, COM_mpi, ERROR_mpi
integer:: master = 0, SIZE_mpi=0 
integer:: to_each_cpu=0, to_each_cpu_oddment=0
integer:: sub_matrix_size=0 
integer:: nans=0, infs=0, array_split =0, my_type=0

integer ,dimension(:), allocatable :: elem_to_each_cpu
integer ,dimension(:), allocatable :: displacements
integer,parameter:: seed = 12345

character*160:: message

real :: tot_sum = 0.0

real ,dimension(:,:), allocatable:: Data_Matrix
real ,dimension(:,:), allocatable:: sub_split_Data_Matrix

call srand(seed)    
call MPI_INIT(ERROR_mpi)
COM_mpi = MPI_COMM_WORLD
call MPI_COMM_RANK(COM_mpi,ID_mpi,ERROR_mpi)
call MPI_COMM_SIZE(COM_mpi,SIZE_mpi,ERROR_mpi)

!! allocation Data_Matrix 
i = 1; j = 1
if (ID_mpi .eq. master) then
    i = N_ROWS; j = N_COLS
end if
allocate(Data_Matrix(i, j))

do j = 1, N_COLS
    do i = 1, N_ROWS
        Data_Matrix(i, j) = rand()
        tot_sum = tot_sum + Data_Matrix(i, j)
    enddo
enddo

write(message,*) "N_COLS:",N_COLS, "N_ROWS:", N_ROWS, " TOTAL_SUM:", tot_sum
write(*,*) message

!! SINCE THERE ARE NO RESTRICTIONS ON MPI NUMBER OR CPUS OR 
!! SIZE OR Data_Matrix I NEED TO DO THIS 
to_each_cpu =N_COLS / SIZE_mpi
to_each_cpu_oddment = N_COLS -( to_each_cpu * SIZE_mpi )

allocate(elem_to_each_cpu(SIZE_mpi))
elem_to_each_cpu = to_each_cpu
allocate(displacements(SIZE_mpi))
displacements = 0

!! I CHOOSE TO SPLIT THE DATA IN THIS WAY
if (ID_mpi .eq. master) then

    write(message,*) "N_COLS:",N_COLS, "mpisize:", SIZE_mpi, "to_each_cpu\oddment:", to_each_cpu, " \ ", to_each_cpu_oddment
    write(*,*) message

    j=1
    do i = 1 , to_each_cpu_oddment
        elem_to_each_cpu(j) = elem_to_each_cpu(j) + 1
        j = j + 1
        if(j .gt. SIZE_mpi) j = 1
    enddo

    do j = 2, SIZE_mpi
        displacements(j) = elem_to_each_cpu(j-1) + displacements(j-1)
    enddo

    do i = 1 , SIZE_mpi
        write(message,*)i, " to_each_cpu:", &
        elem_to_each_cpu(i), " sub_split_buff_displ:",displacements(i), "=",elem_to_each_cpu(i)+displacements(i)
        write(*,*) message
    enddo

end if

call MPI_BCAST(elem_to_each_cpu, SIZE_mpi, MPI_INT, 0, COM_mpi, ERROR_mpi)
call MPI_BCAST(displacements, SIZE_mpi, MPI_INT, 0, COM_mpi, ERROR_mpi)

allocate( sub_split_Data_Matrix(N_ROWS,elem_to_each_cpu(ID_mpi+1)) )

call MPI_TYPE_VECTOR(N_COLS,N_ROWS,N_ROWS,MPI_FLOAT,my_type,ERROR_mpi) 
call MPI_TYPE_COMMIT(my_type, ERROR_mpi) 

sub_split_Data_Matrix=0
sub_matrix_size = N_ROWS*elem_to_each_cpu(ID_mpi+1)

call MPI_scatterv( Data_Matrix,elem_to_each_cpu,displacements,&
    MPI_FLOAT, sub_split_Data_Matrix, sub_matrix_size ,MPI_FLOAT, &
    0, COM_mpi, ERROR_mpi)

!!! DOING SOME MATH ON SCATTERED MATRIX 

call MPI_gatherv(&
    sub_split_Data_Matrix, sub_matrix_size,MPI_FLOAT ,&
    Data_Matrix, elem_to_each_cpu, displacements, &
    MPI_FLOAT, 0, COM_mpi, ERROR_mpi)

!!! DOING SOME MATH ON GATHERED MATRIX 
tot_sum = 0.0
do j = 1, N_COLS
    do i = 1, N_ROWS
        tot_sum = tot_sum + Data_Matrix(i, j)
    enddo
enddo

write(message,*) "N_COLS:",N_COLS, "N_ROWS:", N_ROWS, " TOTAL_SUM:", tot_sum
write(*,*) message


deallocate(Data_Matrix)

if (ID_mpi .eq. master) then
    deallocate(elem_to_each_cpu )
    deallocate(displacements )
endif

deallocate(sub_split_Data_Matrix)

end 

RESULT:

Error occurred in MPI_Gatherv on communicator MPI_COMM_WORLD

Invalid memory reference

QUESTION:

Can you help me find the error? Or better, can you tell me whether the approach I used was appropriate?

Thanks a lot!

  • So what is your question? I see only an analysis of your situation. To get help, it is better to clearly state the question. – innoSPG Sep 30 '15 at 14:27
  • What most MPI users do (I assert, and others will pile in with disagreement if they wish) is accept some moderate level of load imbalance. So dividing a `15x15` grid across 4 processes might leave process 0 with an `8x8` sub-grid, process 1 with `7x8`, process 2 with `8x7` and process 3 with `7x7`. The additional complexity of trying to equalise the workloads of processes 0 and 3 is generally not worthwhile. – High Performance Mark Sep 30 '15 at 14:50
  • Generally speaking, a certain level of imbalance is invariant in terms of time to complete the task. But my question now is whether I did it well using MPI_Type_vector, or whether it would have been better to use MPI_Type_create_subarray. – pm2r Oct 04 '15 at 22:03

1 Answer


I had a look at your code and made some changes to fix it:

  • Unimportant: a few stylistic/cosmetic changes here and there to (from my standpoint, and that is arguable) improve readability. Sorry if you don't like them.
  • There is no need for process 0 to be the only one computing the lengths and displacements for the MPI_Scatterv()/MPI_Gatherv() calls. All processes should compute them, since they all have the necessary data to do so. Moreover, it spares you two MPI_Bcast() calls, which is good.
  • The lengths were strangely computed. I suspect they were wrong, but I'm not sure, since the code was so convoluted that I just rewrote it.
  • The main issue was a mix-up between the vector type and the scalar type: your lengths and displacements were computed for your vector type, but you were calling MPI_Scatterv()/MPI_Gatherv() with the scalar type. Moreover, for Fortran, this scalar type is MPI_REAL, not MPI_FLOAT. In the code posted below, I computed the lengths and displacements for MPI_REAL, but if you prefer, you can divide them all by N_ROWS and use the result of MPI_Type_contiguous( N_ROWS, MPI_REAL, my_type ) instead of MPI_REAL in the scatter/gather, and get the same result (see the sketch right after this list).
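
For illustration, here is a minimal, untested sketch of that alternative; it reuses the variable names of the code below and assumes elem_to_each_cpu, displacements and sub_matrix_size hold the per-process counts/offsets in REAL elements, as computed there. A column type is built with MPI_Type_contiguous(), the counts and displacements are divided by N_ROWS so that they are expressed in whole columns, and the column type replaces MPI_REAL in the scatter/gather:

    ! Sketch of the MPI_Type_contiguous() variant: one type unit = one column of N_ROWS reals.
    integer :: col_type

    call MPI_Type_contiguous( N_ROWS, MPI_REAL, col_type, ERROR_mpi )
    call MPI_Type_commit( col_type, ERROR_mpi )

    ! Counts and displacements are now measured in columns, hence the division by N_ROWS.
    call MPI_Scatterv( Data_Matrix, elem_to_each_cpu/N_ROWS, displacements/N_ROWS, col_type, &
                       sub_split_Data_Matrix, sub_matrix_size/N_ROWS, col_type, &
                       master, COM_mpi, ERROR_mpi )

    call MPI_Gatherv( sub_split_Data_Matrix, sub_matrix_size/N_ROWS, col_type, &
                      Data_Matrix, elem_to_each_cpu/N_ROWS, displacements/N_ROWS, col_type, &
                      master, COM_mpi, ERROR_mpi )

    call MPI_Type_free( col_type, ERROR_mpi )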

Here is the modified code:

program main
    use mpi
    implicit none

    integer, parameter :: N_COLS=100, N_ROWS=200, master=0
    integer :: i, j
    integer :: ID_mpi,SIZE_mpi, COM_mpi, ERROR_mpi, my_type
    integer :: to_each_cpu, to_each_cpu_oddment, sub_matrix_size 
    integer, allocatable :: elem_to_each_cpu(:), displacements(:)
    real :: tot_sum = 0.0
    real, allocatable :: Data_Matrix(:,:), sub_split_Data_Matrix(:,:)

    call MPI_Init( ERROR_mpi )
    COM_mpi = MPI_COMM_WORLD
    call MPI_Comm_rank( COM_mpi, ID_mpi, ERROR_mpi )
    call MPI_Comm_size( COM_mpi, SIZE_mpi, ERROR_mpi )

    !! allocation Data_Matrix 
    if ( ID_mpi == master ) then
        allocate( Data_Matrix( N_ROWS, N_COLS ) )
        call random_number( Data_Matrix )
        do j = 1, N_COLS
            do i = 1, N_ROWS
                tot_sum = tot_sum + Data_Matrix(i, j)
            enddo
        enddo
        print *, "N_COLS:", N_COLS, "N_ROWS:", N_ROWS, " TOTAL_SUM:", tot_sum
    end if

    !! SINCE THERE ARE NO RESTRICTIONS ON MPI NUMBER OR CPUS OR 
    !! SIZE OR Data_Matrix I NEED TO DO THIS 
    to_each_cpu = N_COLS / SIZE_mpi
    to_each_cpu_oddment = N_COLS - ( to_each_cpu * SIZE_mpi )

    allocate( elem_to_each_cpu(SIZE_mpi) )
    elem_to_each_cpu = to_each_cpu * N_ROWS
    allocate( displacements(SIZE_mpi) )
    displacements = 0

    !! I CHOOSE TO SPLIT THE DATA IN THIS WAY
    if ( ID_mpi == master ) then
        print *, "N_COLS:", N_COLS, "mpisize:", SIZE_mpi, "to_each_cpu\oddment:", to_each_cpu, " \ ", to_each_cpu_oddment
    end if 

    do i = 1, to_each_cpu_oddment
       elem_to_each_cpu(i) = elem_to_each_cpu(i) + N_ROWS
    enddo

    do i = 1, SIZE_mpi-1
        displacements(i+1) = displacements(i) + elem_to_each_cpu(i)
    enddo

    if ( ID_mpi == master ) then
        do i = 1, SIZE_mpi
            print *, i, " to_each_cpu:", &
                elem_to_each_cpu(i), " sub_split_buff_displ:", displacements(i), &
                "=", elem_to_each_cpu(i) + displacements(i)
        enddo
    end if

    allocate( sub_split_Data_Matrix(N_ROWS, elem_to_each_cpu(ID_mpi+1)/N_ROWS) )

    sub_split_Data_Matrix = 0
    sub_matrix_size = elem_to_each_cpu(ID_mpi+1)

    call MPI_scatterv( Data_Matrix, elem_to_each_cpu, displacements, MPI_REAL, &
                       sub_split_Data_Matrix, sub_matrix_size, MPI_REAL, &
                       master, COM_mpi, ERROR_mpi )

    !!! DOING SOME MATH ON SCATTERED MATRIX 

    call MPI_gatherv( sub_split_Data_Matrix, sub_matrix_size, MPI_REAL, &
                      Data_Matrix, elem_to_each_cpu, displacements, MPI_REAL, &
                      master, COM_mpi, ERROR_mpi )

    !!! DOING SOME MATH ON GATHERED MATRIX 
    if ( ID_mpi == master ) then
        tot_sum = 0.0
        do j = 1, N_COLS
            do i = 1, N_ROWS
                tot_sum = tot_sum + Data_Matrix(i, j)
            enddo
        enddo

        print *, "N_COLS:", N_COLS, "N_ROWS:", N_ROWS, " TOTAL_SUM:", tot_sum
        deallocate( Data_Matrix )
    endif

    deallocate( elem_to_each_cpu )
    deallocate( displacements )
    deallocate( sub_split_Data_Matrix )

end program main

With these modifications, the code works as expected (note that the counts and displacements printed below are in REAL elements, e.g. 34 columns x 200 rows = 6800 elements for the first process):

$ mpif90 scat_gath2.f90
$ mpirun -n 3 ./a.out 
 N_COLS:         100 N_ROWS:         200  TOTAL_SUM:   10004.4443    
 N_COLS:         100 mpisize:           3 to_each_cpu\oddment:          33  \            1
           1  to_each_cpu:        6800  sub_split_buff_displ:           0 =        6800
           2  to_each_cpu:        6600  sub_split_buff_displ:        6800 =       13400
           3  to_each_cpu:        6600  sub_split_buff_displ:       13400 =       20000
 N_COLS:         100 N_ROWS:         200  TOTAL_SUM:   10004.4443    
Gilles
  • Thanks a lot!!! It's not a problem that you edited the style of the code. I would like to better understand how to overcome this kind of problem in the future: how Fortran stores objects such as 2D arrays/matrices in memory, and how to "calculate" (I hope you understand the sense in which I use the word) the right dimensions for ScatterV and GatherV. – pm2r Oct 05 '15 at 15:04