I am using Python with mpi4py to run parallel code on a computing cluster, and I am getting the following error:

Assertion failed in file src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.$
[cli_15]: aborting job:
internal ABORT - process 15

I put in print statements to see where it is happening, and it occurs when I try to broadcast a large matrix (14x14x217) from one process to the others (32 processes in total). The code works fine when I run tests that produce a smaller matrix, 14x14x61. Here are the relevant parts of the code (the error occurs during the comm.Bcast):

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

...

recv_buffer=numpy.zeros((g.numbands,g.numbands,g.gridsize),'complex')
senddata=numpy.zeros((g.numbands,g.numbands,g.gridsize+6),'complex')
if rank==size-1:
    g.updateVHF(rank,size)  #perform calculation on part of data
    for i in range(size-1):
        comm.Recv(recv_buffer,source=i,tag=0)
        g.VHartree=g.VHartree+recv_buffer[:]

        comm.Recv(recv_buffer,source=i,tag=1)
        g.VFock=g.VFock+recv_buffer[:]

        g.updateBasis()
        senddata[:,:,0:g.gridsize]=g.wf
        senddata[:,:,g.gridsize::]=g.wf0

else:
    g.updateVHF(rank,size)  # perform calculation on part of data
    comm.Send(g.VHartree,dest=size-1,tag=0)
    comm.Send(g.VFock,dest=size-1,tag=1)

comm.Bcast(senddata,root=size-1)  # broadcast to everyone else
if rank != size-1:  # rank==size-1 already has updated values
    g.wf=senddata[:,:,0:g.gridsize]
    g.wf0=senddata[:,:,g.gridsize::]
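
If it helps, here is a minimal standalone sketch that isolates just the broadcast, using the same dtype and array sizes in the same ballpark as the failing run (I am assuming g.numbands = 14 and g.gridsize roughly 217 here; the exact values don't matter much):

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

numbands, gridsize = 14, 217   # assumed dimensions, close to the failing case
senddata = numpy.zeros((numbands, numbands, gridsize + 6), 'complex')
if rank == size - 1:
    senddata[:] = 1.0 + 2.0j   # dummy payload on the broadcasting rank

comm.Bcast(senddata, root=size - 1)

# every rank should now hold the root's data
assert numpy.allclose(senddata, 1.0 + 2.0j)

(This would be run with the same process count as the real job, e.g. mpiexec -n 32 python bcast_test.py.)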

I found the following: http://listarc.com/showthread.php?4387119-Assertion+failure and "mpi4py hangs when trying to send large data", which suggest that there is some size limit on the data that can be sent between processes. Am I correct in thinking that my error is due to reaching some limit on the size of the data transfer? If so, why does it occur only during the Bcast and not during the other Send/Recv communications, since the matrices involved are almost the same size?
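
For reference, the buffers really are nearly the same size; assuming g.numbands = 14, g.gridsize = 217, and numpy's default 'complex' (complex128, 16 bytes per element), both come out to roughly 0.7 MB:

import numpy

numbands, gridsize = 14, 217   # assumed values for the failing run
recv_buffer = numpy.zeros((numbands, numbands, gridsize), 'complex')
senddata = numpy.zeros((numbands, numbands, gridsize + 6), 'complex')
print(recv_buffer.nbytes)   # 680512 bytes (~0.65 MB) per Send/Recv
print(senddata.nbytes)      # 699328 bytes (~0.67 MB) for the Bcast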
