A parallel Fortran code that solves a set of linear simultaneous equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but, for example, the code works perfectly for all the values I have tested up to N = 50000, yet fails at N = 94423.
In particular, the failure occurs during the call to the ScaLAPACK routine PDGESV itself (i.e. not while allocating or deallocating memory): execution enters PDGESV but never returns from it.
I am working on a Linux Mint 18.3 Sylvia system with 148 GB of memory and an Intel(R) Xeon(R) CPU E5-1660 v4 @ 3.20GHz processor. I am using the mpifort wrapper with gfortran.
I am fairly confident that there is no problem with the Fortran code itself, as the code works perfectly for every value of N and every process configuration I have tried up to N = 50000, exiting with the INFO = 0 code that indicates no errors occurred. (I also ran a slightly modified version of the program that explicitly checked the residual for the solution vector x*, i.e. computed Ax* - b, and found, correctly, maximum absolute values close to zero.) If the matrix were singular, we would of course instead observe PDGESV returning with a non-zero INFO code.
The machine's memory also appears to be sufficient; the problem case N = 94423 requires only about 65 GB of the available 148 GB, there is no problem at allocation time, and a serial code solving the same problem, using the same 65 GB, runs without errors.
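(The 65 GB figure is essentially just the storage for the double-precision matrix A; b and IPIV are negligible by comparison. A quick back-of-envelope check:)

```python
# Total storage for the dense N x N DOUBLE PRECISION matrix A,
# summed over all processes.
N = 94423
bytes_A = N * N * 8          # 8 bytes per DOUBLE PRECISION entry
gib = bytes_A / 2**30
print(bytes_A, round(gib, 1))   # 71325623432 66.4
```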
My feeling is that instead I am exceeding some default limit on the memory available to a single process under MPI, i.e. perhaps I am simply missing some appropriate flags at compilation or run time?
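One possibly relevant observation: for N = 94423 on a 2x2 grid, the element count of each local block of A exceeds the largest default-kind (32-bit) Fortran INTEGER, so any 32-bit linear index into the local array would overflow. I do not know whether the BLAS/ScaLAPACK internals actually index this way; the quick check below approximates the local dimension as ceil(N/2), ignoring block-cyclic remainders.

```python
# Rough size of the local block of A held by each process on a 2 x 2 grid.
N = 94423
mloc = -(-N // 2)            # ceil(N/2) local rows (and columns), roughly
local_elements = mloc * mloc
int32_max = 2**31 - 1        # HUGE(1) for a default 32-bit INTEGER
print(local_elements, int32_max, local_elements > int32_max)
# -> 2228972944 2147483647 True
```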
I have tried the 'ulimit -s unlimited' command, but this did not resolve the problem.
I copy the Fortran code below; it is a simple test program that 1) allocates space for the matrix A and vector b, 2) fills their entries with random values, 3) calls PDGESV, and then 4) deallocates the memory.
I list the compilation and execution commands (using mpifort/gfortran) below.
Note that I have also tried the PGI Fortran compiler and observed the same error for the same test case (see the error output below).
Fortran code:
      PROGRAM SOLVE_LU
      USE MPI
      IMPLICIT NONE
      INTEGER :: N
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: LOCAL_A
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:)   :: LOCAL_B
      INTEGER :: ISTATUS
C     FOR SCALAPACK PDGESV CALL
      INTEGER :: INFO, NRHS, IA, JA, IB, JB
      INTEGER, ALLOCATABLE, DIMENSION(:) :: IPIV
C     FOR READING COMMAND LINE ARGUMENTS
      INTEGER :: IARGC, N_COMMAND_ARG
      CHARACTER :: ARGV*10
C     WE USE THE FOLLOWING COMMAND LINE ARGUMENTS
C     ARG 1 : N     (DIMENSION OF PROBLEM)
C     ARG 2 : NPROW (NO. OF ROWS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 3 : NPCOL (NO. OF COLS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 4 : BLACS BLOCK SIZE MB (BLOCKS ARE OF SIZE MB * MB)
C
C     FOR PARALLEL PROCESS ARRAY
      INTEGER :: NPROW, NPCOL, ICTXT, MYROW, MYCOL, MB, NB, MLOC, NLOC
      INTEGER :: IDESCA(9), IDESCB(9)
      INTEGER :: IERR
      INTEGER :: NUMROC
C     FOR RANDOM NUMBER SEED
      INTEGER :: ISEEDSIZE
      INTEGER, ALLOCATABLE, DIMENSION(:) :: SEED
C     ----------------------------------------
C     ------- EXECUTABLE STATEMENTS -------
C     ===============================================
C     READ IN COMMAND LINE ARGUMENTS IF PRESENT
      N_COMMAND_ARG = IARGC()
      IF (N_COMMAND_ARG == 2) THEN
         WRITE(*,*) 'ILLEGAL NO. OF COMMAND LINE PARAMETERS'
         STOP
      ENDIF
      IF (N_COMMAND_ARG .GE. 1) THEN
         CALL GETARG(1, ARGV)
C        WRITE(*,*)'ARGV = ',ARGV
         READ (ARGV,'(I10)') N
      ELSE
         N = 100
      ENDIF
      IF (N_COMMAND_ARG .GE. 3) THEN
         CALL GETARG(2, ARGV)
         READ (ARGV,'(I10)') NPROW
         CALL GETARG(3, ARGV)
         READ (ARGV,'(I10)') NPCOL
      ELSE
         NPROW = 2
         NPCOL = 2
      ENDIF
      IF (N_COMMAND_ARG .GE. 4) THEN
         CALL GETARG(4, ARGV)
         READ (ARGV,'(I10)') MB
      ELSE
         MB = 8
      ENDIF
      NB = MB
C     ==============================================
C     INITIALISE THE BLACS PROCESS GRID, FIND DIMENSIONS OF LOCAL
C     MATRICES / VECTORS AND ALLOCATE SPACE
      CALL SL_INIT(ICTXT, NPROW, NPCOL)
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      MLOC = NUMROC(N, MB, MYROW, 0, NPROW)
      NLOC = NUMROC(N, NB, MYCOL, 0, NPCOL)
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) WRITE(*,*)
     @   'WE ARE SOLVING A SYSTEM OF ', N, ' LINEAR EQUATIONS'
      WRITE(*,*) 'PROC: ',MYROW, MYCOL,'HAS MLOC, NLOC =', MLOC,NLOC
C     ==============================================
C     ALLOCATE SPACE FOR MATRIX A AND VECTORS B AND X
      WRITE(*,*) 'PROC: ',MYROW, MYCOL,' ALLOCATING SPACE ...'
      ALLOCATE ( LOCAL_A(MLOC,NLOC), STAT = ISTATUS )
      IF (ISTATUS .NE. 0) THEN
         WRITE(*,*)'UNABLE TO ALLOCATE LOCAL_A, PROCESS: ',MYROW,MYCOL
         STOP
      ENDIF
      ALLOCATE ( LOCAL_B(MLOC), STAT = ISTATUS )
      IF (ISTATUS /= 0) THEN
         WRITE(*,*)
     @   ' FAILED TO ALLOCATE SPACE FOR LOCAL_B, PROCESS: ',MYROW,MYCOL
         STOP
      ENDIF
C     BLACS DESCRIPTOR FOR A AND ITS COPY
      CALL DESCINIT (IDESCA, N, N, MB, NB, 0, 0,
     @               ICTXT, MLOC, IERR)
C     BLACS DESCRIPTOR FOR B AND SOLN VECTOR X
      CALL DESCINIT (IDESCB, N, 1, MB, 1, 0, 0, ICTXT, MLOC, IERR)
C     ==============================================
C     FILL ENTRIES OF MATRIX A AND R.H.S. VECTOR B WITH RANDOM ENTRIES
      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @   ' CONSTRUCTING MATRIX A AND RHS VECTOR B ...'
      CALL RANDOM_SEED
      CALL RANDOM_SEED ( SIZE = ISEEDSIZE ) ! GET SIZE OF SEED ARRAY
      ALLOCATE ( SEED(1:ISEEDSIZE) )
      CALL RANDOM_SEED ( GET = SEED )
      SEED(1) = SEED(1) + NPCOL*MYROW + MYCOL ! ENSURES DIFFERENT SEED
                                              ! FOR EACH PROCESS
      CALL RANDOM_SEED ( PUT = SEED )
      CALL RANDOM_NUMBER(LOCAL_B)
      CALL RANDOM_NUMBER(LOCAL_A)
C     ==============================================
C     CALL SCALAPACK LU SOLVER ROUTINE
      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @   'NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..'
      ALLOCATE ( IPIV(MLOC + MB), STAT = ISTATUS )
      IF (ISTATUS /= 0) THEN
         WRITE(*,*)'UNABLE TO ALLOCATE IPIV, PROCESS: ',MYROW,MYCOL
         STOP
      ENDIF
      IA = 1
      JA = 1
      IB = 1
      JB = 1
      NRHS = 1
      INFO = 0
      CALL PDGESV(N, NRHS, LOCAL_A, IA, JA, IDESCA, IPIV,
     @            LOCAL_B, IB, JB, IDESCB, INFO )
      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
         WRITE(*,*)
         WRITE(*,*) 'INFO code returned by PDGESV = ', INFO
         WRITE(*,*)
      END IF
C     ==============================================
C     DEALLOCATE MEMORY
      DEALLOCATE(LOCAL_A, STAT=ISTATUS)
      IF (ISTATUS /= 0) THEN
         WRITE(*,*)'UNABLE TO DEALLOCATE LOCAL_A'
         STOP
      ENDIF
      DEALLOCATE(LOCAL_B, STAT=ISTATUS)
      IF (ISTATUS /= 0) THEN
         WRITE(*,*)'UNABLE TO DEALLOCATE LOCAL_B'
         STOP
      ENDIF
      DEALLOCATE(IPIV, STAT=ISTATUS)
      IF (ISTATUS /= 0) THEN
         WRITE(*,*)'UNABLE TO DEALLOCATE IPIV'
         STOP
      ENDIF
C     ===================================================
C     RELEASE BLACS CONTEXT
      CALL BLACS_GRIDEXIT(ICTXT)
      CALL BLACS_EXIT(0)
      END PROGRAM SOLVE_LU
I compile the above code with: mpifort -Wall -mcmodel=medium -static-libgfortran -m64 -o para.exe solve_by_lu_parallelmpi_simple_light.for /opt/openblas/lib/libopenblas.a /usr/local/lib/libscalapack.a /opt/openblas/lib/libopenblas.a -lm -lpthread -lgfortran
which produces no errors or warnings, and run it with (for example):
mpirun -n 4 ./para.exe 944 2 2 32 > DUMP05
where we solve a system of 944 equations using a 2x2 BLACS process grid with a block size of 32.
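(As a sanity check on the block-cyclic distribution, the per-process MLOC/NLOC values in the output below can be reproduced with my own small re-implementation of the NUMROC logic, assuming the source process is 0:)

```python
# Block-cyclic count, mirroring ScaLAPACK's NUMROC: how many of n items,
# dealt out in blocks of nb over nprocs processes, land on process iproc
# (process 0 receives the first block).
def numroc(n, nb, iproc, nprocs):
    nblocks = n // nb                   # number of complete blocks
    count = (nblocks // nprocs) * nb    # full rounds of dealing
    extra = nblocks % nprocs            # leftover complete blocks
    if iproc < extra:
        count += nb                     # one extra complete block
    elif iproc == extra:
        count += n % nb                 # the final partial block
    return count

# 944 equations, 2 x 2 grid, block size 32, as in the run above
print(numroc(944, 32, 0, 2))   # 480  (process row/column 0)
print(numroc(944, 32, 1, 2))   # 464  (process row/column 1)
```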
For this small N case we get the (successful run) output:
WE ARE SOLVING A SYSTEM OF 944 LINEAR EQUATIONS
PROC: 0 0 HAS MLOC, NLOC = 480 480
PROC: 0 0 ALLOCATING SPACE ...
PROC: 1 0 HAS MLOC, NLOC = 464 480
PROC: 1 0 ALLOCATING SPACE ...
PROC: 0 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 1 HAS MLOC, NLOC = 464 464
PROC: 1 1 ALLOCATING SPACE ...
PROC: 1 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 0 1 HAS MLOC, NLOC = 480 464
PROC: 0 1 ALLOCATING SPACE ...
PROC: 0 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 0 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV
.. PROC: 1 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV
.. PROC: 1 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV
.. PROC: 0 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV
..
INFO code returned by PDGESV = 0
So far so good. However, running instead with:
mpirun -n 4 ./para.exe 94423 2 2 32 > DUMP06
yields the following error (note that such an execution requires 65 GB of memory and takes around 45 minutes on my machine):
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
Backtrace for this error:
For some reason no backtrace information is printed, but running the same code compiled with the PGI Fortran compiler (on a different machine running Red Hat Linux 7.3) yields failure with the following output:
[sca1993:113193] * Process received signal *
[sca1993:113193] Signal: Segmentation fault (11)
[sca1993:113193] Signal code: Address not mapped (1)
[sca1993:113193] Failing at address: 0x2b8c5a036390
[sca1993:113193] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5d0)[0x2b900528c5d0]
[sca1993:113193] [ 1] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(+0x280c950)[0x2b9003acc950]
[sca1993:113193] [ 2] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(daxpy_k_HASWELL+0x7f)[0x2b9003acc54f]
[sca1993:113193] [ 3] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_k_HASWELL+0xd5)[0x2b9003ad6635]
[sca1993:113193] [ 4] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_+0x21f)[0x2b90013d9f5f]
[sca1993:113193] [ 5] ./para_try.exe[0x446e70]
[sca1993:113193] [ 6] ./para_try.exe[0x41b4ad]
[sca1993:113193] [ 7] ./para_try.exe[0x4071e1]
[sca1993:113193] [ 8] ./para_try.exe[0x406b39]
[sca1993:113193] [ 9] ./para_try.exe[0x404ba6]
[sca1993:113193] [10] ./para_try.exe[0x403654]
[sca1993:113193] [11] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5)[0x2b9005cb83d5]
[sca1993:113193] [12] ./para_try.exe[0x403549]
[sca1993:113193] * End of error message *
If anyone has any suggestions I would be very grateful. Many thanks, Dan.