
A parallel Fortran code that solves a set of linear simultaneous equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but, for example, the code works perfectly for all the values I have tested up to N = 50000, yet fails at N = 94423.

In particular, the failure occurs during the call to the ScaLAPACK PDGESV routine itself (i.e. not when allocating or deallocating memory): the code enters PDGESV but never returns from it.

I am working on a Linux Mint 18.3 Sylvia system with 148 GB of memory and an Intel(R) Xeon(R) CPU E5-1660 v4 @ 3.20GHz processor, compiling with the mpifort wrapper around gfortran.

I am fairly confident that the problem is not in the Fortran code itself: the code works perfectly for every value of N and every process configuration I have tried up to N = 50000, exiting with the INFO = 0 code that indicates no errors occurred. (I also ran a slightly modified version of the program that explicitly checked the residual of the solution vector x*, i.e. computed Ax* - b, and found, correctly, maximum absolute values close to zero.) If there were some problem with the matrix being singular we would of course instead observe an exit from PDGESV with a non-zero INFO code.
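For concreteness, the residual check described above amounts to the following; here is a minimal serial sketch in plain Python on a made-up 2x2 system (Cramer's rule stands in for the distributed LU solve, and the matrix entries are purely illustrative):

```python
# Tiny stand-in for the residual check: solve A x = b, then verify max |A x* - b| ~ 0.
A = [[3.0, 1.0],
     [1.0, 2.0]]
b = [9.0, 8.0]

# Solve the 2x2 system by Cramer's rule (the real code uses ScaLAPACK's PDGESV).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - b[0] * A[1][0]) / det]

# Maximum absolute entry of the residual A x* - b; should be ~0 for a correct solve.
residual = max(abs(sum(A[i][j] * x[j] for j in range(2)) - b[i]) for i in range(2))
print(x, residual)
```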

The machine's memory also appears to be sufficient: the problem case N = 94423 requires only about 65 GB against the 148 GB available, there is no problem at allocation time, and a serial code solving the same problem, using the same 65 GB, runs without errors.
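As a quick sanity check on that figure, the dominant cost is the full N x N double-precision matrix A; a back-of-envelope sketch in Python:

```python
# Memory footprint of the dense N x N DOUBLE PRECISION matrix A at the failing size.
N = 94423
bytes_needed = 8 * N * N        # 8 bytes per double-precision element
print(f"{bytes_needed / 2**30:.1f} GiB")
```

which comes out at roughly 66 GiB, consistent with the ~65 GB quoted above.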

My feeling is instead that the run exceeds some default limit on the memory available to a single MPI process, i.e. perhaps I am simply missing some appropriate flags at compile or run time?
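One limit that is easy to overlook alongside OS memory limits: default Fortran INTEGERs (and the integers inside a 32-bit-integer BLAS/ScaLAPACK build) are 4 bytes. A quick arithmetic sketch in Python (the local dimension 47223 is NUMROC's row/column count on process (0,0) for N = 94423 with MB = NB = 32 on a 2x2 grid, 25008 is the same quantity for N = 50000, and 2**31 - 1 is HUGE of a 4-byte INTEGER) shows that the failing case's local array holds more elements than a 32-bit signed integer can index, whereas the working case's does not; whether this is actually the cause here I cannot say.

```python
# Can a default 4-byte Fortran INTEGER still index the per-process local array?
int32_max = 2**31 - 1            # HUGE(1) for a 4-byte INTEGER

mloc = nloc = 47223              # local dims on process (0,0): N=94423, MB=NB=32, 2x2 grid
print(mloc * nloc > int32_max)   # the failing case overflows a 32-bit index

mloc = nloc = 25008              # the same quantity for the working case N = 50000
print(mloc * nloc > int32_max)   # the working case stays below the limit
```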

I have tried the 'ulimit -s unlimited' command, but this did not resolve the problem.

I copy the Fortran code below; it is a simple test program that 1) allocates space for the matrix A and vector b, 2) fills their entries with random values, 3) calls PDGESV, and then 4) deallocates the memory.

I list the compilation and execution commands (using mpifort/gfortran) below.

Note that I have also tried the PGI Fortran compiler, and observed the same error for the same test case (see the error output below).

Fortran code:

      PROGRAM SOLVE_LU
      USE MPI
      IMPLICIT NONE
      INTEGER :: N
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: LOCAL_A
      DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: LOCAL_B 
      INTEGER :: ISTATUS
C     FOR SCALAPACK PDGESV CALL
      INTEGER  :: INFO,  NRHS, IA, JA, IB, JB
      INTEGER, ALLOCATABLE, DIMENSION (:) :: IPIV
c     FOR READING COMMAND LINE ARGUMENTS
      INTEGER :: IARGC, N_COMMAND_ARG
      CHARACTER :: ARGV*10 
C     WE USE FOLLOWING COMMAND LINE ARGUMENTS 
C     ARG 1 : N (DIMENSION OF PROBLEM)
C     ARG 2 : NPROW (NO. OF ROWS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 3 : NPCOL (NO. OF COLUMNS OF PROCESSES IN A RECTANGULAR ARRAY)
C     ARG 4 : BLACS BLOCK SIZE MB (BLOCKS ARE OF SIZE MB * MB) 
C   
c     FOR PARALLEL PROCESS ARRAY
      INTEGER  :: NPROW, NPCOL, ICTXT,MYROW, MYCOL, MB, NB, MLOC, NLOC
      INTEGER :: IDESCA(9), IDESCB(9)
      INTEGER :: IERR
      INTEGER :: NUMROC


c     for random number seed
      INTEGER :: ISEEDSIZE
      INTEGER, ALLOCATABLE, DIMENSION ( :) :: SEED

C      ----------------------------------------
C      -------  EXECUTABLE STATEMENTS   -------


C      ===============================================
C      READ IN COMMAND LINE ARGUMENTS IF PRESENT

      N_COMMAND_ARG = iargc()
      IF (N_COMMAND_ARG == 2) THEN
          WRITE(*,*) 'ILLEGAL NO. OF COMMAND LINE PARAMETERS'
          STOP
      ENDIF
      IF (N_COMMAND_ARG .GE. 1)THEN
          CALL GETARG(1,argv)
C          WRITE(*,*)'ARGV = ',ARGV
          READ (ARGV,'(I10)') N
      ELSE
          N = 100
      ENDIF   

      IF (N_COMMAND_ARG .GE. 3)THEN
          CALL GETARG(2,argv)
          READ (ARGV,'(I10)') NPROW
          CALL GETARG(3,argv)
          READ (ARGV,'(I10)') NPCOL

      ELSE
          NPROW = 2
          NPCOL = 2
      ENDIF 

      IF (N_COMMAND_ARG .GE. 4)THEN
          CALL GETARG(4,argv)
          READ (ARGV,'(I10)') MB
      ELSE
          MB = 8
      ENDIF
      NB = MB

C     ==============================================
C     INITIALISE THE BLACS PROCESS GRID, FIND DIMENSIONS OF LOCAL
C     MATRICES / VECTORS AND ALLOCATE SPACE

      CALL SL_INIT(ICTXT, NPROW, NPCOL)
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )

      MLOC = NUMROC(N, MB, MYROW, 0, NPROW)
      NLOC = NUMROC(N, NB, MYCOL, 0, NPCOL)

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 )WRITE(*,*)
     @       'WE ARE SOLVING A SYSTEM OF ', N, ' LINEAR EQUATIONS'

      WRITE(*,*) 'PROC: ',MYROW, MYCOL,'HAS  MLOC, NLOC =', MLOC,NLOC

c      ==============================================
C     ALLOCATE SPACE FOR MATRIX A AND VECTORS B AND X

      WRITE(*,*) 'PROC: ',MYROW, MYCOL,' ALLOCATING SPACE ...'

      ALLOCATE ( LOCAL_A(MLOC,NLOC), STAT = ISTATUS )
      IF(ISTATUS .NE. 0) THEN
          WRITE(*,*)'UNABLE TO ALLOCATE LOCAL_A, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF

      ALLOCATE ( LOCAL_B(MLOC), STAT = ISTATUS )
      IF (ISTATUS /= 0) THEN
          WRITE(*,*)
     @ ' FAILED TO ALLOCATE SPACE FOR LOCAL_B, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF

c     BLACS DESCRIPTOR FOR A AND ITS COPY
      CALL DESCINIT (IDESCA, N, N, MB, NB, 0, 0,
     @               ICTXT, MLOC, IERR)


c     BLACS DESCRIPTOR FOR B AND SOLN VECTOR X
      CALL DESCINIT (IDESCB, N, 1, MB, 1, 0, 0, ICTXT, MLOC, IERR)  

c      ==============================================
C      FILL ENTRIES OF MATRIX A AND R.H.S. VECTOR B WITH RANDOM ENTRIES

      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @        ' CONSTRUCTING MATRIX A AND RHS VECTOR B ...'

      CALL RANDOM_SEED

      CALL RANDOM_SEED ( SIZE = ISEEDSIZE ) ! GET SIZE OF SEED ARRAY

      ALLOCATE ( SEED(1:ISEEDSIZE) )
      CALL RANDOM_SEED ( GET = SEED )

      SEED(1) = SEED(1) + NPCOL*MYROW + MYCOL ! ENSURES DIFFERENT SEED
                                              ! FOR EACH PROCESS
      CALL RANDOM_SEED ( PUT = SEED )

      CALL RANDOM_NUMBER(LOCAL_B)

      CALL RANDOM_NUMBER(LOCAL_A)

c      ==============================================
C      CALL SCALAPACK LU SOLVER ROUTINE

      WRITE(*,*)'PROC: ',MYROW, MYCOL,
     @    'NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..'
      ALLOCATE ( IPIV(MLOC + MB), STAT=ISTATUS )
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO ALLOCATE IPIV, PROCESS: ',MYROW,MYCOL
          STOP
      ENDIF


      IA = 1
      JA = 1
      IB = 1
      JB = 1
      NRHS = 1
      INFO = 0

      CALL PDGESV(N, NRHS, LOCAL_A, IA, JA, IDESCA, IPIV, 
     @            LOCAL_B, IB, JB, IDESCB, INFO )

      IF( MYROW.EQ.0 .AND. MYCOL.EQ.0 ) THEN
          WRITE(*,*)
          WRITE(*,*) 'INFO code returned by PDGESV = ', INFO
          WRITE(*,*)
      END IF


c      ==============================================
C     DEALLOCATE MEMORY
      DEALLOCATE(LOCAL_A, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   


      DEALLOCATE(LOCAL_B, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   


      DEALLOCATE(IPIV, STAT=ISTATUS)
      IF(ISTATUS /= 0) THEN
          WRITE(*,*)'UNABLE TO DEALLOCATE ' 
          STOP
      ENDIF   

c     ===================================================
c     RELEASE BLACS CONTEXT

      CALL BLACS_GRIDEXIT(ictxt)
      CALL BLACS_EXIT(0)


      END PROGRAM SOLVE_LU

I compile the above code with:

      mpifort -Wall -mcmodel=medium -static-libgfortran -m64 /opt/openblas/lib/libopenblas.a /usr/local/lib/libscalapack.a /opt/openblas/lib/libopenblas.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran -o para.exe solve_by_lu_parallelmpi_simple_light.for /opt/openblas/lib/libopenblas.a /usr/local/lib/libscalapack.a /opt/openblas/lib/libopenblas.a -lm -lpthread -lgfortran -lm -lpthread -lgfortran

which produces no errors or warnings, and run it with (for example):

mpirun -n 4 ./para.exe 944 2 2 32 > DUMP05

Here we solve a system of 944 equations on a 2x2 BLACS process grid with a block size of 32.
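Incidentally, the MLOC/NLOC values reported in the output below can be predicted from ScaLAPACK's NUMROC; a Python transcription of its block-cyclic formula (argument order as in the Fortran routine, with source process 0 as in the calls above):

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Python transcription of ScaLAPACK's NUMROC: the number of rows/columns
    of an n-element dimension, distributed in nb-sized blocks, that land on
    process iproc out of nprocs (with the first block on process isrcproc)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                    # number of complete blocks
    nloc = (nblocks // nprocs) * nb      # whole block cycles owned by every process
    extrablks = nblocks % nprocs
    if mydist < extrablks:
        nloc += nb                       # one extra complete block
    elif mydist == extrablks:
        nloc += n % nb                   # the final partial block
    return nloc

# Reproduces the run below: N = 944, MB = NB = 32, two process rows/columns
print(numroc(944, 32, 0, 0, 2), numroc(944, 32, 1, 0, 2))   # prints: 480 464
```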

For this small-N case we get the (successful) output:

WE ARE SOLVING A SYSTEM OF 944 LINEAR EQUATIONS

PROC: 0 0 HAS MLOC, NLOC = 480 480

PROC: 0 0 ALLOCATING SPACE ...

PROC: 1 0 HAS MLOC, NLOC = 464 480

PROC: 1 0 ALLOCATING SPACE ...

PROC: 0 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 1 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 1 1 HAS MLOC, NLOC = 464 464

PROC: 1 1 ALLOCATING SPACE ...

PROC: 1 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 0 1 HAS MLOC, NLOC = 480 464

PROC: 0 1 ALLOCATING SPACE ...

PROC: 0 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...

PROC: 0 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 1 0 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 1 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

PROC: 0 1 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

INFO code returned by PDGESV = 0

So far so good. However, running instead with:

mpirun -n 4 ./para.exe 94423 2 2 32 > DUMP06

yields the following error (note that such a run requires 65 GB of memory, and takes around 45 minutes on my machine):

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

(the two messages above are each printed eight times)

For some reason no backtrace information is printed, but running the same code built with the PGI Fortran compiler (on a different machine running Red Hat Linux 7.3) also fails, with the following output:

[sca1993:113193] *** Process received signal ***

[sca1993:113193] Signal: Segmentation fault (11)

[sca1993:113193] Signal code: Address not mapped (1)

[sca1993:113193] Failing at address: 0x2b8c5a036390

[sca1993:113193] [ 0] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libpthread.so.0(+0xf5d0)[0x2b900528c5d0]

[sca1993:113193] [ 1] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(+0x280c950)[0x2b9003acc950]

[sca1993:113193] [ 2] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(daxpy_k_HASWELL+0x7f)[0x2b9003acc54f]

[sca1993:113193] [ 3] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_k_HASWELL+0xd5)[0x2b9003ad6635]

[sca1993:113193] [ 4] /usr/local/pgi/linux86-64/17.7/lib/libblas.so.0(dger_+0x21f)[0x2b90013d9f5f]

[sca1993:113193] [ 5] ./para_try.exe[0x446e70]

[sca1993:113193] [ 6] ./para_try.exe[0x41b4ad]

[sca1993:113193] [ 7] ./para_try.exe[0x4071e1]

[sca1993:113193] [ 8] ./para_try.exe[0x406b39]

[sca1993:113193] [ 9] ./para_try.exe[0x404ba6]

[sca1993:113193] [10] ./para_try.exe[0x403654]

[sca1993:113193] [11] /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6(__libc_start_main+0xf5)[0x2b9005cb83d5]

[sca1993:113193] [12] ./para_try.exe[0x403549]

[sca1993:113193] *** End of error message ***

If anyone has any suggestions I would be very grateful. Many thanks, Dan.
