
I have written a scientific code and, as usual, this boils down to calculating the coefficients in an algebraic eigenvalue equation: calculating these coefficients requires integrating over multi-dimensional arrays, which quickly inflates memory usage drastically. Once the matrix coefficients are calculated, the original, pre-integration multi-dimensional arrays can be deallocated and intelligent solvers take over, so memory usage ceases to be the big issue. As you can see, there is a bottleneck, and on my 64-bit, 4-core, 8-thread, 8 GB RAM laptop the program crashes due to insufficient memory.

I am therefore implementing a system that keeps memory usage in check by limiting the size of the tasks that the MPI processes can take on when calculating some of the eigenvalue matrix elements. When finished, they then look for remaining jobs to be done, so the matrix still gets filled, but in a more sequential and less parallel way, as sketched below.
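As an illustration of the idea (a minimal sketch with placeholder names, not my actual code), the number of rows a process may take on in one task could be derived from a memory budget like this:

function rows_per_task(n_cols,max_mem_per_proc) result(n_rows)
    use ISO_FORTRAN_ENV
    ! sketch: how many rows of an (n_rows,n_cols) doubles array fit in a
    ! memory budget of max_mem_per_proc (in MB, as in the test code below)
    implicit none
    integer, parameter :: dp = REAL64                                       ! double precision
    integer, intent(in) :: n_cols                                           ! nr. of columns of work array
    real(dp), intent(in) :: max_mem_per_proc                                ! memory budget per process [MB]
    integer :: n_rows                                                       ! nr. of rows that fit

    n_rows = floor(max_mem_per_proc*1.E6_dp/(8._dp*n_cols))                 ! 8 bytes per double
    n_rows = max(n_rows,1)                                                  ! take on at least one row
end function rows_per_task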

I was therefore checking how much memory I can allocate, and here is where the confusion starts: I allocate doubles of size 8 bytes (checked using sizeof(1._dp)) and look at the allocation status.
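For instance, a quick check of the element size (note that sizeof is a GNU extension; storage_size, in bits, is the standard Fortran 2008 equivalent):

program check_size
    use ISO_FORTRAN_ENV
    implicit none
    integer, parameter :: dp = REAL64                                       ! double precision

    write(*,*) 'sizeof(1._dp):', sizeof(1._dp), 'bytes'                     ! GNU extension
    write(*,*) 'storage_size(1._dp):', storage_size(1._dp)/8, 'bytes'       ! standard F2008
end program check_size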

Though I have only 8 GB of RAM available, running a test with a single process I can allocate an array of up to size (40000,40000), which corresponds to about 13 GB of memory! My first question is thus: how is this possible? Is there really that much virtual memory?

Secondly, I realized that I can do the same thing with multiple processes: up to 16 processes can simultaneously allocate these massive arrays!

Surely this cannot be right?

Does somebody understand why this happens, and whether I am doing something wrong?

Edit:

Here is a code that produces the aforementioned miracle, at least on my machine. However, when I set the elements of the arrays to some value, it indeed behaves as it should and crashes, or at least starts behaving very slowly, which I guess is because slow swap memory is being used?

program test_miracle
    use ISO_FORTRAN_ENV
    use MPI

    implicit none

    ! global variables
    integer, parameter :: dp = REAL64                                           ! double precision
    integer, parameter :: max_str_ln = 120                                      ! maximum length of filenames
    integer :: ierr                                                             ! error variable
    integer :: n_procs                                                          ! MPI nr. of procs

    ! start MPI
    call MPI_init(ierr)                                                         ! initialize MPI
    call MPI_Comm_size(MPI_Comm_world,n_procs,ierr)                             ! nr. MPI processes
    write(*,*) 'RUNNING MPI WITH', n_procs, 'processes'

    ! call asking for 6 GB
    call test_max_memory(6000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! call asking for 13 GB
    call test_max_memory(13000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! call asking for 14 GB
    call test_max_memory(14000._dp)
    call MPI_Barrier(MPI_Comm_world,ierr)

    ! stop MPI
    call MPI_finalize(ierr)

contains
    ! test whether maximum memory feasible
    subroutine test_max_memory(max_mem_per_proc)
        ! input/output
        real(dp), intent(in) :: max_mem_per_proc                                ! maximum memory per process

        ! local variables
        character(len=max_str_ln) :: err_msg                                    ! error message
        integer :: n_max                                                        ! maximum size of array
        real(dp), allocatable :: max_mem_arr(:,:)                               ! array with maximum size
        integer :: ierr                                                         ! error variable

        write(*,*) ' > Testing whether maximum memory per process of ',&
            &max_mem_per_proc/1000, 'GB is possible'

        n_max = ceiling(sqrt(max_mem_per_proc/(sizeof(1._dp)*1.E-6)))

        write(*,*) '   * Allocating doubles array of size', n_max

        allocate(max_mem_arr(n_max,n_max),STAT=ierr)
        err_msg = '   * cannot allocate this much memory. Try setting &
            &"max_mem_per_proc" lower'
        if (ierr.ne.0) then
            write(*,*) err_msg
            stop
        end if

        !max_mem_arr = 0._dp                                                ! UNCOMMENT TO MAKE MIRACLE DISAPPEAR

        deallocate(max_mem_arr)

        write(*,*) '   * Maximum memory allocatable'
    end subroutine test_max_memory
end program test_miracle

To be saved in test.f90 and subsequently compiled and run with

mpif90 test.f90 -o test && mpirun -np 2 ./test
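
For those who want to watch the difference between reserved and physically used memory while the test runs: a Linux-only helper like the one below (hypothetical, not part of the original code) could be added to the contains section and called before and after the allocation. VmSize jumps at the allocate statement, while VmRSS only grows once pages are actually written.

    ! report virtual (VmSize) and resident (VmRSS) memory of this process
    ! by parsing /proc/self/status (Linux only)
    subroutine print_memory_status()
        ! local variables
        character(len=max_str_ln) :: line                                   ! line of status file
        integer :: file_i                                                   ! file unit
        integer :: ierr                                                     ! error variable

        open(newunit=file_i,file='/proc/self/status',status='old',&
            &action='read',iostat=ierr)
        if (ierr.ne.0) return                                               ! not on Linux
        do
            read(file_i,'(A)',iostat=ierr) line
            if (ierr.ne.0) exit
            if (line(1:6).eq.'VmSize' .or. line(1:5).eq.'VmRSS') &
                &write(*,*) trim(line)
        end do
        close(file_i)
    end subroutine print_memory_status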
Toon
  • Could you post a minimal working code that reproduces this miracle? – francis Sep 21 '15 at 17:41
  • Try setting all the memory to zero, i.e. array = 0.0. What happens then? I suspect this is a manifestation of http://stackoverflow.com/questions/864416/are-some-allocators-lazy which for HPC is mostly a pain in the proverbial ... – Ian Bush Sep 21 '15 at 17:51
  • @IanBush I'm very certain this is exactly the problem. In fact, I ran into this exact issue earlier today. OP, if as soon as you allocate the memory, you try to set it to 0.0, it will most assuredly crash. – NoseKnowsAll Sep 22 '15 at 02:56

1 Answer


When you execute an allocate statement, you reserve a range in the virtual memory space. The virtual space is the sum of the physical memory plus the swap, plus possibly some extra space available due to overcommit, where the system assumes you will not actually use the whole reservation.

But the memory is not physically reserved until you write something into it. When you write to memory, the system physically allocates the corresponding page for you. If you never initialize your array, and if your array is very sparse, it is possible that many pages are never written, so the memory is never fully used physically.
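This is easy to see in isolation: the sketch below (assuming the usual 4 KiB page size) touches only one double per page, yet that is enough to force the kernel to physically back every page, so resident memory grows to the full array size even though almost no elements were written.

program touch_pages
    use ISO_FORTRAN_ENV
    implicit none
    integer, parameter :: dp = REAL64                      ! double precision
    integer(INT64), parameter :: n = 500000000_INT64      ! ~4 GB of doubles
    integer(INT64), parameter :: stride = 4096/8          ! one double per 4 KiB page
    real(dp), allocatable :: a(:)
    integer(INT64) :: i

    allocate(a(n))                                         ! reserves virtual space only
    do i = 1,n,stride
        a(i) = 0._dp                                       ! first write: page becomes resident
    end do
end program touch_pages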

When you see the system slowing down, it may be that the system is swapping pages to disk because the physical memory is full. If you have 8 GB of RAM and 8 GB of swap on disk, your calculation can run to completion (very slowly, though...).

This mechanism is pretty good in NUMA environments, since this "first touch policy" allocates the memory close to the CPU which first writes into it. In this way, you can initialize an array in an OpenMP loop to physically place the memory close to the CPUs that will use it, as in the sketch below.
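A minimal first-touch sketch (assumes OpenMP, compile with e.g. gfortran -fopenmp): each thread writes first to the columns it will later work on, so those pages are placed on the memory of that thread's NUMA node.

program first_touch
    use ISO_FORTRAN_ENV
    implicit none
    integer, parameter :: dp = REAL64                      ! double precision
    integer, parameter :: n = 10000                        ! ~800 MB array
    real(dp), allocatable :: a(:,:)
    integer :: j

    allocate(a(n,n))                                       ! virtual reservation only

    !$omp parallel do schedule(static)
    do j = 1,n
        a(:,j) = 0._dp                                     ! first touch: pages placed near this thread
    end do
    !$omp end parallel do
end program first_touch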

Anthony Scemama