
When I run the following code, I expect that once foo() has finished executing, the memory it used (basically to create m) would be released. However, that is not the case; to release this memory I have to restart the IPython console.

%%cython
# distutils: language = c++

import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int]    m
        int i
    for i in range(50000000):
        m[i] = i

foo()

It will be great if someone could tell me why this is the case and also how to release this memory without restarting the shell. Thanks in advance.

Manish Goel
    If you look into the generated cpp you will see, that `m` is allocated on the stack and thus memory is automatically freed once the foo() is done. Whether this freed memory is returned to the OS is another question - it depends on your memory allocator and nothing Cython can help you with. – ead Jun 11 '19 at 09:40
  • See for example: https://stackoverflow.com/q/2215259/5769463 – ead Jun 11 '19 at 09:42
  • I'd think the way to confirm this is to run `foo` twice - you won't see it use double the memory – DavidW Jun 11 '19 at 09:42
  • @ead Thanks for sharing the link, unfortunately, I am not very experienced in such low-level details so I was not able to understand much of it. But, would it be fair to say that there is no cython/python method to release this memory back to OS? – Manish Goel Jun 11 '19 at 09:56
  • @ManishGoel your map needs probably more than 1GB, I would expect your memory allocator to return lion's share of this memory to OS - if not you might be considering to use another memory allocator. You can tweak the behavior of the memory allocator as described in the link above (and use Cython for that), but it is an implementation detail - and surely too broad for this question. – ead Jun 11 '19 at 10:20
  • @ead When I test this on Linux I see the same thing as OP does - the memory doesn't appear to be released after the function (however calling the function a second time doesn't use any more memory). If it were me I wouldn't worry about it though – DavidW Jun 11 '19 at 10:57
  • @DavidW The problem is, the actual function takes more than 200GB of memory. Therefore, I need it to be freed so that other processes on the machine can continue working. – Manish Goel Jun 11 '19 at 11:34

1 Answer


The effects you are seeing are more or less implementation details of your memory allocator (probably glibc's default allocator). glibc's memory allocator works as follows:

  • requests for small memory blocks are satisfied from arenas, which grow (and whose number grows) as needed.
  • requests for large memory blocks are taken directly from the OS, but are also returned directly to the OS as soon as they are freed.

One can tweak when memory from those arenas is released using mallopt, but normally an internal heuristic decides when/if memory should be returned to the OS - which, I must confess, is kind of black magic to me.

The problem with std::map (and the situation is similar for std::unordered_map) is that it doesn't consist of one big chunk of memory, which would be returned to the OS immediately, but of a lot of small nodes (map is implemented as a red-black tree by libstdc++) - so they all come from those arenas, and the heuristic decides not to return the memory to the OS.

As we are using glibc's allocator, we can use the non-standard function malloc_trim to free the memory manually:

%%cython

cdef extern from "malloc.h" nogil:
    int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)

and now just call return_memory_to_OS() after every usage of foo.


The above solution is quick & dirty, but not portable. What you really want is a custom allocator which releases memory back to the OS as soon as it is no longer used. That is a lot of work - but luckily we already have such an allocator at hand: CPython's pymalloc - since Python 2.5 it returns memory to the OS (even if that sometimes causes trouble). However, we should also point out a big deficiency of pymalloc: it is not thread-safe, so it can only be used from code that holds the GIL!

Using the pymalloc allocator has not only the advantage of returning memory to the OS: because pymalloc is 8-byte-aligned while glibc's allocator is 32-byte-aligned, the resulting memory consumption will also be smaller (nodes of map[int,int] are 40 bytes, costing only 40.5 bytes with pymalloc (overhead included), while glibc needs no less than 64 bytes).

My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the functionality that is really needed:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
     public:
       // type definitions
       typedef T        value_type;
       typedef T*       pointer;
       typedef std::size_t    size_type;

       template <class U>
       pymalloc_allocator(const pymalloc_allocator<U>&) throw(){};
       pymalloc_allocator() throw() = default;
       pymalloc_allocator(const pymalloc_allocator&) throw() = default;
       ~pymalloc_allocator() throw() = default;

       // rebind allocator to type U
       template <class U>
       struct rebind {
           typedef pymalloc_allocator<U> other;
       };

       pointer allocate (size_type num, const void* = 0) {
           pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
           return ret;
       }

       void deallocate (pointer p, size_type num) {
           PyMem_Free(p);
       }

       // missing: destroy, construct, max_size, address
    };

    // missing:
    //  bool operator== , bool operator!=

    #include <utility>
    typedef pymalloc_allocator<std::pair<int, int>> PairIntIntAlloc;

    //further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass


from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i

Now the lion's share of the used memory is returned to the OS as soon as foo is done - on any operating system and with any memory allocator!


If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd doesn't expose all of the template parameters, so one has to wrap it manually:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    ....

    //further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=*]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i

Here are some benchmarks, which are obviously not complete but probably show the direction pretty well (for N=3e7 rather than N=5e8):

                                   Time           PeakMemory

map_default                        40.1s             1416Mb
map_default+return_memory          41.8s 
map_pymalloc                       12.8s             1200Mb

unordered_default                   9.8s             1190Mb
unordered_default+return_memory    10.9s
unordered_pymalloc                  5.5s              730Mb

The timings were measured with the %timeit magic, and peak memory usage via /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.

I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and also that memory allocation seems to be the bottleneck for the usual map! Maybe this is the price glibc has to pay for supporting multi-threading.

unordered_map is faster and maybe needs less memory (though, because of rehashing, the last part could be wrong).

ead