
Background / rationale

I have a Python program that fetches event-based data from a measurement instrument, processes it and writes the results to disk. The input events use a fair amount of memory, roughly 10 MB per event. When the input event rate is high, the processing may not keep up, causing the events to pile up in an internal queue. This goes on until the available memory is almost used up, at which point the program directs the instrument to throttle the acquisition rate (which works, but reduces accuracy). This moment is detected by watching the available system memory via psutil.virtual_memory().available. For the best results, throttling should be disabled as soon as enough memory has been freed by processed events. This is where the trouble comes in.
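For reference, the throttling decision looks roughly like the following sketch; set_throttle() and the thresholds are placeholders for the actual instrument interface, only the psutil part is taken from the real code:

import psutil

# Placeholder thresholds: throttle below 2 GiB available, resume above 8 GiB.
THROTTLE_BELOW = 2 * 2**30
RESUME_ABOVE = 8 * 2**30

def update_throttling(instrument, throttled: bool) -> bool:
    """Enable or disable acquisition throttling based on available system memory."""
    avail = psutil.virtual_memory().available
    if not throttled and avail < THROTTLE_BELOW:
        instrument.set_throttle(True)   # placeholder for the real instrument call
        return True
    if throttled and avail > RESUME_ABOVE:
        instrument.set_throttle(False)
        return False
    return throttled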

It appears that the CPython interpreter does not (or not always) return freed memory to the OS, which makes psutil (and also gnome-system-monitor) report insufficient available memory. However, the memory is actually available: manually disabling the throttling fills the queue again without the consumption increasing further, unless even more events are placed in the queue than before.

The following example shows this behaviour. On my computer, roughly 50% of the invocations showed the problem, while the rest properly freed the memory. It happened a few times that the memory was freed at the end of iteration 0, but not at the ends of iterations 1 and 2, so the behaviour seems to be somewhat random.

#!/usr/bin/env python3

import psutil
import time
import queue

import numpy as np

def get_avail() -> int:
    avail = psutil.virtual_memory().available
    print(f'Available memory: {avail/2**30:.2f} GiB')
    return avail

q: 'queue.SimpleQueue[np.ndarray]' = queue.SimpleQueue()

for i in range(3):
    print('Iteration', i)
    # Allocate data for 90% of available memory.
    for i_mat in range(round(0.9 * get_avail() / 2**24)):
        q.put(np.ones((2**24,), dtype=np.uint8))
    # Show remaining memory.
    get_avail()
    time.sleep(5)
    # The data is now processed, releasing the memory.
    try:
        n = 0
        while True:
            n += q.get_nowait().max()
    except queue.Empty:
        pass
    print('Result:', n)
    # Show remaining memory.
    get_avail()
    print(f'Iteration {i} ends')
    time.sleep(5)
print('Program done.')
get_avail()

The expected behaviour would be low available memory before the result is printed and high afterwards:

Iteration 0
Available memory: 22.24 GiB
Available memory: 2.17 GiB
Result: 1281
Available memory: 22.22 GiB
Iteration 0 ends

However, it may also end up like this:

Iteration 1
Available memory: 22.22 GiB
Available memory: 2.19 GiB
Result: 1280
Available memory: 2.36 GiB
Iteration 1 ends

Integrating explicit calls to the garbage collector (after adding import gc at the top of the script), like

    print('Result:', n)
    # Show remaining memory.
    get_avail()
    gc.collect(0)
    gc.collect(1)
    gc.collect(2)
    get_avail()
    print(f'Iteration {i} ends')

does not help; the memory may still stay in use.

I'm aware that there are workarounds, e.g. checking the queue size instead of the available memory. But this would make the system more prone to resource exhaustion if some other process happens to consume lots of memory. Using multiprocessing would not fix the issue either, since the event fetching must be done single-threaded, so the fetching process would face the same problem with its queue.

Questions

  • How can I query the interpreter's memory management to find out how much memory is used by referenced objects and how much is just reserved for future use and not given back to the OS?

  • How can I force the interpreter to give reserved memory back to the OS, so that the reported available memory actually increases?

The target platform is Ubuntu 20.04+ and CPython 3.8+; support for previous versions or other flavours is not required.

Thanks.

Philipp Burch
  • I don't think you're going to be able to "return" the memory to the OS. It's possible that some entirely unused pages may be paged out, but beyond that, you're stuck, short of terminating the process and starting a new one. – Tom Karzes Jul 02 '21 at 13:29
  • 1
    Just to be clear, the garbage collector is only needed to break reference cycles; otherwise, any memory the *Python* memory allocator can reclaim is done so based on when the reference count reaches zero. Either way, though, nothing in CPython will release memory back to the OS. – chepner Jul 02 '21 at 13:30
  • Why throttle, though, based on memory available from the OS? If Python is still holding that memory, it will reuse it when possible rather than requesting more from the OS. – chepner Jul 02 '21 at 13:31
  • @Tom Karzes: There are indeed references which suggest that the interpreter does not give memory back to the OS, ever. But the example code shows that it can happen, just not always. – Philipp Burch Jul 02 '21 at 13:40
  • @chepner: Throttling is required to avoid the process being killed due to out of memory. The amount of available memory is used to decide when to throttle and, ideally, also when to go back to full speed acquisition. The interpreter memory is of course reused, but the software does not know how much of this reserved memory is available. – Philipp Burch Jul 02 '21 at 13:43
  • @PhilippBurch I think you're jumping to unfounded conclusions. Once the process expands its address space, that memory is part of the process. As I said, there may be other effects that you're seeing, such as portions of memory being paged out, but they're still part of the address space. – Tom Karzes Jul 02 '21 at 14:07
  • @TomKarzes: I wasn't aware that malloc() can use both an internal heap and mapped memory pages. What do you mean by "paging out"? Unmapping is what I'm after; should you mean swapping, then no, the memory was not moved to swap space, it was really given back as available. Anyway, see my answer below for more stuff and a solution to my original problem. – Philipp Burch Jul 05 '21 at 08:13
  • @PhilippBurch It doesn't. It uses an internal heap. I'm saying that if all of the memory on a given page goes unreferenced for a while, then the OS may choose to page it out. It has nothing to do with explicit page mapping. – Tom Karzes Jul 05 '21 at 08:17

1 Answer


It's not Python or NumPy

As already indicated in the comments on the question, the observed effect is not specific to Python (or NumPy, where the memory for the large ndarrays is actually allocated). Instead, it is a feature of the C runtime in use, in this case glibc.

Heap and memory mapping

When memory is requested with malloc() (as NumPy does when an array is allocated), the runtime decides whether to serve the request from the heap, which is grown with the brk/sbrk syscalls, or from pages obtained with mmap(); smaller chunks go on the heap, larger ones get their own mapping. Allocated heap space can be given back to the OS, but only if there is enough contiguous free space at the top of the heap. This means that even a few bytes belonging to an object that happens to sit at the top of the heap can effectively prevent the process from returning any heap memory to the OS. The memory is not wasted, since the runtime will reuse the freed space on the heap for subsequent calls to malloc(), but it remains part of the process and is therefore never reported as available until the process terminates.
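As a side note, glibc also provides malloc_trim(), which asks the allocator to hand free heap memory back to the OS. Whether it actually releases anything depends on the heap layout, so the following ctypes call is only a sketch and not a guaranteed fix:

import ctypes

# glibc-specific: ask malloc to release free heap memory back to the OS.
# malloc_trim() returns 1 if some memory was released, 0 otherwise.
libc = ctypes.cdll.LoadLibrary("libc.so.6")
print('malloc_trim released memory:', bool(libc.malloc_trim(0)))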

Allocating memory pages via mmap() is less efficient, but it has the benefit that such pages can be given back to the OS when they are no longer needed. The performance hit comes from the kernel being involved whenever pages are mapped or unmapped, in particular because the kernel has to zero out newly mapped pages for security reasons.

The mmap threshold

malloc() uses a threshold on the requested amount of memory to decide if it should use the heap or mmap(). This threshold is dynamic in recent versions of the glibc, but it may be changed using the mallopt() function:

M_MMAP_THRESHOLD

[...]

Note: Nowadays, glibc uses a dynamic mmap threshold by default. The initial value of the threshold is 128*1024, but when blocks larger than the current threshold and less than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted upwards to the size of the freed block. When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold. Dynamic adjustment of the mmap threshold is disabled if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.

The threshold can be adjusted in at least two ways:

  1. Using a call to mallopt().

  2. By setting the environment variable MALLOC_MMAP_THRESHOLD_ (note the trailing underscore).

Applied to the example code

The example allocates (and deallocates) memory in chunks of 2**24 bytes, i.e. 16 MiB. According to the theory above, a fixed mmap threshold somewhat below this value should therefore ensure that all of the large arrays are allocated using mmap(), allowing them to be unmapped and their memory given back to the OS.

First a run without modification:

$ ./test_mem.py 
Iteration 0
Available memory: 21.45 GiB
Available memory: 2.17 GiB
Result: 1235
Available memory: 21.50 GiB
Iteration 0 ends
Iteration 1
Available memory: 21.50 GiB
Available memory: 2.13 GiB
Result: 1238
Available memory: 3.95 GiB
Iteration 1 ends
Iteration 2
Available memory: 4.02 GiB
Available memory: 4.02 GiB
Result: 232
Available memory: 4.02 GiB
Iteration 2 ends
Program done.
Available memory: 4.02 GiB

The memory is not returned in iterations 1 and 2.

Let's set a fixed threshold of 1MiB now:

$ MALLOC_MMAP_THRESHOLD_=1048576 ./test_mem.py 
Iteration 0
Available memory: 21.55 GiB
Available memory: 2.13 GiB
Result: 1241
Available memory: 21.52 GiB
Iteration 0 ends
Iteration 1
Available memory: 21.52 GiB
Available memory: 2.11 GiB
Result: 1240
Available memory: 21.52 GiB
Iteration 1 ends
Iteration 2
Available memory: 21.51 GiB
Available memory: 2.12 GiB
Result: 1239
Available memory: 21.53 GiB
Iteration 2 ends
Program done.
Available memory: 21.53 GiB

As can be seen, the memory is successfully given back to the OS in all three iterations. As an alternative, the setting can also be integrated into the Python script by a call to mallopt() using the ctypes module:

#!/usr/bin/env python3

import ctypes
import psutil
import time
import queue

import numpy as np

libc = ctypes.cdll.LoadLibrary("libc.so.6")
M_MMAP_THRESHOLD = -3

# Set malloc mmap threshold.
libc.mallopt(M_MMAP_THRESHOLD, 2**20)

# ...

Disclaimer: These solutions/workarounds are far from being platform-independent, as they make use of specific glibc features.
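If the script also has to run on platforms without glibc, the call can be guarded so that it degrades to a no-op; a rough sketch (note that find_library() may still return a non-glibc libc on other systems):

import ctypes
import ctypes.util

M_MMAP_THRESHOLD = -3  # glibc mallopt() parameter

def set_mmap_threshold(nbytes: int) -> bool:
    """Try to set a fixed mmap threshold; do nothing on non-glibc platforms."""
    libc_name = ctypes.util.find_library('c')
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        return bool(libc.mallopt(M_MMAP_THRESHOLD, nbytes))
    except (OSError, AttributeError):
        return False

# Call this early, before the large allocations happen.
set_mmap_threshold(2**20)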

Note

The above text mainly answers the second and more important question, "How can I force the interpreter to give reserved memory back to the OS, so that the reported available memory actually increases?". As for the first question, "How can I query the interpreter's memory management to find out how much memory is used by referenced objects and how much is just reserved for future use and not given back to the OS?", I was not able to find a satisfactory answer. Calling malloc_stats():

libc = ctypes.cdll.LoadLibrary("libc.so.6")
# ... script here ...
libc.malloc_stats()

gives some numbers, but these results (from a run without changing the mmap threshold):

Arena 0:
system bytes     = 1632264192
in use bytes     =    4629984
Total (incl. mmap):
system bytes     = 1632858112
in use bytes     =    5223904
max mmap regions =       1236
max mmap bytes   = 20725514240

seem a bit confusing to me. 5 MiB could be the memory actually in use when the script ends, but what about the "system bytes"? The process still occupies almost 20 GiB at this point, so the reported 1.6 GiB doesn't fit the picture at all.


Philipp Burch