4

I have a python script whose memory usage (observed in top) grows indefinitely while the script is running.

My script is not supposed to store anything: it just receives a request (REST API), processes it, and returns a result. The expected behaviour is that memory usage stays constant.

I'm looking for a suitable memory profiling tool that would allow me to identify where in the code the growing object is allocated.

I had a look at FIL-profiler, but it seems to focus on peak memory usage, which I believe is not my case.

thanks

EDIT1: I wanted to give pyrasite a try, but I could not make it work. Since the last update of that project was in May 2012, it may no longer be compatible with current Python (I'm using 3.8).

EDIT2: I gave Pympler a try, following this post. I added the following code to my script:

from pympler import muppy, summary

all_objects = muppy.get_objects()        # all objects currently reachable in the interpreter
summ = summary.summarize(all_objects)    # aggregate by type ("summ" avoids shadowing the built-in sum)
summary.print_(summ)

This outputs:

    types |   # objects |   total size
========= | =========== | ============
     dict |        7628 |      3.15 MB
      str |       24185 |      2.71 MB
     type |        1649 |      1.39 MB
    tuple |       13722 |      1.35 MB
     code |        7701 |      1.06 MB
      set |         398 |    453.06 KB
....

This does not show any suspiciously large objects (the script, while running, accumulates memory usage that reaches the GB scale).
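A follow-up I could still try with Pympler (a sketch only; the loop and the leaky list below are just stand-ins for a real request cycle): its SummaryTracker diffs object summaries between two points in time, which should be better suited to spotting steady growth than a single snapshot.

from pympler import tracker

tr = tracker.SummaryTracker()      # takes a baseline summary on creation

# stand-in for a few request cycles (hypothetical; replace with the real handler)
leaky = []
for _ in range(3):
    leaky.append("x" * 10_000)
    tr.print_diff()                # prints only the objects created since the previous diff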

EDIT3: I tried tracemalloc: I put a tracemalloc.start() at the beginning of the script, then, at the end of the script but before stopping (so that the memory usage is already visibly very high according to top), I do:

snapshot = tracemalloc.take_snapshot()
display_top(snapshot)   # display_top is the "pretty top" helper from the tracemalloc docs

This displays the lines of code with the highest memory usage, but the totals are still nothing in comparison to what is observed in top. I also tried a gc.collect(), but it has no effect.
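A variant I could still try with tracemalloc (a sketch only, assuming the allocations go through Python's allocator at all): take two snapshots a few requests apart and diff them, so that only the growth shows up rather than the absolute totals.

import gc
import tracemalloc

tracemalloc.start()

snapshot1 = tracemalloc.take_snapshot()
# ... let the server handle a batch of requests here ...
gc.collect()
snapshot2 = tracemalloc.take_snapshot()

# show the source lines whose allocations grew the most between the two snapshots
for stat in snapshot2.compare_to(snapshot1, "lineno")[:10]:
    print(stat)

Note that tracemalloc only sees allocations made through Python's memory allocator, so a leak in a C extension that calls malloc directly would not appear in either snapshot.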

Loic Rosnay
  • Possible tools that can help you are **valgrind** or **eBPF & BCC** with memleak.py: https://github.com/iovisor/bcc/blob/master/tools/memleak.py – Daniser Jul 31 '20 at 09:12
  • have you seen the `tracemalloc` module? see https://stackoverflow.com/a/45679009/1358308 for details – Sam Mason Jul 31 '20 at 12:31
  • Not exactly what you’re asking for, but maybe you should look into using **weakref** on some of your bigger data objects. At least on any parts of the script that are persistent (a Main object or globals). – JL Peyret Jul 31 '20 at 19:22

2 Answers

4

I could not benefit from any of the proposed memory profilers. It turned out that the problem lay in a possible bug in the ujson library, which is written in C. I guess this is the reason why none of those Python memory profilers could help here. I think I have to close the issue. A remaining question would be: is there any Python tool that allows tracking a memory problem happening in a C module?
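For anyone hitting something similar, a minimal sketch of how growth inside a C extension can be confirmed without any Python-level profiler: call the suspect function in a loop and watch the process RSS (Linux only here, since it reads /proc/self/status; the ujson call is just a stand-in for whatever code path is under suspicion).

import ujson   # the suspect C extension in my case; substitute whatever module is under suspicion

def rss_kb():
    # current resident set size in kB, Linux only (parsed from /proc/self/status)
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

payload = {"key": list(range(1000))}
for i in range(100_000):
    ujson.dumps(payload)
    if i % 10_000 == 0:
        # RSS climbing steadily across iterations suggests the leak sits below the Python level
        print(i, rss_kb(), "kB")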

Loic Rosnay
1

Assuming you are using the "VIRT" column from top, you cannot extrapolate from that number to the conclusion that the number of your python objects is growing, or at least not growing enough to explain the total size of the virtual address space.

For example, did you know that python uses pthreads underneath its threading code? That is significant because pthreads takes, as the default stack size, the value from "ulimit -s" * 1K. So the default stack size for any new thread in python is often 8MB, or even 40MB on some variants of Linux, unless you explicitly change the "ulimit -s" value in the parent of your process. Many server applications are multi-threaded, so even having 2 additional threads would result in more virtual memory than you are showing from the Pympler output. You have to know how many threads your process has and what the default stack size is to understand how threads might contribute to the total VIRT value.
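Both numbers are easy to check from inside the process; a small sketch using only the standard library (resource and threading):

import resource
import threading

# the soft RLIMIT_STACK value is what pthreads uses as its default stack size
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
if soft == resource.RLIM_INFINITY:
    print("stack rlimit: unlimited (pthreads falls back to a platform default)")
else:
    print("default thread stack size:", soft // (1024 * 1024), "MB")
print("threads in this process:", threading.active_count())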

Also, underneath its own allocation mechanism python uses a mixture of mmap and malloc. In the case of malloc, if the program is multi-threaded, on linux libc malloc will use multiple arenas, which reserve 64MB ranges (called heaps) at a time, but only make the starts of these heaps readable and writable, leaving the tails of these ranges inaccessible until the memory is needed. Those inaccessible tails are often large but don't really matter at all in terms of the committed memory for the process, because inaccessible pages don't require any physical memory or swap space. Nonetheless, top counts in "VIRT" the entire range, including both the accessible start at the beginning of the range and the inaccessible tail at the end of the range.

For example, consider the rather tiny python program below, in which the main thread starts 16 additional threads, none of which uses much memory in python allocations:

import threading

def spin(seed):
    # each thread just keeps reworking a small list, so python-level memory use stays tiny
    l = [i * seed for i in range(64)]
    while True:
        l = [i * i % 97 for i in l]

for i in range(16):
    t = threading.Thread(target=spin, args=[i])
    t.start()

We would not expect that program to result in a particularly large process, but here is what we see under top, looking at just that one process, which shows well over 1GB of VIRT:

Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.9%sy,  3.3%ni, 93.3%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264401648k total, 250450412k used, 13951236k free,  1326116k buffers
Swap: 69205496k total, 17321340k used, 51884156k free, 104421676k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                      
 9144 tim       20   0 1787m 105m  99m S 101.8  0.0  13:40.49 python                                                                                                                                       

We can understand the high VIRT value for such a program (and for your program, for that matter) by taking a live core of the running process (using gcore, for example), opening the resulting core with chap, an open-source tool that can be found at https://github.com/vmware/chap, and running commands as shown:

chap> summarize writable
17 ranges take 0x2801c000 bytes for use: stack
14 ranges take 0x380000 bytes for use: python arena
16 ranges take 0x210000 bytes for use: libc malloc heap
1 ranges take 0x1a5000 bytes for use: libc malloc main arena pages
11 ranges take 0xa9000 bytes for use: used by module
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
2 ranges take 0x7000 bytes for use: unknown
62 writable ranges use 0x28832000 (679,682,048) bytes.

From the above we can see that the biggest single use of memory is 17 stacks. If we were to use "describe writable" we would see that 16 of those stacks take 40MB each, and this is all because we have neglected to explicitly tune the stack size to something more reasonable.
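A hedged sketch of what that tuning could look like from the python side, using threading.stack_size rather than changing "ulimit -s" in the parent (the 512KB figure is just an example value, and the call only affects threads started afterwards):

import threading
import time

# ask for a smaller per-thread stack before any worker threads are started;
# 512KB is an arbitrary example value, not a recommendation
threading.stack_size(512 * 1024)

def worker():
    time.sleep(60)   # placeholder workload

for _ in range(16):
    threading.Thread(target=worker).start()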

We can do something similar with inaccessible (not readable, writable or executable) regions and see that 16 "libc malloc heap tail reservation" ranges take almost 1GB of VIRT. This doesn't, as it turns out, really matter for the committed memory of the process, but it does make a rather scary contribution to the VIRT number.

chap> summarize inaccessible
16 ranges take 0x3fdf0000 bytes for use: libc malloc heap tail reservation
11 ranges take 0x15f9000 bytes for use: module alignment gap
16 ranges take 0x10000 bytes for use: stack overflow guard
43 inaccessible ranges use 0x413f9000 (1,094,684,672) bytes.

The reason that there are 16 such ranges is that each of the spinning threads was repeatedly making allocations, and this caused libc malloc, running behind the scenes because it gets used by the python allocator, to carve out 16 arenas.
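If those arena reservations themselves ever matter in your case, glibc also lets you cap the number of arenas it creates. A hedged sketch (glibc only; -8 is the M_ARENA_MAX constant from glibc's malloc.h, and the simpler alternative is exporting MALLOC_ARENA_MAX=2 before starting the process):

import ctypes

# glibc only: cap the number of malloc arenas so fewer 64MB heap ranges get
# reserved; this has to run before the extra arenas are created
libc = ctypes.CDLL("libc.so.6")
M_ARENA_MAX = -8                        # from glibc's malloc.h
if libc.mallopt(M_ARENA_MAX, 2) != 1:   # mallopt returns 1 on success
    print("mallopt(M_ARENA_MAX) failed")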

You could run a similar command for read-only memory ("summarize readonly"), or get more detail from any of these commands by changing "summarize" to "describe".

I can't say exactly what you will find when you run this on your own server, but it seems quite likely that a REST server is multi-threaded, so I would guess that threads make a non-trivial contribution to the number top is showing you.

This still doesn't answer why the number keeps climbing, but if you look at those numbers you can see where to look next. For example, in the summary above, allocations that are handled by python without resorting to mmap don't take much memory, at only 3.5MB:

14 ranges take 0x380000 bytes for use: python arena

In your case the number might be larger, so you might try any of the following:

describe used
describe free
summarize used
summarize free

Note that the above commands cover native allocations (for example, ones made by a shared library) as well as python allocations.

Following similar steps on your own program should give you a much better understanding of those "top" numbers you are seeing.

Tim Boddy
  • Thanks for the detailed explanation! I will have a look at this. Actually, I was just looking at the "%MEM" column. – Loic Rosnay Aug 02 '20 at 07:18
  • I think %MEM has to do with what fraction of physical memory is used by the process at present. I get this sense based on the apparent linear relation with RES. That is in general not a great indicator of process growth because a process may have lots of committed address space that is currently swapped out. – Tim Boddy Aug 03 '20 at 14:02
  • Well, in my case the %MEM was a very concrete indication: letting the program run for 1 hour on my laptop, the memory usage would go to 100%, then start to swap, and then crash... – Loic Rosnay Aug 05 '20 at 07:40