I have a network process that communicates over TCP, collecting data, deserialising it and finally storing parts of it in a LevelDB key/value store (via Plyvel).
Slowly over time it consumes all available memory, to the point that my whole system locks up (Ubuntu 18.04). I'm trying to diagnose the cause, but I'm running out of ideas on how to investigate further.
My main suspect is the data streams we operate on to deserialise the objects. The general pattern is: receive data from an asyncio.StreamReader and call deserialize_from_bytes:
@classmethod
def deserialize_from_bytes(cls, data_stream: Union[bytes, bytearray]):
    """ Deserialize object from a byte array. """
    br = BinaryReader(stream=data_stream)
    inv_payload = cls()
    try:
        inv_payload.deserialize(br)
    except ValueError:
        return None
    finally:
        br.cleanup()
    return inv_payload
where BinaryReader initialises like this:
def __init__(self, stream: Union[io.BytesIO, bytes, bytearray]) -> None:
    super(BinaryReader, self).__init__()
    if isinstance(stream, (bytearray, bytes)):
        self._stream = io.BytesIO(stream)
    else:
        self._stream = stream
and cleanup() is a convenience wrapper for self._stream.close().
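For context, the receiving side looks roughly like the sketch below. The framing, the Message class and the store() helper are placeholders, not the real names:

import asyncio

async def handle_connection(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    while True:
        # Simplified framing: fixed-size length prefix followed by the payload.
        header = await reader.readexactly(4)
        length = int.from_bytes(header, 'little')
        data = await reader.readexactly(length)

        # Message stands in for the actual payload class.
        payload = Message.deserialize_from_bytes(data)
        if payload is not None:
            store(payload)  # eventually written to LevelDB via Plyvel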
What I've tried:
- I started with the "display top 10" snippet from the tracemalloc docs (reproduced below). At a point in time where /proc/$mypid/status shows a memory usage of 2 GB (by VmSize), the most consuming item reported by tracemalloc is a mere 38 MB, followed by the second and third at 3 MB and 360 KB. This already raises a question: where is the other ~1.958 GB?
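The snippet, essentially straight from the tracemalloc documentation:

import tracemalloc

tracemalloc.start()

# ... let the process run and accumulate memory ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)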
- Not helped by the above output, I tried objgraph. I break into the process with pdb and run

import objgraph
objgraph.show_most_common_types(limit=20)

which gives:
function 16451
tuple 11456
dict 10371
weakref 3058
list 2893
cell 2446
Traceback 2277
Statistic 2277
_Binding 2109
getset_descriptor 1814
type 1680
builtin_function_or_method 1469
wrapper_descriptor 1311
method_descriptor 1284
frozenset 992
property 983
module 810
ModuleSpec 808
SourceFileLoader 738
Attrs 593
I don't see any objects specific to my program in that list (which could have indicated references not being released). Compared to other objgraph examples I've found, the above counts don't seem out of the ordinary.
I inspected a couple of the function objects and always found something similar to the following (asyncio related), which doesn't suggest any memory is leaking there:
(Pdb) objgraph.at(0x10f309b70)
<function _run_coroutine.<locals>.step_next.<locals>.continue_ at 0x10f309b70>
Please correct me here if I'm missing something. Again, this got me no step further.
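For reference, as far as I understand objgraph, the next step would be to dump the back references of a single suspect object to a graph, roughly like this (max_depth and filename chosen arbitrarily; rendering the .png needs graphviz installed):

import objgraph

obj = objgraph.at(0x10f309b70)
objgraph.show_backrefs([obj], max_depth=3, filename='backrefs.png')

So far I have nothing obvious to point it at.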
- As a final attempt to see if I can force any memory to be released, I manually ran gc.collect(). Calling it repeatedly returns values between 25 and 500 for the number of unreachable objects. That doesn't strike me as a problem (should it?). I also tried running the program with gc.set_debug(gc.DEBUG_LEAK), but that generates so much output I can't make anything of it.
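Concretely, this is roughly what I run (the first part from the pdb prompt, the second on a separate run at program start):

import gc

# From pdb, called repeatedly; returns the number of unreachable objects found,
# which in my case is somewhere between 25 and 500 per call.
print(gc.collect())

# Separate run, enabled before the process starts accumulating memory.
gc.set_debug(gc.DEBUG_LEAK)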
Any tips on what I can try from here?