I have a network process that communicates over TCP, collecting data, deserialising it and finally storing parts of it in a LevelDB key/value store (via Plyvel).
Slowly over time it consumes all available memory, to the point that my whole system locks up (Ubuntu 18.04). I'm trying to diagnose the cause, but I'm running out of ideas on how to investigate further.
My main suspect is the data streams we operate on to deserialise the objects. The general pattern is: receive data from an asyncio.StreamReader and call deserialize_from_bytes:
@classmethod
def deserialize_from_bytes(cls, data_stream: Union[bytes, bytearray]):
    """ Deserialize object from a byte array. """
    br = BinaryReader(stream=data_stream)
    inv_payload = cls()
    try:
        inv_payload.deserialize(br)
    except ValueError:
        return None
    finally:
        br.cleanup()
    return inv_payload
where BinaryReader initialises like this:
def __init__(self, stream: Union[io.BytesIO, bytes, bytearray]) -> None:
    super(BinaryReader, self).__init__()
    if isinstance(stream, (bytearray, bytes)):
        self._stream = io.BytesIO(stream)
    else:
        self._stream = stream
and cleanup() is a convenience wrapper for self._stream.close().
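For context, the receiving side looks roughly like the sketch below. The framing, the Message class and the store() helper are placeholders, not the real names:

import asyncio

async def handle_connection(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    while True:
        # Simplified framing: fixed-size length prefix followed by the payload.
        header = await reader.readexactly(4)
        length = int.from_bytes(header, 'little')
        data = await reader.readexactly(length)

        # Message stands in for the actual payload class.
        payload = Message.deserialize_from_bytes(data)
        if payload is not None:
            store(payload)  # eventually written to LevelDB via Plyvel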
What I've tried:
- I started with the "display top 10" snippet from the tracemalloc docs (reproduced below). At a point in time where /proc/$mypid/status shows a memory usage of 2 GB (by VmSize), the most consuming item reported by tracemalloc is a mere 38 MB, followed by the second and third at 3 MB and 360 KB. This already raises a question: where is the other ~1.958 GB?
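The snippet, essentially straight from the tracemalloc documentation:

import tracemalloc

tracemalloc.start()

# ... let the process run and accumulate memory ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)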
- Not helped by the above output, I tried objgraph. I break into the process with pdb and run

import objgraph
objgraph.show_most_common_types(limit=20)

which gives:
function 16451
tuple 11456
dict 10371
weakref 3058
list 2893
cell 2446
Traceback 2277
Statistic 2277
_Binding 2109
getset_descriptor 1814
type 1680
builtin_function_or_method 1469
wrapper_descriptor 1311
method_descriptor 1284
frozenset 992
property 983
module 810
ModuleSpec 808
SourceFileLoader 738
Attrs 593
I don't see any objects specific to my program in that list (which could have indicated references not being released). Compared to other objgraph examples I've found, the above counts don't seem out of the ordinary.
I inspected a couple of the function objects and always found something similar to the following (asyncio related), which doesn't suggest any memory is leaking there:
(Pdb) objgraph.at(0x10f309b70)
<function _run_coroutine.<locals>.step_next.<locals>.continue_ at 0x10f309b70>
Please correct me here if I'm missing something. Again, this got me no step further.
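For reference, as far as I understand objgraph, the next step would be to dump the back references of a single suspect object to a graph, roughly like this (max_depth and filename chosen arbitrarily; rendering the .png needs graphviz installed):

import objgraph

obj = objgraph.at(0x10f309b70)
objgraph.show_backrefs([obj], max_depth=3, filename='backrefs.png')

So far I have nothing obvious to point it at.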
- As a final attempt to see if I can force any memory to be released, I manually ran gc.collect(). Calling it repeatedly returns values between 25 and 500 for the number of unreachable objects. That doesn't strike me as a problem (should it?). I also tried running the program with gc.set_debug(gc.DEBUG_LEAK), but that generates so much output I can't make anything of it.
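Concretely, this is roughly what I run (the first part from the pdb prompt, the second on a separate run at program start):

import gc

# From pdb, called repeatedly; returns the number of unreachable objects found,
# which in my case is somewhere between 25 and 500 per call.
print(gc.collect())

# Separate run, enabled before the process starts accumulating memory.
gc.set_debug(gc.DEBUG_LEAK)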
Any tips on what I can try from here?