I have a class with a cache implemented as a dict holding numpy arrays, which can occupy GBs of data.
import numpy as np
from typing import Dict, Tuple

class WorkOperations(object):
    def __init__(self):
        self.data_cache: Dict[str, Dict[str, Tuple[np.ndarray, np.ndarray]]] = {}

    def get_data(self, key):
        if key not in self.data_cache:
            self.add_data(key)
        return self.data_cache[key]

    def add_data(self, key):
        result = run_heavy_calculation(key)
        self.data_cache[key] = result
I am testing the code with this function -
import gc

def perform_operations():
    work_operations = WorkOperations()
    # input_keys() gives a list of keys to process
    for key in input_keys():
        data = work_operations.get_data(key)
        do_some_operation(data)
    del work_operations

perform_operations()
gc.collect()
The result of run_heavy_calculation is heavy in memory, so data_cache soon grows and occupies GBs of memory (which is expected). But the memory does not get released even after perform_operations() is done. I tried adding del work_operations and invoking gc.collect(), but that did not help either. I checked the memory of the process after several hours, and it was still not freed up. If I don't use caching (data_cache) at all (at the cost of latency), memory never goes high.
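To see whether something is still holding references to the cached arrays after the object is deleted (as opposed to the allocator simply not returning freed pages to the OS), a weak-reference probe along these lines could help. This is only a diagnostic sketch; it reuses input_keys() from the test function above and assumes get_data returns the dict of array tuples described by the type hint:

import gc
import weakref

work_operations = WorkOperations()
first_key = next(iter(input_keys()))          # input_keys() as in the test function
cached = work_operations.get_data(first_key)  # Dict[str, Tuple[np.ndarray, np.ndarray]]
probe = weakref.ref(next(iter(cached.values()))[0])  # weak reference to one cached array

del cached
del work_operations
gc.collect()

# False here would mean the arrays really were freed by Python and the high RES
# comes from the allocator keeping freed pages; True would mean something still
# references the cache.
print("cached array still alive:", probe() is not None)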
I am wondering what it is that is taking the memory. I tried running tracemalloc, but it just showed lines occupying memory in KBs. I also took a memory dump with gdb by looking at memory addresses from the process's pmap and /proc/<pid>/smaps output, but the dump is really long and even with a hex editor I couldn't figure out much.
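For reference, this is roughly how I ran tracemalloc (the exact snapshot handling may have differed slightly):

import gc
import tracemalloc

tracemalloc.start()
perform_operations()
gc.collect()

snapshot = tracemalloc.take_snapshot()
# Each of these entries only reported memory in KBs, nowhere near the GBs shown in RES.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)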
I am measuring the memory used by the process with the top command, looking at the RES column. I also tried logging the memory at the end from within the Python process itself with -
import psutil
import gc
import logging

GIGABYTE = 1024.0 * 1024.0 * 1024.0

perform_operations()
gc.collect()
memory_full_info = psutil.Process().memory_full_info()
logging.info(f"process memory after running the process: {memory_full_info.uss / GIGABYTE} GB")
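It might also be worth logging RSS alongside USS, since the RES column in top corresponds to RSS rather than USS; memory_full_info() exposes both:

import logging
import psutil

GIGABYTE = 1024.0 * 1024.0 * 1024.0

mem = psutil.Process().memory_full_info()
# RES in top is the resident set size (rss); uss only counts pages unique to this process.
logging.info(f"rss={mem.rss / GIGABYTE:.2f} GB uss={mem.uss / GIGABYTE:.2f} GB")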