
I am working with a very large matrix (~30-100 GB) that is needed for clustering. I am trying to optimize my program's memory usage, and I want to delete the matrix once it is no longer needed. I tried the following code:

import gc
import resource

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# create distance matrix
D = pairwise_distances(X=embeddings, metric='cosine', n_jobs=1)

# check memory after creating matrix
usage1 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# run clustering
clustering = DBSCAN(
    eps=0.5,
    min_samples=1,
    metric='precomputed'
).fit_predict(D)

# delete matrix
del D
gc.collect()

# check memory again
usage2 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

What I noticed is that the memory usage stays roughly the same (usage1 ~ usage2). I've read a bit about how garbage collection works in Python, and how Python keeps 'free lists' for faster allocation of small objects such as integers, but I am not sure whether that applies to this case. If I understand correctly, Python keeps memory blocks reserved for a given data type once they have been allocated, and the only way to free up that memory is to end the program?
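One thing worth checking before blaming the garbage collector: `ru_maxrss` is the *peak* resident set size, a high-water mark that never decreases during the life of the process, so `usage1 ~ usage2` is expected even if the matrix was actually freed. To see whether memory really comes back, read the *current* RSS instead. A minimal Linux-only sketch (the helper name `current_rss_kb` is made up here):

```python
import os
import resource

def current_rss_kb():
    """Current resident set size of this process in KiB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # 2nd field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") // 1024

# ru_maxrss is a high-water mark: it never goes down, so comparing it
# before and after `del` will always show roughly the same value.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak:", peak_kb, "KiB  current:", current_rss_kb(), "KiB")
```

Comparing `current_rss_kb()` before and after the `del` gives a meaningful measurement, whereas `ru_maxrss` cannot.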

I've seen some articles that work around this issue using multiple processes/subprocesses (article 1 and article 2), but since I am working in a Jupyter notebook in Watson Studio, that is not possible for me (as far as I have tried, multiprocessing does not work in Python notebooks there). So how do you explicitly free up memory by deleting an object (a NumPy array)?

As I said, I am running Python 3.7 in a Jupyter Notebook in Watson Studio. Let me know if you need more detail about the environment, and I will be happy to provide it.

  • del is supposed to do that correctly. However, if any reference to D still exists in the current scope, then D will not be collected. Note that by default gc.collect collects generation 2; I am not sure it collects generations 0 and 1. You could try to do that yourself by passing 0 and 1 in two other calls. Finally, regarding the system-side allocator: the GC can free the memory, but the allocator may not give the memory back to the system for performance reasons. In such a case, using a less conservative allocator may help. – Jérôme Richard Jun 16 '21 at 19:59
  • Thanks, what do you mean by 'if any reference to D exists in the current scope'? Also, what do you mean by a less conservative allocator? If the GC frees the memory, does that mean I can reuse it, even though the resident set size will still show it as taken? – szutsmester Jun 17 '21 at 12:43
  • I do not know what `fit_predict` does, but if a reference to `D` is stored internally, for example in `clustering` or any other data structure, the GC will not collect it. That being said, this is probably not the case here. On Linux, the usual allocator is the glibc one, which requests buffers from the OS and tends not to free them for performance reasons. Other allocators behave differently (Google's TCMalloc preallocates huge buffers ahead of time). You can try allocators other than the default one, like jemalloc for example. – Jérôme Richard Jun 17 '21 at 20:40
  • AFAIK, the CPython GC uses the standard malloc/free functions of the platform's libc. If the libc implementation always unmapped freed memory back to the OS, the memory consumption would probably be significantly smaller, but most libraries do not do that because the map/unmap syscalls are very expensive. I think it is not easy to find out how much memory has been released by the GC but is still mapped by the OS because of the libc allocator. AFAIK jemalloc can provide statistics in debug mode and so may be able to report such metrics. – Jérôme Richard Jun 17 '21 at 20:47
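Following up on the allocator discussion above: on Linux with glibc you can explicitly ask the allocator to hand trimmed heap memory back to the OS after a collection. A sketch using `malloc_trim` via ctypes (glibc-specific; this will fail on platforms without `libc.so.6`):

```python
import ctypes
import gc

gc.collect()  # drop Python-level references first

# glibc's malloc_trim(0) returns free heap memory to the OS where possible.
# It returns 1 if some memory was released, 0 otherwise. Linux/glibc only.
libc = ctypes.CDLL("libc.so.6")
released = libc.malloc_trim(0)
print("malloc_trim released memory:", bool(released))
```

Note that `malloc_trim` only helps for allocations served from the glibc heap; very large NumPy arrays are typically allocated with `mmap` and unmapped immediately on free, so whether this helps depends on how the memory was obtained.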
