I am working with a very large distance matrix (roughly 30-100 GB) that is needed for clustering. I am trying to optimize the memory usage of my program, and I want to delete the matrix once it is no longer needed. I tried the following code:
import gc
import resource

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# create distance matrix
D = pairwise_distances(X=embeddings, metric='cosine', n_jobs=1)
# check memory after creating matrix (ru_maxrss is the peak RSS so far)
usage1 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# run clustering on the precomputed distances
clustering = DBSCAN(
    eps=0.5,
    min_samples=1,
    metric='precomputed'
).fit_predict(D)
# delete matrix and force a garbage-collection pass
del D
gc.collect()
# check memory again
usage2 = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
What I noticed is that the reported memory is still roughly the same (usage1 ≈ usage2). I've read a bit about how garbage collection works in Python, and how Python keeps 'free lists' for faster allocation of integers, but I am not sure whether that applies to this case. If I understand it correctly, Python keeps memory blocks reserved for the same data type once they have been allocated, and the only way to free up that memory is to end the program?
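To double-check what is actually resident right now (rather than the peak that ru_maxrss reports), I tried sampling the current RSS. Here is a minimal sketch of that check, assuming the notebook kernel runs on Linux so that /proc/self/status is available:

import gc
import numpy as np

def current_rss_kb():
    # Read the *current* resident set size from /proc/self/status (Linux only).
    # Note: ru_maxrss is the *peak* RSS since process start, so it never drops.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # reported in kB

before = current_rss_kb()
a = np.ones((5000, 5000))   # ~200 MB float64 array
during = current_rss_kb()
del a
gc.collect()
after = current_rss_kb()
print(before, during, after)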
So how do you free up memory explicitly, by deleting an object (a NumPy array) from memory? I've seen some articles that work around this issue using multiple processes/subprocesses (see article 1 and article 2; a sketch of that pattern follows below), but since I am working in a Jupyter notebook in Watson Studio, that is not possible for me (as far as I have tried, multiprocessing did not work from the notebook).
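For reference, the pattern those articles describe is roughly the following sketch (cluster_in_child is my own illustrative name; embeddings is the same array as in the code above). In my environment, the child process never seems to start:

from multiprocessing import Process, Queue

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def cluster_in_child(embeddings, out):
    # Everything allocated here, including the big distance matrix D,
    # is returned to the OS when this child process exits.
    D = pairwise_distances(X=embeddings, metric='cosine', n_jobs=1)
    labels = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit_predict(D)
    out.put(labels)  # only the small label array crosses back to the parent

out = Queue()
p = Process(target=cluster_in_child, args=(embeddings, out))
p.start()
labels = out.get()  # get() before join() to avoid a full-pipe deadlock
p.join()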
As I said, I am running Python 3.7 in a Jupyter Notebook in Watson Studio. Let me know if you need more detail about the environment, and I will be happy to provide it.