I am working on a Python script which queries several different databases to collate data and persist that data to another database. The script collects data from potentially millions of records across about 15 different databases. To speed it up I have added some caching, which boils down to a dictionary that holds frequently queried data. The dictionary holds key-value pairs where the key is a hash generated from the database name, collection name and query conditions, and the value is the data retrieved from the database. For example:
{123456789: {_id: '1', someField: 'someValue'}}
where 123456789 is the hash and {_id: '1', someField: 'someValue'} is the data retrieved from the database.
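To make the key scheme concrete, here is a minimal sketch of how such a cache lookup could work. The names make_cache_key, cached_find_one and fetch are hypothetical and only for illustration, and the sketch uses a SHA-256 hex string as the key rather than an integer like the one in the example above.

```python
import hashlib
import json


def make_cache_key(db_name, collection_name, query):
    """Build a deterministic key from the database, collection and query."""
    # sort_keys=True makes logically equal queries serialise identically;
    # default=str is a fallback for values json cannot encode natively.
    payload = json.dumps([db_name, collection_name, query],
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_find_one(cached_documents, db_name, collection_name, query, fetch):
    """Return the cached result, calling fetch() only on a cache miss.

    fetch is a hypothetical callable standing in for the real database query.
    """
    key = make_cache_key(db_name, collection_name, query)
    if key not in cached_documents:
        cached_documents[key] = fetch(db_name, collection_name, query)
    return cached_documents[key]
```

The cache dictionary is passed in explicitly, which matches how cached_documents is handed to each method in the script.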
Holding this data in a local dictionary means that instead of having to query the databases each time, which is likely slow, I can access some frequently queried data locally. As mentioned, there are a lot of queries, so the dictionary can grow pretty large (several gigabytes). I have some code which uses psutil to look at how much memory is available on the machine running the script, and if the available memory gets below a certain threshold I clear the dictionary. The code to clear the dictionary is:
cached_documents.clear()
cached_documents = None
gc.collect()
cached_documents = {}
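For context, the memory check wrapped around that code could look roughly like the sketch below. The 2 GiB threshold, the function name clear_cache_if_low and the available parameter are assumptions made for illustration; psutil.virtual_memory().available is the real psutil call for available system memory.

```python
import gc

# Hypothetical threshold: clear the cache when less than 2 GiB is available.
MIN_AVAILABLE_BYTES = 2 * 1024 ** 3


def clear_cache_if_low(cached_documents, threshold=MIN_AVAILABLE_BYTES,
                       available=None):
    """Empty the cache when available system memory drops below threshold.

    available can be passed in explicitly (handy for testing); otherwise it
    is read from psutil.
    """
    if available is None:
        import psutil  # third-party package, only needed on this path
        available = psutil.virtual_memory().available
    if available < threshold:
        # Clear in place so every function holding this dict sees it emptied.
        cached_documents.clear()
        gc.collect()
        return True
    return False
```

This sketch clears the dictionary in place rather than rebinding the name, so any method that was handed the same dictionary object sees the emptied cache.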
I should point out that cached_documents is a local variable which gets passed into all the methods that either access or add to the cache. Unfortunately, it seems that this isn't enough to free the memory properly as Python is still holding onto a lot of extra memory, even after calling the above code. You can see a profile of the memory usage here:
Of note is the fact that the first few times the dictionary is cleared, a lot of memory is released back to the system, but each subsequent clear seems to release less. Eventually the memory usage flatlines: because Python is still holding onto a lot of memory, the available memory stays below the threshold, so the cache gets cleared extremely frequently.
Is there a way to force Python to free the memory properly when clearing the dictionary so that I avoid flatlining? Any tips are appreciated.