I am working on a Python script that queries several different databases to collate data and persist it to another database. The script collects data from potentially millions of records across about 15 different databases. To speed it up, I have included some caching functionality, which boils down to a dictionary that holds some frequently queried data. The dictionary holds key/value pairs where the key is a hash generated from the database name, collection name, and query conditions, and the value is the data retrieved from the database. For example:

`{123456789: {'_id': '1', 'someField': 'someValue'}}`, where `123456789` is the hash and `{'_id': '1', 'someField': 'someValue'}` is the data retrieved from the database.
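
For illustration, here is a minimal sketch of how such a key could be built; the `make_cache_key` helper and the choice of `hashlib`/`json` are my own assumptions, not taken from the actual script:

import hashlib
import json

def make_cache_key(db_name, collection_name, query_conditions):
    # Serialise the identifying parts of the query deterministically,
    # then hash them into a single stable cache key.
    raw = json.dumps([db_name, collection_name, query_conditions], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cached_documents = {}
key = make_cache_key("db1", "users", {"_id": "1"})
cached_documents[key] = {"_id": "1", "someField": "someValue"}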

Holding this data in a local dictionary means that instead of having to query the databases each time, which is likely slow, I can access some frequently queried data locally. As mentioned, there are a lot of queries, so the dictionary can grow pretty large (several gigabytes). I have some code which uses `psutil` to look at how much memory is available on the machine running the script, and if the available memory drops below a certain threshold I clear the dictionary. The code to clear the dictionary is:

import gc

cached_documents.clear()  # drop every reference the dict holds
cached_documents = None   # unbind the dict itself so it can be collected
gc.collect()              # ask the garbage collector to reclaim the objects
cached_documents = {}     # start over with a fresh, empty cache
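
For reference, a minimal sketch of the threshold check that triggers this, assuming a hypothetical `MEMORY_THRESHOLD_BYTES` constant and `maybe_clear_cache` helper (neither name is from the actual script):

import gc
import psutil

MEMORY_THRESHOLD_BYTES = 2 * 1024 ** 3  # hypothetical 2 GiB floor

def maybe_clear_cache(cached_documents):
    # psutil reports how much memory can still be handed out to
    # new allocations without the system swapping.
    if psutil.virtual_memory().available < MEMORY_THRESHOLD_BYTES:
        cached_documents.clear()
        gc.collect()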

I should point out that `cached_documents` is a local variable which gets passed into all the methods that either access or add to the cache. Unfortunately, this doesn't seem to be enough to free the memory properly, as Python is still holding onto a lot of extra memory even after the above code runs. You can see a profile of the memory usage here:

[Image: memory usage profile of the script over time]

Of note is that the first few times the dictionary is cleared, a lot of memory is released back to the system, but each subsequent clear releases less. Eventually the memory usage flatlines: because Python holds onto so much memory, the available system memory stays within the threshold, so the cache gets cleared extremely frequently.

Is there a way to force Python to properly release the memory when clearing the dictionary, so that I can avoid this flatlining? Any tips are appreciated.

  • Freeing objects doesn't necessarily return the memory to the OS, so the process size doesn't shrink. It just makes it available for allocation to other Python objects. – Barmar Jun 04 '20 at 00:37
  • AFAIK, the only way to reliably return memory to the OS is to end the process. – user2357112 Jun 04 '20 at 00:51
  • Python returns unused object space to its heaps, but there is little chance that an entire heap clears, so it doesn't even bother to figure out whether it can be returned to the system. – tdelaney Jun 04 '20 at 01:04
  • Not really related, but you should try *not* putting your record data into dict objects. That's extremely inefficient. Use `namedtuple`s or a slotted class, so something like `{hash_value: namedtuple_record}`. – juanpa.arrivillaga Jun 04 '20 at 01:08
  • @juanpa.arrivillaga Thanks for your input. How would this work when retrieving an item from the cache? Presumably you'd need a list of `namedtuple`s which you'd have to iterate over to find the tuple with a certain `hash_value`? That seems much less efficient than dictionary access. Or have I misunderstood? – oreid Jun 04 '20 at 01:24
  • No, I meant for the values of the keys in your big cache dict. Don't use a tuple as your cache; again, it's not really related because your dict will still grow, but if you are worried about memory usage to begin with, use `Record = namedtuple('Record', 'id some_field')` and then `cached_documents[hash_doc(document)] = Record(id, some_field_val)` etc. – juanpa.arrivillaga Jun 04 '20 at 01:25
  • So for example, on Python 3.8 `sys.getsizeof(Record(None, None))` gives 56, whereas `sys.getsizeof(dict(id=None, some_field=None))` gives 232. – juanpa.arrivillaga Jun 04 '20 at 01:30
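
To make the size comparison in these comments concrete, here is a short sketch; the `Record` definition mirrors the comment above, and the byte counts are the ones quoted there for CPython 3.8 on a 64-bit build:

import sys
from collections import namedtuple

Record = namedtuple('Record', 'id some_field')

# A namedtuple instance is far smaller than an equivalent dict because it
# stores no per-instance key strings, only a fixed-size array of values.
print(sys.getsizeof(Record(None, None)))              # 56 per the comment (CPython 3.8)
print(sys.getsizeof(dict(id=None, some_field=None)))  # 232 per the comment (CPython 3.8)

# Used as cache values: the dict keys stay the same, only the values shrink.
cached_documents = {123456789: Record('1', 'someValue')}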

1 Answer

Based on the comments on my original post, I made some changes.

As mentioned in the comments, Python does not reliably return memory to the operating system until a process ends. In some applications, this means you can spin up a separate process to do your memory-intensive work. See Releasing memory in Python for more details.
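
A minimal sketch of that pattern, assuming a hypothetical `process_batch` function standing in for the memory-intensive work:

import multiprocessing

def process_batch(batch):
    # Hypothetical stand-in for the memory-intensive work; everything
    # allocated here lives in the worker process, not the parent.
    return [record for record in batch]

if __name__ == '__main__':
    with multiprocessing.Pool(processes=1) as pool:
        result = pool.apply(process_batch, ([1, 2, 3],))
    # Once the worker process exits, all of its memory is returned
    # to the operating system, unlike memory freed in-process.
    print(result)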

Unfortunately, this isn't applicable in my case, since the whole point is to have the data in memory when it's required.

Since Python holds some of the allocated memory and makes it available for other Python objects, I updated the criteria my script uses to clear the cache. Instead of basing this on available system memory, I clear the cache based on its size. The rationale is that I can keep filling the cache, reusing the memory Python is already holding. I found the cache-size threshold by taking a rough average of the cache's size at the first couple of clears in the graph in my question, then reduced that number slightly to add some leeway (e.g. a cache of size 10 can use different amounts of memory depending on what's inside it).
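
A minimal sketch of that size-based check, assuming a hypothetical `MAX_CACHE_ENTRIES` constant (the real threshold came from the graph as described above):

import gc

MAX_CACHE_ENTRIES = 500_000  # hypothetical value derived from profiling

def add_to_cache(cached_documents, key, document):
    # Clear based on the number of entries rather than available system
    # memory, so the memory Python already holds is reused for new entries.
    if len(cached_documents) >= MAX_CACHE_ENTRIES:
        cached_documents.clear()
        gc.collect()
    cached_documents[key] = document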

This is less safe than clearing the cache based on available memory, because the cache could grow bigger than the memory available on the system, causing out-of-memory errors, especially if other memory-hungry processes run on the same machine; for my use case, however, this was a suitable trade-off.

Now that the cache is cleared based on its size rather than available system memory, I seem to be able to take advantage of the memory Python holds onto. This may not be a perfect answer, but in my case it seems to work.
