
I have a big pickle file containing hundreds of trained R models in Python: these are statistical models built with the rpy2 library.

I have a class that loads the pickle file every time one of its methods is called (and this method is called several times in a loop). The memory required to load the pickle file's content (around 100 MB) is never freed, even when no reference to the loaded content remains. I correctly open and close the input file, and I have also tried reloading the pickle module (and even rpy2) at every iteration. Nothing changes. It seems that merely loading the content permanently locks up some memory.
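Schematically, the method looks something like this (class and method names here are illustrative, not my real code):

import pickle

class ModelStore(object):  # illustrative name
    def score(self, trained_models_file):  # called several times in a loop
        with open(trained_models_file, 'r') as file_:
            models = pickle.load(file_)
        # models goes out of scope here; no reference to the loaded
        # content survives the call, yet the memory is never freed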

Marco Mene
  • Related? http://stackoverflow.com/questions/16288936/how-do-i-prevent-memory-leak-when-i-load-large-pickle-files-in-a-for-loop?rq=1 – Daenyth Dec 10 '15 at 16:31
  • I had already read that. My case is different: there, a reference points to the loaded content. I do something like `with open(trained_models_file, 'r') as file_: pickle.load(file_)`. The memory should be freed when the call to the method finishes. – Marco Mene Dec 10 '15 at 16:37
  • I don't think the GC is guaranteed to be called at any particular point, even if a resource is free. – Daenyth Dec 10 '15 at 16:42
  • Of course it's not guaranteed. But it should definitely be called before the memory leaks and the program eventually terminates. Anyway, I memory-profiled the code and saw that the GC is called, because part of the other memory used by the program gets freed. But not the memory tied to loading this pickle file. – Marco Mene Dec 10 '15 at 16:45
  • If you make a single script that just loads the pickles in an endless loop, with each step sleeping for some amount of time, do you observe the leak? – Daenyth Dec 10 '15 at 17:49
  • @MarcoMene: this might be an issue, therefore worthy of an entry in rpy2's issue tracker, but a small snippet to reproduce the issue would go a long way toward having it looked at quickly. – lgautier Dec 10 '15 at 21:55
  • @Daenyth: yes, I observe the leak. @Daenyth @lgautier: here is a simple piece of code that reproduces the leak, already tested on my machine:

    import pickle

    def test_memory_leak():
        print "\n\n test test_memory_leak"
        file_name = '/Users/marcomeneghelli/Git/crystal-api/datascience/trained_models/trained_cat_grossing_model.txt'
        while True:
            with open(file_name, 'r') as file_:
                pickle.load(file_)

    if __name__ == '__main__':
        test_memory_leak()

    – Marco Mene Dec 11 '15 at 08:47
  • Eventually it causes a stack overflow:

    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py:1133: UserWarning: Error: protect(): protection stack overflow
      value = func(*args)
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py:1133: UserWarning: During startup -
      value = func(*args)
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py:1133: UserWarning: Warning message:
      value = func(*args)

    Process finished with exit code 134

    – Marco Mene Dec 11 '15 at 08:50
  • @MarcoMene File a bug ticket with that – Daenyth Dec 11 '15 at 13:15

2 Answers


I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects

edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
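
A quick way to check whether the installed rpy2 includes the fix is to inspect its version string:

import rpy2
print rpy2.__version__  # needs to be '2.7.5' or later to include the fix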

lgautier
  • Great! What exactly was the problem? And when are you releasing the fixed version? – Marco Mene Dec 11 '15 at 16:54
  • The R object extracted from the pickle was not freed properly (in rpy2's C code). rpy2 2.7.5 is now released and on pypi. – lgautier Dec 11 '15 at 21:08

If you follow this advice, please do so tentatively: I am not 100% sure of this solution, but I wanted to try to help you if I could.

In Python, garbage collection doesn't use reference counting anymore; that is the scheme where Python tracks how many objects reference a given object, then removes it from memory once nothing references it any longer.

Instead, Python uses scheduled garbage collection: rather than collecting immediately, Python defers the work to a scheduled collection pass. Python switched to this system because calculating references can slow programs down (especially when it isn't needed).

In the case of your program, even though you no longer point to certain objects, Python might not have gotten around to freeing them from memory yet, so you can trigger collection manually using:

import gc

gc.enable()  # enable manual garbage collection
gc.collect() # check for garbage collection
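
Applied to the loading loop from your question, that would look something like this (just a sketch; `trained_models_file` stands for the path to your pickle file):

import gc
import pickle

for _ in range(100):  # each iteration reloads the pickled models
    with open(trained_models_file, 'r') as file_:
        pickle.load(file_)
    gc.collect()  # explicitly run a collection pass after each load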

If you would like to read more, see the documentation for Python's `gc` module. I hope this helps, Marco!

mmghu
  • Thanks for the advice. Nevertheless, I already tried this, with no effect. – Marco Mene Dec 10 '15 at 16:59
  • Reference counting is still used very much in CPython. You'll find the following at the beginning of the documentation page you are using: "the collector supplements the reference counting already used in Python". Besides that, `gc.enable()` does not enable manual garbage collection but *automatic* garbage collection, and it is absolutely not needed in order to run `gc.collect()`. – lgautier Dec 10 '15 at 21:52