
In Python code that iterates over a sequence of 30 problems involving memory- and CPU-intensive numerical computations, I observe that the memory consumption of the Python process grows by ~800 MB at the beginning of each of the 30 iterations, and a MemoryError is finally raised in the 8th iteration (when the system's memory is in fact exhausted). However, if I import gc and run gc.collect() after each iteration, the memory consumption remains constant at ~2.5 GB and the code terminates nicely after solving all 30 problems. The code only uses the data of two consecutive problems, and there are no reference cycles (otherwise the manual garbage collection would not be able to keep the memory consumption down either).
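
For illustration, the loop has roughly the following structure; solve() and the big list allocation are placeholders standing in for the actual numerical step, not the real code:

```python
import gc

def solve(problem_index, previous_result):
    # Stand-in for the memory- and CPU-intensive numerical step; the large
    # list only mimics the working set of the real computation.
    return [float(problem_index)] * (10 ** 7)

previous = None
for i in range(30):
    previous = solve(i, previous)  # only the current and previous data are referenced
    gc.collect()                   # with this line memory stays flat; without it, it grows each iteration
```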

The question

This behavior raises the question whether Python tries to run the garbage collector before it raises a MemoryError. In my opinion this would be a perfectly sane thing to do, but perhaps there are reasons against it?

A similar observation to the above was made here: https://stackoverflow.com/a/4319539/1219479

andrenarchy
  • possible duplicate of [Details how python garbage collection works](http://stackoverflow.com/questions/4484167/details-how-python-garbage-collection-works) – aruisdante Mar 16 '14 at 17:49
  • The mentioned question and its answer cover garbage collection in general. Here, the question is very specific: is the garbage collector called before a `MemoryError` is raised? – andrenarchy Mar 16 '14 at 17:59

1 Answer


Actually, there are reference cycles, and it's the only reason why the manual gc.collect() calls are able to reclaim memory at all.

In Python (I'm assuming CPython here), the garbage collector's sole purpose is to break reference cycles. When none are present, reference counting destroys objects and reclaims their memory at the exact moment the last reference to them is lost.
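
To illustrate the difference (CPython-specific; the Node class and the weakrefs are just a way to observe when objects actually die):

```python
import gc
import weakref

class Node(object):
    pass

# Non-cyclic object: reclaimed the instant the last reference disappears.
a = Node()
ref_a = weakref.ref(a)
del a
print(ref_a() is None)   # True -- destroyed immediately by refcounting

# Cyclic pair: the refcounts never drop to zero, so only the GC can reclaim them.
b, c = Node(), Node()
b.other, c.other = c, b
ref_b = weakref.ref(b)
del b, c
print(ref_b() is None)   # False -- the cycle keeps both objects alive
gc.collect()
print(ref_b() is None)   # True -- the collector broke the cycle
```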

As for when the garbage collector is run, the full documentation is here: http://docs.python.org/2/library/gc.html

The TL;DR of it is that Python maintains internal counters of object allocations and deallocations. Whenever (allocations - deallocations) exceeds 700 (threshold 0), a garbage collection is run and both counters are reset.
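
You can watch that generation-0 counter at work with gc.get_count(); the numbers below are a sketch and will vary slightly between interpreter versions:

```python
import gc

gc.collect()                         # full collection; the counters drop back to (or near) zero
count0_before = gc.get_count()[0]
junk = [[] for _ in range(100)]      # ~100 new container objects bump the generation-0 counter
count0_after = gc.get_count()[0]
print(count0_after - count0_before)  # roughly 100; unrelated allocations add a little noise
```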

Every time a collection happens (either automatically, or manually via gc.collect()), generation 0 (all objects that haven't yet survived a collection) is collected: the tracked objects are walked through looking for reference cycles that your program can no longer reach, and any such cycles are broken, possibly leading to objects being destroyed because no references to them are left. All objects that survive that collection are moved to generation 1.

Every 10 collections (threshold 1), generation 1 is also collected, and all objects in generation 1 that survive that are moved to generation 2. Every 10 collections of generation 1 (that is, every 100 collections -- threshold 2), generation 2 is also collected. Objects that survive that are left in generation 2 -- there is no generation 3.

These 3 thresholds can be user-set by calling gc.set_threshold(threshold0, threshold1, threshold2).
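
For reference, a quick way to inspect and change these values (the defaults shown are what CPython ships with, as far as I know):

```python
import gc

print(gc.get_threshold())   # (700, 10, 10) by default: threshold0, threshold1, threshold2
print(gc.get_count())       # current counters for generations 0, 1 and 2

# Example: make generation-0 collections 10x less frequent, e.g. if allocation
# churn in a tight numerical loop makes automatic collections too costly.
gc.set_threshold(7000, 10, 10)
```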

What this all means for your program:

  1. The GC is not the mechanism CPython uses to reclaim memory (refcounting is). The GC breaks reference cycles in "dead" objects, which may lead to some of them being destroyed.
  2. No, there are no guarantees that the GC will run before a MemoryError is raised.
  3. You have reference cycles. Try to get rid of them (see the sketch right after this list for one way to locate them).
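
If you want to see where the cycles actually are, one option (a sketch, not the only way) is the collector's DEBUG_SAVEALL flag, which makes everything it finds unreachable end up in gc.garbage for inspection instead of being freed:

```python
import gc

gc.set_debug(gc.DEBUG_SAVEALL)   # keep everything the collector finds unreachable
gc.collect()
print(len(gc.garbage))           # how many objects were sitting in cycles
for obj in gc.garbage[:10]:      # peek at a few of them to see where they come from
    print(type(obj))
gc.set_debug(0)                  # turn the debug flag off again
del gc.garbage[:]                # drop the saved references so a later collection can free them
```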
Max Noel
  • Yes, CPython is used. So reference cycles are apparently not the correct term. What I meant is that the data of the first `i-1` iterations is not referenced in the `i+1`st iteration. – andrenarchy Mar 16 '14 at 18:30
  • A reference cycle means that somehow, object a has a reference to object b which, through some arbitrarily-long chain of references, has a reference back to object a. Which means that object a won't be immediately deallocated when your *program* no longer has access to it -- it will only be deallocated (*might* be, actually) when the GC runs. If you have reference cycles between objects of the `i`th iteration, they won't be destroyed when you start the `i+1`st iteration. – Max Noel Mar 16 '14 at 18:33
  • And his answer answers your question. CPython only does a GC run every 700 allocations. It does not check whether it can perform an allocation and, if that fails, run a collection, the way Java's GC system does. I imagine this is some form of esoteric performance optimization. – aruisdante Mar 16 '14 at 18:33
  • Thanks @max-noel for the explanation; number 2 in your summary answers my question. – andrenarchy Mar 16 '14 at 18:43