
I have a Python program that runs a series of experiments, with no data intended to be stored from one test to another. My code contains a memory leak which I am completely unable to find (I've looked at the other threads on memory leaks). Due to time constraints, I have had to give up on finding the leak, but if I were able to isolate each experiment, the program would probably run long enough to produce the results I need.

  • Would running each test in a separate thread help?
  • Are there any other methods of isolating the effects of a leak?

Detail on the specific situation

  • My code has two parts: an experiment runner and the actual experiment code.
  • Although no globals are shared between the code for running all the experiments and the code used by each experiment, some classes/functions are necessarily shared.
  • The experiment runner isn't just a simple for loop that can be easily put into a shell script. It first decides which tests need to be run given the configuration parameters, then runs the tests, then outputs the data in a particular way.
  • I tried manually calling the garbage collector in case the issue was simply that garbage collection wasn't being run, but this did not work.
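
For clarity, the manual collection I tried was along these lines (a sketch only; `run_experiment` is a stand-in for my actual test code):

```python
import gc

def run_experiment(config):
    # placeholder for the real per-test entry point; allocates and drops data
    data = [object() for _ in range(1000)]
    return len(data)

def run_all(experiments):
    results = []
    for experiment in experiments:
        results.append(run_experiment(experiment))
        # force a full collection between tests, in case cyclic garbage
        # is what keeps memory growing
        gc.collect()
    return results
```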

Update

Gnibbler's answer has actually allowed me to find out that my ClosenessCalculation objects which store all of the data used during each calculation are not being killed off. I then used that to manually delete some links which seems to have fixed the memory issues.

Casebash
    define "memory leak" in python. – hasen Oct 29 '09 at 02:02
  • I mean, you can't possibly "forget" to free any memory; it's GCed. – hasen Oct 29 '09 at 02:02
  • How can you tell you have a memory leak? Is it that your process memory grows to a large size, and never shrinks? If so, be advised that Python doesn't necessarily return memory to the OS just because it's no longer using it. – David Seiler Oct 29 '09 at 02:03
  • 2
    It not only grows, but keeps on getting larger – Casebash Oct 29 '09 at 02:28
  • Plus if I run only a single case, rather than running all of my cases, it works – Casebash Oct 29 '09 at 02:30
  • @hasen There are a few things that prevent the gc. Defining a `__del__` method is one example – John La Rooy Oct 29 '09 at 02:32
  • It is possible to leak memory in Python if you have stray references sitting around into a data structure (e.g. a class that remembers its instances, or a results array that contains references to tree nodes that themselves contain references into the rest of the tree). The GC will also be unable to collect objects if your classes have both `__del__` methods and circular references. – Theran Oct 29 '09 at 02:32
  • Casebash, do you use memoization? – unutbu Oct 29 '09 at 02:55
  • Yes, but that should only be storing strings and integers. I will double check – Casebash Oct 29 '09 at 03:05
  • And it is only used by the experiment runner not the experiments – Casebash Oct 29 '09 at 03:05
  • What are the arguments to the method/function getting memoized? Typically, memoization works by saving a dictionary which maps args to computed values. If an object associated with an experiment is used as an argument to the memoized method, then that object may never get garbage collected. – unutbu Oct 29 '09 at 03:27
  • You should accept gnibbler's answer. It's not only a good answer, it actually solved your problem! – steveha Oct 29 '09 at 05:55
  • 3
    You can have a "memory leak" in just about any programming language: *even one with a GC* -- it is simply the reduction of available memory caused by lack of reclamation of previous allocations (which in a GC-enabled language simply means that [and increasing amount of] objects are never made eligible for reclamation). That's right, a GC makes programming easier by eliminating huge burden, it doesn't eliminate bad code or poor designs. –  Oct 29 '09 at 06:05

4 Answers


You can use something like this to help track down memory leaks

>>> from collections import defaultdict
>>> from gc import get_objects
>>> before = defaultdict(int)
>>> after = defaultdict(int)
>>> for i in get_objects():
...     before[type(i)] += 1 
... 

Now suppose the test leaks some memory:

>>> leaked_things = [[x] for x in range(10)]
>>> for i in get_objects():
...     after[type(i)] += 1
... 
>>> print [(k, after[k] - before[k]) for k in after if after[k] - before[k]]
[(<type 'list'>, 11)]

It reports 11 because we have leaked one list containing 10 more lists.
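
The same idea can be wrapped into reusable helpers (a sketch in Python 3 syntax; the function names are my own):

```python
from collections import defaultdict
from gc import get_objects

def count_types():
    # snapshot: how many live objects of each type the GC currently tracks
    counts = defaultdict(int)
    for obj in get_objects():
        counts[type(obj)] += 1
    return counts

def grown_types(before, after):
    # types whose live-object count increased between two snapshots
    return dict((k, after[k] - before[k])
                for k in after if after[k] - before[k] > 0)

before = count_types()
leaked_things = [[x] for x in range(10)]  # simulated leak
after = count_types()
growth = grown_types(before, after)       # list count grows by at least 11
```

Taking a snapshot before and after each experiment and printing the growth per type quickly points at which kind of object is piling up.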

John La Rooy

Threads would not help. If you must give up on finding the leak, then the only way to contain its effects is to run a new process once in a while (e.g., when a test has left overall memory consumption too high for your liking -- you can determine VM size easily by reading /proc/self/status on Linux, with similar approaches on other OSes).
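
A sketch of the /proc/self/status approach (Linux only; the field names are those documented in proc(5), and the values there are reported in kB):

```python
def vm_field_kb(field="VmSize"):
    # parse one memory field out of /proc/self/status, e.g. a line like
    # "VmSize:   123456 kB"; raises KeyError if the field is absent
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)
```

Checking `vm_field_kb("VmRSS")` after each test and restarting once it crosses a threshold is one way to trigger the hand-off described above.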

Make sure the overall script takes an optional parameter to tell it what test number (or other test identification) to start from, so that when one instance of the script decides it's taking up too much memory, it can tell its successor where to restart from.

Or, more solidly, make sure that as each test is completed its identification is appended to some file with a well-known name. When the program starts it begins by reading that file and thus knows what tests have already been run. This architecture is more solid because it also covers the case where the program crashes during a test; of course, to fully automate recovery from such crashes, you'll want a separate watchdog program and process to be in charge of starting a fresh instance of the test program when it determines the previous one has crashed (it could use subprocess for the purpose -- it also needs a way to tell when the sequence is finished: e.g., a normal exit from the test program could signify that, while any crash or exit with a status != 0 would signify the need to start a new fresh instance).
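
A minimal sketch of that checkpoint-file-plus-watchdog architecture (all names here are my own: `runner.py` is a hypothetical worker that runs the tests it is given and appends each finished id to the checkpoint file):

```python
import subprocess
import sys

DONE_FILE = "done_tests.txt"                 # checkpoint: one test id per line
ALL_TESTS = ["test_a", "test_b", "test_c"]   # placeholder test ids

def completed():
    # read the checkpoint file; a missing file just means nothing is done yet
    try:
        with open(DONE_FILE) as f:
            return set(line.strip() for line in f if line.strip())
    except IOError:
        return set()

def watchdog():
    while True:
        remaining = [t for t in ALL_TESTS if t not in completed()]
        if not remaining:
            break  # everything is checkpointed as done
        # exit status 0 means the worker finished its whole batch;
        # anything else means it crashed and we start a fresh instance
        status = subprocess.call([sys.executable, "runner.py"] + remaining)
        if status == 0:
            break
```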

If these architectures appeal but you need further help implementing them, just comment to this answer and I'll be happy to supply example code -- I don't want to do it "preemptively" in case there are as-yet-unexpressed issues that make the architectures unsuitable for you. (It might also help to know what platforms you need to run on).

Alex Martelli

I had the same problem with a leaking third-party C library. The cleanest workaround I could think of was to fork and wait. The advantage is that you don't even have to create a separate process after each run; you can define the size of your batch.

Here's a general solution (if you ever find the leak, the only change you need to make is to change run() to call run_single_process() instead of run_forked() and you'll be done):

import os, sys
batchSize = 20

class Runner(object):
    def __init__(self, dataFeedGenerator, dataProcessor):
        self._dataFeed = dataFeedGenerator
        self._caller = dataProcessor

    def run(self):
        self.run_forked()

    def run_forked(self):
        dataSubFeed = []
        for dataMorsel in self._dataFeed:
            dataSubFeed.append(dataMorsel)
            if len(dataSubFeed) == batchSize:
                self._runBatch(dataSubFeed)
                dataSubFeed = []
        if dataSubFeed:
            self._runBatch(dataSubFeed)  # flush the final partial batch

    def _runBatch(self, dataSubFeed):
        self._dataFeed = dataSubFeed
        self.fork()
        if self._child_pid == 0:
            self.run_single_process()
        self.endBatch()

    def run_single_process(self):
        for dataMorsel in self._dataFeed:
            self._caller(dataMorsel)

    def fork(self):
        self._child_pid = os.fork()

    def endBatch(self):
        if self._child_pid != 0:
            os.waitpid(self._child_pid, 0)  # parent waits for the batch's child
        else:
            sys.exit()  # exit from the child when done

This isolates the memory leak to the child process, and the leak can never accumulate for more than batchSize iterations before the process holding it exits.
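
On platforms without os.fork (e.g. Windows), the multiprocessing module gives the same one-process-per-batch isolation; a sketch of that variant, not part of the original answer:

```python
from multiprocessing import Process

def process_batch(batch, processor):
    # runs entirely inside the worker, so anything leaked dies with it
    for item in batch:
        processor(item)

def run_batches(data, processor, batch_size=20):
    items = list(data)
    for start in range(0, len(items), batch_size):
        # same fork-and-wait pattern: one fresh process per batch
        worker = Process(target=process_batch,
                         args=(items[start:start + batch_size], processor))
        worker.start()
        worker.join()
```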

Dmitry Rubanovich

I would simply refactor the experiments into individual functions (if they aren't structured that way already), then accept an experiment number on the command line and call the corresponding experiment function.

Then just bodge up a shell script as follows:

#!/bin/bash

for expnum in 1 2 3 4 5 6 7 8 9 10 11 ; do
    python youProgram ${expnum} otherParams
done

That way, you can leave most of your code as-is and this will clear out any memory leaks you think you have in between each experiment.
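
The Python side of that dispatch might look something like this (a sketch; the experiment functions and their numbering are placeholders, not the asker's actual code):

```python
import sys

def experiment_1():
    return "result of experiment 1"

def experiment_2():
    return "result of experiment 2"

# map experiment numbers to their functions; real experiments go here
EXPERIMENTS = {1: experiment_1, 2: experiment_2}

def main(argv):
    expnum = int(argv[1])   # first positional argument selects the test
    print(EXPERIMENTS[expnum]())

if __name__ == "__main__":
    main(sys.argv)
```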

Of course, the best solution is always to find and fix the root cause of a problem but, as you've already stated, that's not an option for you.

Although it's hard to imagine a memory leak in Python, I'll take your word on that one - you may want to at least consider the possibility that you're mistaken there, however. Consider raising that in a separate question, something that we can work on at low priority (as opposed to this quick-fix version).

Update: Making this community wiki since the question has changed somewhat from the original. I'd delete the answer but for the fact I still think it's useful - you could do the same to your experiment runner as I proposed with the bash script; you just need to ensure that the experiments are separate processes so that memory leaks don't carry over (if the leaks are in the runner, you're going to have to do root cause analysis and fix the bug properly).

paxdiablo
  • I did consider just writing a shell script, but unfortunately my experimental code is much more complex than that – Casebash Oct 29 '09 at 02:44
  • 1
    @paxdiablo: I agree that this answer should be left as it could be helpful for anyone else who visits this question – Casebash Oct 29 '09 at 11:51