I have a program that should run forever. Here is what I am doing:
import multiprocessing

from myfuncs import do, process, post_process


class Worker(multiprocessing.Process):
    def __init__(self, lock):
        multiprocessing.Process.__init__(self)
        self.lock = lock
        self.queue = Redis(...)      # Redis-backed job queue
        self.res_queue = Redis(...)  # Redis-backed result queue

    def run(self):
        while True:
            job = self.queue.get(block=True)
            job.results = process(job)
            with self.lock:
                post_process(self.res_queue, job)


def main():
    lock = multiprocessing.Semaphore(1)  # binary semaphore used as a lock
    ps = [Worker(lock) for _ in xrange(4)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()


if __name__ == '__main__':
    main()
self.queue and self.res_queue are two objects that work similarly to the Python stdlib Queue, but they use a Redis database as the backend.
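For reference, the wrapper behaves roughly like this sketch (assuming redis-py and pickle; the class below is illustrative, not my actual code):

    import pickle
    import redis

    class RedisQueue(object):
        # Illustrative stand-in -- the real class has more to it.
        def __init__(self, name, host='localhost', port=6379):
            self.name = name  # Redis list key used for this queue
            self.conn = redis.StrictRedis(host=host, port=port)

        def put(self, item):
            # Serialize the item and push it onto the tail of the list.
            self.conn.rpush(self.name, pickle.dumps(item))

        def get(self, block=True):
            # BLPOP blocks until an item arrives and returns (key, value).
            # (The non-blocking path is omitted for brevity.)
            _, data = self.conn.blpop(self.name)
            return pickle.loads(data)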
The process function does some processing on the data the job carries (mostly HTML parsing) and returns a dictionary.
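A stripped-down version would look something like this (the real one does more; lxml, job.data, and the field names here are just for illustration):

    import lxml.html

    def process(job):
        # job.data is assumed to hold the raw HTML for this sketch.
        tree = lxml.html.fromstring(job.data)
        return {
            'title': tree.findtext('.//title'),
            'links': [a.get('href') for a in tree.findall('.//a')],
        }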
The post_process function writes the job to another Redis queue after checking some criteria (only one process at a time may check the criteria, which is why the lock is there). It returns True/False.
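In spirit it does something like this (meets_criteria is a hypothetical placeholder for the actual check):

    def post_process(res_queue, job):
        # The caller holds the lock, so only one worker at a time
        # can run this check-then-push sequence.
        if meets_criteria(job):  # hypothetical stand-in for the real check
            res_queue.put(job)
            return True
        return False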
The memory used by the program increases every day. Can somebody figure out what is going on?
Memory should be freed when job goes out of scope in the run method, correct?
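For what it's worth, this is roughly how I watch the growth (a sketch using the stdlib resource module; note that ru_maxrss is the peak RSS and is reported in kilobytes on Linux):

    import os
    import resource

    def log_rss(tag):
        # Peak resident set size of the calling process, in kB on Linux.
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print('%s: pid=%d maxrss=%d kB' % (tag, os.getpid(), rss))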