
I'm using multiprocessing.Pool to parallelize processing of some files. The code waits for a file to arrive, then sends it to a worker with Pool.apply_async, which processes the file.

This code is supposed to run indefinitely, so I never close the pool. However, this causes the pool to consume a lot of memory over time.

The code is something like this:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=PROCESS_COUNT) as pool:
        while True:
            f = wait_for_file()
            pool.apply_async(process_file, (f,))

How can I prevent the memory usage from growing without closing the pool?

user

1 Answer


Yes, if you allocate resources and never deallocate them, be it spawned processes or simply (chunks of) memory, you'll have fewer resources left for other tasks on your machine until you or your system willingly or forcefully releases them.

You may want to use the `maxtasksperchild` argument of `Pool` so that worker processes are recycled after a set number of tasks; e.g. if they allocate memory and you have a leak somewhere, you'll at least reclaim some resources.

Note: Worker processes within a Pool typically live for the complete duration of the Pool's work queue. A frequent pattern found in other systems (such as Apache, mod_wsgi, etc.) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before exiting and being cleaned up, with a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user.
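
For example, a minimal sketch of the loop from your question with maxtasksperchild set (the value 100 is arbitrary; PROCESS_COUNT, wait_for_file and process_file are the placeholders from the question):

from multiprocessing import Pool

if __name__ == "__main__":
    # Each worker exits after 100 tasks and is replaced by a fresh process,
    # so any memory leaked inside process_file() is given back to the OS.
    with Pool(processes=PROCESS_COUNT, maxtasksperchild=100) as pool:
        while True:
            f = wait_for_file()
            pool.apply_async(process_file, (f,))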

Alternatively, don't roll your own implementation of Pool: until you get it right it'll be buggy and you'll burn time unnecessarily. Instead use e.g. Celery (tutorial), which already has tests for the nasty corner cases you'd otherwise spend more time on than necessary.
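
As a rough sketch of how that could look (the broker URL, module name and the worker_max_tasks_per_child value below are illustrative assumptions, not something from the question):

# tasks.py -- illustrative Celery sketch
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker
# Recycle each worker process after 100 tasks, analogous to maxtasksperchild.
app.conf.worker_max_tasks_per_child = 100

@app.task
def process_file(path):
    ...  # the actual file processing goes here

The main loop would then call process_file.delay(f) instead of pool.apply_async(...), and a separately started worker (celery -A tasks worker) executes the tasks.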

Or, if you want to experiment a bit, here is a similar question which provides steps for managing a custom pool of worker processes.
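
A minimal sketch of that idea, under the assumption that you simply spawn a short-lived Process per file and cap concurrency with a semaphore, so each worker's memory is fully released when it exits (PROCESS_COUNT, wait_for_file and process_file are again the question's placeholders):

from multiprocessing import Process, Semaphore

def _worker(sem, path):
    try:
        process_file(path)           # the question's processing function
    finally:
        sem.release()                # free the slot even if processing fails

if __name__ == "__main__":
    sem = Semaphore(PROCESS_COUNT)   # at most PROCESS_COUNT workers at once
    while True:
        f = wait_for_file()
        sem.acquire()                # wait until a slot is free
        # A fresh process per file: when it exits, *all* of its memory is
        # returned to the OS, at the cost of extra start-up overhead.
        Process(target=_worker, args=(sem, f), daemon=True).start()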

Peter Badida
  • I get that, but how do I deallocate the resources used by the worker when it finishes in this instance, without using `pool.close()`? – user Jul 16 '21 at 15:55
  • 1
    You don't really, unless you write your own (auto)scaler (high-level e.g. like in Kubernetes) by a custom measurement. [`Pool`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool) is basically a convenient wrapper for manual [`Process`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process) spawning. If you want to have scaling abilities, you'll need to implement them based on the logic you want it to behave in. – Peter Badida Jul 16 '21 at 15:58
  • @user [ref for k8s autoscaler concept](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) to give you an idea. – Peter Badida Jul 16 '21 at 16:05
  • There is something I still don't get. Why are more and more resources allocated to the processes? When `process_file()` finishes running, shouldn't its garbage be collected in the worker process, like it is in the main process? – user Jul 16 '21 at 16:10
  • 1
    @user That really depends on the implementation of `process_file` function **and** its dependencies. Some packages/people love to do context stuff as globals hence causing a leak perhaps even intentionally. In the approach like a pool of processes however, the leak is just scaled by the number of processes, which can make the leak huge. – Peter Badida Jul 16 '21 at 16:13
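
To illustrate that last comment, here is a hypothetical process_file whose dependency caches data in a module-level global (the module and names are invented purely for illustration); every worker process holds its own copy of the growing cache, so the leak scales with the pool size:

# leaky_parser.py -- hypothetical module, invented for illustration only
_cache = {}  # module-level global: lives as long as the worker process does

def parse(path):
    with open(path, "rb") as fh:
        data = fh.read()
    _cache[path] = data      # every parsed file stays in memory forever
    return data

def process_file(path):
    return parse(path)       # _cache keeps growing in each worker separately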