
I have been dabbling with Python's multiprocessing library, and although it provides an incredibly easy-to-use API, its documentation is not always very clear. In particular, I find the `maxtasksperchild` argument passed to an instance of the `Pool` class very confusing.

The following comes directly from Python's documentation (3.7.2):

maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.

The above raises more questions for me than it answers. Is it bad for a worker process to live as long as the pool? What makes a worker process 'fresh', and when is that desired? In general, when should you set `maxtasksperchild` explicitly instead of letting it default to `None`, and what are the best practices for maximizing processing speed?

From @Darkonaut's amazing answer on chunksize I now understand what chunksize does and represents. Since the value supplied for chunksize affects the number of 'tasks', I was wondering whether there are any considerations about how the two parameters interact that should be taken into account to ensure maximum performance?

Thanks!

Marnix.hoh

1 Answer


Normally you don't need to touch this. Sometimes problems can arise with code that calls outside of Python, for example leaking memory. Limiting the number of tasks a worker process completes before it gets replaced helps, because the "unused resources" it erroneously accumulates are released when the process is scrapped. Starting a new, "fresh" process then keeps the problem contained. Because replacing a process takes time, for performance you leave `maxtasksperchild` at its default. If you run into unexplainable resource problems some day, you can try setting `maxtasksperchild=1` to see if that changes anything. If it does, it's likely that something is leaking.
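For illustration, here is a minimal sketch (function and parameter choices are my own, not from the answer) that makes worker replacement visible by printing each worker's PID. With `maxtasksperchild=1` every task is handled by a fresh process; with the default `None` the same two PIDs would repeat:

```python
# Sketch with hypothetical names: demonstrate worker replacement.
import os
from multiprocessing import Pool

def work(x):
    # Return the worker's PID along with the result so replacement is visible.
    return x * x, os.getpid()

if __name__ == '__main__':
    # chunksize=1 makes every item its own task; a "task" is one chunk,
    # so chunksize also determines how quickly maxtasksperchild is reached.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        for result, pid in pool.map(work, range(6), chunksize=1):
            print(result, pid)
```

Note that `maxtasksperchild` counts tasks, not individual items of the iterable: with `pool.map()` one task is one chunk of `chunksize` items, which is exactly where the two parameters from the question interact.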

Darkonaut
  • Thank you very much for your quick and clear answer @Darkonaut! I was secretly hoping that YOU would see my question and answer it, since you appear to be the main expert here on SO regarding python's multiprocessing.pool class. Thanks again! – Marnix.hoh Mar 04 '19 at 01:47
  • @Marnix.hoh You're welcome! Pretty sure your phrase about the "expert" is not true, but thanks for your feedback ;) – Darkonaut Mar 04 '19 at 01:57
  • Haha I think you're too modest ;). I have another question actually, and I was wondering if you happen to know the answer off the top of your head. I want to use pool.map() to apply a function on a list of objects, where the function modifies a property on each of the objects. Is there a way to make this work using 'map', or should I use a different method on 'pool'? – Marnix.hoh Mar 04 '19 at 02:58
  • @Marnix.hoh The objects will be copied as soon as you use `multiprocessing.Pool` with _any_ pool-method, so you don't modify the objects you have in your parent; you create new objects in your worker processes. – Darkonaut Mar 04 '19 at 03:28
  • Yeah, that's what I found as well. So there is no quick way to pass them to Pool.map() by reference? I am kind of stuck here. I tried looking into things like managers, proxies, and shared-memory methods like Array() and Value(), but none of these seem to work... Do you have any ideas on how to pass objects to Pool() effectively by reference? Thank you so much – Marnix.hoh Mar 04 '19 at 21:07
  • @Marnix.hoh It's not really clear to me what you want to achieve or what your needs are, so I cannot just point you to one solution that will work for you. Different processes don't share their memory by default, so there cannot be just references passed, because each process has its own virtual address space. In case your objects are small and only one process needs to modify one and the same object once, you could just pass it within the `iterable` you pass into `pool.map()`, let the specified function call your method on it and let it return the object... – Darkonaut Mar 04 '19 at 22:25
  • @Marnix.hoh... There will be copying, but that need not be a problem in every scenario. If you need multiple processes to modify one and the same complex object, using managers and proxies might be an option, or you could look into something like [ray](https://github.com/ray-project/ray). – Darkonaut Mar 04 '19 at 22:26
  • I ran some timed tests and I think I settled on my 'solution'. To expand a little bit on what I want to do: I have a list of instances of the same class, and I want to apply a function on each in order to update the same property on each instance. Preferably I would pass these instances by reference to map, but since that apparently is not an option, I instead pass the list of instances to map and have it return a list of only the updated property values (see the sketch after this thread). After this, I update the properties of the instances in the list using the results returned by map. – Marnix.hoh Mar 04 '19 at 22:54
  • According to some speed tests that I ran, the last step (i.e. updating the properties of the instances in the list) is pretty fast compared to the actual map(), so I think this way of doing things is acceptable. Please let me know if you have any suggestions. – Marnix.hoh Mar 04 '19 at 22:55
  • @Marnix.hoh Can't really suggest something else you could implement that easily. In case your OS supports forking (not Windows), you could let your worker processes inherit your objects as globals; that way they wouldn't have to be sent over a queue on the way in. But that would be non-trivial and ugly to implement with Pool (via the `initializer` parameter and some coordination...). Easier with `multiprocessing.Process` and a `multiprocessing.Queue` for returning the values ([example](https://stackoverflow.com/a/53432672/9059420)). – Darkonaut Mar 05 '19 at 01:51
  • You can also do something like add an `id` attribute to the instances and pass a list of IDs to map. Then each subprocess gets the instances with the given IDs and processes those. That way you pass a simple data type (int) to map, which is fast, and you let the subprocesses do the work of fetching and processing the instances. – Bram Vanroy Apr 29 '19 at 12:18
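To make the pattern discussed in this thread concrete, here is a minimal sketch (the `Item` class and its attributes are hypothetical, not from the discussion): only the input values are passed to `pool.map()`, the workers return the computed values, and the parent process then writes the results back onto its original instances:

```python
from multiprocessing import Pool

class Item:
    # Hypothetical example class standing in for the instances discussed above.
    def __init__(self, value):
        self.value = value
        self.result = None  # filled in by the parent after the map

def compute(value):
    # Runs in a worker process on a pickled copy of the input value.
    return value ** 2

if __name__ == '__main__':
    items = [Item(v) for v in range(10)]
    with Pool() as pool:
        # Send only the inputs; collect only the outputs.
        results = pool.map(compute, [item.value for item in items])
    # Back in the parent: update the original instances with the results.
    for item, result in zip(items, results):
        item.result = result
    print([item.result for item in items])
```

The instances themselves never cross a process boundary, so only the small input and output values pay the pickling cost.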