
I have a Python class along the lines of the following:

from multiprocessing import Pool

class MapFitter:

  def __init__(self, path):
    # load a big data file
    self.data = load_large_data_file(path)

  def fit_model(self, data):
    # fit some model to some data
    model.fit(data)
    return model

  def main(self):
    # fit models to everything in self.data
    with Pool() as p:
      models = p.map(self.fit_model, self.data)
    return models

And then run it with the following:

fitter = MapFitter(path)
if __name__ == "__main__":
  models = fitter.main()

So basically `self.data` contains a lot of variables that each need a model fitted to them, and this aims to parallelise that process. My question is: when `p.map` is used in `main`, is a new class instance created in each worker, with `__init__` called again each time? I can't seem to find an answer to this question, but when I use this on data that is not stored locally it runs much slower, and I can see in Task Manager that the worker processes spin up and then start hitting the network, which suggests that each one is reloading the data.

(I am new to multiprocessing so please suggest a better way to do this if I'm doing something obviously wrong)

  • No, `__init__` won't be called several times. Multiprocessing (as the name suggests) spawns multiple processes. And processes copy memory that they access. Meaning your `self.data` will be copied several times, even when you only read from it (because in Python there's a ref counter that gets incremented). But if you see unintentional network activity in each process, then the problem is elsewhere, maybe in `model.fit(data)` call. – freakish Jul 04 '22 at 09:58
  • @freakish: It's worse than that; passing `self.fit_model` to `pool.map` means the instance is pickled and sent over IPC as part of every task, unpickled in the child, and that copy is used instead of the copy it inherited via copy-on-write mappings. The original global copy might not be copied (if it's a huge `numpy` array, refcounts aren't involved in the individual data items, so it might copy a page of memory for the object header, but not the whole array if you never use it), but you make per-task copies via `unpickle`ing. – ShadowRanger Jul 04 '22 at 10:01
  • @ShadowRanger fair enough. Still, this doesn't answer the network activity issue. I'm not sure if that is a duplicate, assuming OP's observation is correct. – freakish Jul 04 '22 at 10:02
  • @freakish Ok thanks. Really not sure where this network activity is then... the hunt continues! – Dan Kingswell Jul 04 '22 at 10:09
  • Useful [link](https://stackoverflow.com/questions/72708828/how-to-create-the-attribute-of-a-class-object-instance-on-multiprocessing-in-pyt/72711246#72711246) and [link](https://stackoverflow.com/questions/72722711/processpoolexecutor-does-not-mutate-instance-variable-when-submitting-instance-m/72726998#72726998). This question should not have been closed in my opinion, at least not with the linked question as a duplicate. – Charchit Agarwal Jul 04 '22 at 14:25
  • @DanKingswell What OS are you on? The assumed duplicate does not answer your question if you are on Windows, because fork is not a start method Windows supports. This means that copy-on-write does not even exist in Windows multiprocessing. I would give a better answer if the question is reopened, but basically you are *not* right in your assumption that `__init__` is being called again and again, although the data is still being transferred to each process. If you do not want that, you need to use `multiprocessing.managers`. – Charchit Agarwal Jul 04 '22 at 14:37
  • @Charchit I'm using Windows. I've read that multiprocessing isn't perfect on Windows, but I'm not an expert on why. VS Code allowed me to debug each individual process, and according to that, each process was calling `__init__` again and again. Constructing the class inside the `if __name__ == "__main__"` guard stops this and "solves" my problem. (Though I'm still not sure that I'm approaching this in the right way.) – Dan Kingswell Jul 06 '22 at 09:14
  • I had assumed you were constructing the class inside the `if __name__...` guard in the first place (I am actually not quite sure how *not* doing that did not raise an error). Whether you are doing it the right way depends on how performant you want your code to be. Like I said, the data is still being transferred from one process to another, and a new instance is being created every time that happens (without calling `__init__`; it instead calls `__new__` to create the instance with the data that was transferred). So if the data is large **and** you want to reduce the overhead, consider using managers like I pointed out. – Charchit Agarwal Jul 07 '22 at 09:53
  • @Charchit I was calling main with the guard but not constructing the class with it, as above. Thanks, I'll look at managers, as there is a lot of redundant data being passed around – Dan Kingswell Jul 08 '22 at 09:01
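
The point made in the comments, that passing a bound method such as `self.fit_model` to `pool.map` ships a copy of the whole instance without calling `__init__` again, can be checked in a single process using `pickle`, which is what `multiprocessing` uses to send tasks to workers. This is only a rough sketch: the `init_calls` counter, the toy data, and the averaging "model" are assumptions made for illustration and are not part of the original code.

```python
import pickle


class MapFitter:
    init_calls = 0  # hypothetical counter, only here to observe __init__

    def __init__(self, data):
        MapFitter.init_calls += 1
        self.data = data  # stand-in for the large data file

    def fit_model(self, column):
        # stand-in for real model fitting: just average the values
        return sum(column) / len(column)


fitter = MapFitter([[1, 2, 3], [4, 5, 6]])

# Pickling the bound method pickles the instance it is bound to,
# including self.data. This mimics what happens to each task that
# pool.map(fitter.fit_model, ...) sends to a worker.
payload = pickle.dumps(fitter.fit_model)
restored_method = pickle.loads(payload)
restored = restored_method.__self__

print(MapFitter.init_calls)            # __init__ ran once, for the original
print(restored is fitter)              # the restored instance is a new object
print(restored.data == fitter.data)    # but it carries a full copy of the data
```

If that copying is the bottleneck, one option is to make the fitting function a module-level function (or a `staticmethod`) so that only the individual data items are pickled, not the whole instance. Separately, constructing `MapFitter` inside the `if __name__ == "__main__":` guard keeps the module-level construction from being re-run when Windows worker processes import the main module.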
