I have a Python class something like the following:

    from multiprocessing import Pool

    class MapFitter:
        def __init__(self, path):
            # load a big data file
            self.data = load_large_data_file(path)

        def fit_model(self, data):
            # fit some model to some data
            model.fit(data)
            return model

        def main(self):
            # fit models to everything in self.data
            with Pool() as p:
                models = p.map(self.fit_model, self.data)
            return models

And then run it with the following:

    fitter = MapFitter(path)
    if __name__ == "__main__":
        models = fitter.main()

So basically `self.data` contains a load of variables that each need a model fitted to them, and this aims to parallelise that process. My question is: when `p.map` is used in `main`, is a new class instance created in each worker, with `__init__` called again every time? I can't seem to find an answer to this question, but when I use this on data that is not stored locally it runs much slower, and I can see in Task Manager that the worker processes spin up and then start hitting the network, which suggests that each of them is reloading the data.
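
One way to check this directly is to log the process ID inside `__init__`. The following is a minimal sketch rather than my real code: `ProbeFitter` is a hypothetical stand-in for `MapFitter`, a small list of integers replaces the output of `load_large_data_file`, and doubling a number replaces the model fit. If the message from `__init__` is printed once, only the parent process ran the constructor; if it is printed again with other process IDs, the workers are running it too.

```python
import os
from multiprocessing import Pool


class ProbeFitter:
    """Hypothetical stand-in for MapFitter that reports where it runs."""

    def __init__(self, data):
        # Record and report which process executes the constructor.
        self.init_pid = os.getpid()
        print(f"__init__ called in process {self.init_pid}")
        self.data = data

    def fit_model(self, item):
        # Placeholder for the real fit: return the PID of the process
        # doing the work together with a dummy result.
        return os.getpid(), item * 2

    def main(self):
        # Two workers are enough to see whether __init__ runs again.
        with Pool(2) as p:
            return p.map(self.fit_model, self.data)


if __name__ == "__main__":
    fitter = ProbeFitter([1, 2, 3, 4])
    results = fitter.main()
    print(f"parent process: {os.getpid()}")
    for pid, value in results:
        print(f"worker {pid} returned {value}")
```

As written, with the instance created inside the `if __name__ == "__main__":` guard, I would expect a single `__init__` message. Moving the `ProbeFitter(...)` line above the guard mirrors how I run `MapFitter`; on platforms where workers are started by spawning a fresh interpreter (the default on Windows), the main module is imported again in each worker, so any module-level code runs there as well.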
(I am new to multiprocessing so please suggest a better way to do this if I'm doing something obviously wrong)