I have a class defined and used as follows, that includes a method called in the context of a multiprocessing pool:
from multiprocessing import Pool
class MyClass:
def __init__(self, big_object):
self.big_object = big_object
def compute_one(self, myvar):
return self.big_object.do_something_with(myvar)
def compute_all(self, myvars):
with Pool() as pool:
results = pool.map(compute_one, myvars)
return results
big_object = pd.read_csv('very_big_file.csv')
cls = MyClass(big_object)
vars = ['foo', 'bar', 'baz']
all_results = cls.compute_all(vars)
This "works" but the issue is that big_object
takes several Gbs in RAM, and with the above code, this big object is copied in RAM for every process launched by multiprocessing.Pool
.
How to modify the above code so that big_object
would be shared among all processes, but still able to be defined at class instantiation?
-- EDIT
Some context into what I'm trying to do might help identify a different approach altogether.
Here, big_object
is a 1M+ rows Pandas dataframe, with tens of columns. And compute_one
computes statistics based on a particular column.
A simplified high-level view would be (extremely summarized):
- take one of the columns (e.g.
current_col = manufacturer
) - for each
rem_col
of the remaining categorical columns:- compute a
big_object.groupby([current_col, rem_col]).size()
- compute a
The end result of this would look like:
manufacturer | country_us | country_gb | client_male | client_female |
---|---|---|---|---|
bmw | 15 | 18 | 74 | 13 |
mercedes | 2 | 24 | 12 | 17 |
jaguar | 48 | 102 | 23 | 22 |
So in essence, this is about computing statistics for every column, on every other column, over the entire source dataframe.
Using multiprocessing here allows, for a given current_col
, to compute all the statistics on remaining_cols
in parallel. Among these statistics are not only sum
but also mean
(for remaining numerical columns).
A dirty approach using a global variable instead (a global big_object
instantiated from outside the class), takes the entire running time from 5+ hours to about 20 minutes. My only issue is that I'd like to avoid this global object approach.