I'm using the multiprocessing python library to run in parallel feature selection for a machine learning problem. This function accepts as input a pandas dataframe and returns some figures.
When I execute this function using mp.pool.map()
everything runs smoothly. However, if I substitute it with mp.pool.ThreadPool.map()
it fails with this error:
AssertionError: Number of manager items must equal union of block items
# manager items: 15, # tot_items: 20
.
Strangely, I was running the ThreadPool
code normally till yesterday. Then, I tried to re-run it and started getting these errors. I need ThreadPool
since this is an IO bound job and it was running much faster compared to pool
.
EDIT: The code goes like that (python 2.7):
import multiprocessing as mp
import pandas as pd (version 0.22.0)
def main_functionality(df, params):
df = df[params['feature']]
#Run 5-fold cross-validation
data_df = pd.DataFrame(....)
pred_df = pred_df.append(data_df)
return statistics from pred_df
def a_function(df_init, feature, params_init):
params = dict(params_init)
df = df_init.copy()
params['feature'] = feature
try:
results = main_functionality(df, params)
except:
results = (0,0,0)
return results
def b_function(df, features):
pool = mp.pool.ThreadPool(4)
params = {...}
results = pool.map(a_function,(df, feature, params) for f in features))
results_df = pd.DataFrame(results)
results_df.to_csv(...)
if __name__ == '__main__':
df = read.csv(...) # A big CSV file (i.e. few GBs)
features = [i for i in df.columns if i ....]
b_function(df, features)