So basically what I want is to run ML pipelines in parallel. I have been using scikit-learn, and I have decided to use `DaskGridSearchCV`.

What I have is a list of `gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator)` objects, and I run each of them sequentially:
```python
for gridSearchCV in grid_search_list:
    gridSearchCV.fit(train_data, train_target)
    predicted = gridSearchCV.predict(test_data)
```
If I have N different `GridSearch` objects, I want to take advantage of all the available resources as much as possible. If there are enough resources to run 2, 3, 4, ..., or N of them at the same time, I want to run them in parallel.
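To make the goal concrete, here is a minimal sketch of the scheduling I am after, using `concurrent.futures` from the standard library and a dummy stand-in for the search objects (my real code uses `DaskGridSearchCV`, and the names here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Dummy stand-in for a DaskGridSearchCV object: fit() and predict()
# just record calls, so the scheduling logic is what is being shown.
class DummySearch:
    def __init__(self, name):
        self.name = name
        self.fitted = False

    def fit(self, X, y):
        self.fitted = True
        return self

    def predict(self, X):
        return [self.name] * len(X)

def run_search(gs, train_data, train_target, test_data):
    gs.fit(train_data, train_target)
    return gs.predict(test_data)

searches = [DummySearch(f"gs{i}") for i in range(4)]
train_data, train_target, test_data = [0, 1], [0, 1], [0, 1, 2]

# Run all N searches concurrently; max_workers caps how many
# fits are in flight at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_search, gs, train_data, train_target, test_data)
               for gs in searches]
    predictions = [f.result() for f in futures]

# predictions[0] == ['gs0', 'gs0', 'gs0']
```

This is only the outer-level scheduling; the question is how to do this properly when each search is itself parallel under the hood.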
So I started trying a few things based on Dask's documentation. First I tried the `dask.threaded` and `dask.multiprocessing` schedulers, but it ends up being slower and I keep getting:

```
/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
```
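My understanding of the warning is that joblib refuses to start process-based workers from inside a thread, so each inner search silently falls back to `n_jobs=1`. One thing I considered is pinning the inner parallelism explicitly so that only the outer level parallelises. A minimal sketch of what I mean, assuming a plain scikit-learn `GridSearchCV` with a toy pipeline and grid (not my real ones):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for my real pipeline and grid.
pipeline = Pipeline([("clf", DecisionTreeClassifier(random_state=0))])
grid = {"clf__max_depth": [1, 2, 3]}

# n_jobs=1 keeps each inner search serial, so only the outer
# thread/loop level parallelises and joblib never tries to nest.
grid_search = GridSearchCV(pipeline, grid, n_jobs=1)

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5
grid_search.fit(X, y)
```

I am not sure whether that is the right trade-off, or whether the scheduling should live entirely on the Dask side instead.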
This is the code snippet:

```python
from dask import compute, delayed
import dask.threaded
from sklearn.model_selection import train_test_split

def run_pipeline(gs, data, target):
    train_data, test_data, train_target, expected = train_test_split(
        data, target, test_size=0.25, random_state=33)
    gs.fit(train_data, train_target)
    return gs.predict(test_data)

values = [delayed(run_pipeline)(gs, df, target) for gs in gs_list]
results = compute(*values, get=dask.threaded.get)
```
Maybe I am approaching this the wrong way. Do you have any suggestions for me?