So basically what I want is to run ML pipelines in parallel. I have been using scikit-learn, and I have decided to use `DaskGridSearchCV`.

What I have is a list of `gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator)` objects, and I run each of them sequentially:
```python
for gridSearchCV in grid_search_list:
    gridSearchCV.fit(train_data, train_target)
    predicted = gridSearchCV.predict(test_data)
```
If I have N different `GridSearch` objects, I want to take advantage of all the available resources as much as possible. If there are enough resources to run 2, 3, 4, ..., or N of them at the same time, I want to run them in parallel.
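To make the goal concrete, here is a minimal sketch of the scheduling I am after, using `concurrent.futures` from the standard library and a dummy stand-in for the search objects (my real code uses `DaskGridSearchCV`, and the names here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Dummy stand-in for a DaskGridSearchCV object: fit() and predict()
# just record calls, so the scheduling logic is what is being shown.
class DummySearch:
    def __init__(self, name):
        self.name = name
        self.fitted = False

    def fit(self, X, y):
        self.fitted = True
        return self

    def predict(self, X):
        return [self.name] * len(X)

def run_search(gs, train_data, train_target, test_data):
    gs.fit(train_data, train_target)
    return gs.predict(test_data)

searches = [DummySearch(f"gs{i}") for i in range(4)]
train_data, train_target, test_data = [0, 1], [0, 1], [0, 1, 2]

# Run all N searches concurrently; max_workers caps how many
# fits are in flight at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_search, gs, train_data, train_target, test_data)
               for gs in searches]
    predictions = [f.result() for f in futures]

# predictions[0] == ['gs0', 'gs0', 'gs0']
```

This is only the outer-level scheduling; the question is how to do this properly when each search is itself parallel under the hood.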
So I started trying a few things based on Dask's documentation. First I tried the `dask.threaded` and `dask.multiprocessing` schedulers, but it ends up being slower and I keep getting:

```
/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
```
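My understanding of the warning is that joblib refuses to start process-based workers from inside a thread, so each inner search silently falls back to `n_jobs=1`. One thing I considered is pinning the inner parallelism explicitly so that only the outer level parallelises. A minimal sketch of what I mean, assuming a plain scikit-learn `GridSearchCV` with a toy pipeline and grid (not my real ones):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for my real pipeline and grid.
pipeline = Pipeline([("clf", DecisionTreeClassifier(random_state=0))])
grid = {"clf__max_depth": [1, 2, 3]}

# n_jobs=1 keeps each inner search serial, so only the outer
# thread/loop level parallelises and joblib never tries to nest.
grid_search = GridSearchCV(pipeline, grid, n_jobs=1)

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5
grid_search.fit(X, y)
```

I am not sure whether that is the right trade-off, or whether the scheduling should live entirely on the Dask side instead.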
This is the code snippet:

```python
from dask import compute, delayed
import dask.threaded
from sklearn.model_selection import train_test_split

def run_pipeline(gs, data, target):
    train_data, test_data, train_target, expected = train_test_split(
        data, target, test_size=0.25, random_state=33)
    gs.fit(train_data, train_target)
    return gs.predict(test_data)

values = [delayed(run_pipeline)(gs, df, target) for gs in gs_list]
results = compute(*values, get=dask.threaded.get)
```
Maybe I am approaching this the wrong way. Do you have any suggestions for me?