I'm trying to use joblib to make a custom random forest implementation train in parallel.
The task is embarrassingly parallel, so I assumed getting a speedup shouldn't be too hard with joblib.
Here's some sample code:
from joblib import Parallel, delayed


class RandomForest(object):
    def __init__(self, settings, data):
        self.forest = [None] * settings.n_trees
        # threading backend: the tree-building code is numpy-heavy and should
        # release the GIL, so threads can run concurrently
        self.parallel = Parallel(n_jobs=settings.njobs, backend="threading")

    def fit(self, data, train_ids_current_minibatch, settings, param, cache):
        # batch mode: build every tree from scratch on the current minibatch
        self.forest = self.parallel(
            delayed(_parallel_build_trees_batch)(
                i_t, data, train_ids_current_minibatch, settings, param, cache)
            for i_t, tree in enumerate(self.forest))

    def partial_fit(self, data, train_ids_current_minibatch, settings, param, cache):
        # incremental mode: update each existing tree with the new minibatch
        self.forest = self.parallel(
            delayed(_parallel_build_trees_partial)(
                tree, i_t, data, train_ids_current_minibatch, settings, param, cache)
            for i_t, tree in enumerate(self.forest))
However, training is much slower with more than one job, in both the batch and incremental cases. The data and cache arguments are dicts containing (large) numpy arrays, so I'm wondering if passing those around is the cause.
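If handing over those dicts turns out to be the problem (it would mainly matter for the process-based attempts below), one workaround that seems possible is memory-mapping the large arrays with joblib so workers get a lightweight read-only view instead of a pickled copy. Rough sketch, with made-up keys and a placeholder path:

import numpy as np
from joblib import dump, load

# placeholder dicts standing in for the real data/cache (keys are made up)
data = {'x_train': np.random.rand(100000, 50)}
cache = {'stats': np.random.rand(50, 1000)}

# persist the big arrays once, then reload them as read-only memmaps so that
# worker processes receive a small memmap handle rather than a full copy
dump(data['x_train'], '/tmp/x_train.joblib')
data['x_train'] = load('/tmp/x_train.joblib', mmap_mode='r')

I'm not sure this would help with the threading backend, though, since threads already share memory.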
I've also tried implementing the same thing with multiprocessing.Pool, and the results are even worse, as they are when I switch away from joblib's threading backend; I assume that's because the fit functions spend most of their time in numpy/scipy code, which releases the GIL and so should parallelize well with threads.
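A small standalone benchmark along these lines (the dummy workload is just a stand-in for the real tree-building functions, not my actual code) is what I'd use to check whether the threading backend gives any speedup at all for numpy-heavy work on this machine:

import time
import numpy as np
from joblib import Parallel, delayed

def dummy_tree_build(i, x):
    # numpy-heavy stand-in for _parallel_build_trees_batch: the BLAS/LAPACK
    # calls below release the GIL, so threads can actually run in parallel
    return np.linalg.svd(x @ x.T, compute_uv=False)[0]

x = np.random.rand(800, 200)

for n_jobs in (1, 2, 4):
    start = time.time()
    Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(dummy_tree_build)(i, x) for i in range(16))
    print(n_jobs, "job(s):", round(time.time() - start, 3), "s")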
Any ideas on how to debug/fix the slowdown?