
I'm trying to use joblib to make a custom random forest implementation train in parallel.

The task is embarrassingly parallel, so I assumed getting a speedup shouldn't be too hard with joblib.

Here's some sample code:

from joblib import Parallel, delayed

class RandomForest(object):
    def __init__(self, settings, data):
        self.forest = [None] * settings.n_trees
        # threading backend: threads share memory but are subject to the GIL
        self.parallel = Parallel(n_jobs=settings.njobs, backend="threading")

    def fit(self, data, train_ids_current_minibatch, settings, param, cache):
        self.forest = self.parallel(
            delayed(_parallel_build_trees_batch)(
                i_t, data, train_ids_current_minibatch, settings, param, cache)
            for i_t, tree in enumerate(self.forest))

    def partial_fit(self, data, train_ids_current_minibatch, settings, param, cache):
        self.forest = self.parallel(
            delayed(_parallel_build_trees_partial)(
                tree, i_t, data, train_ids_current_minibatch, settings, param, cache)
            for i_t, tree in enumerate(self.forest))

However, the training is much slower when using more than one job, in both the batch and the incremental case. The data and cache arguments are dicts containing (large) numpy arrays, so I'm wondering if that is the cause.

I've tried coding the same thing using multiprocessing.Pool and the results are even worse, as is using joblib without the threading backend; I assume that's because the fit functions make heavy use of numpy/scipy code.

Any ideas on how to debug/fix the slowdown?

Bar

1 Answer


Your analysis seems correct to me: the slowdown is caused by data and cache being large objects. In a multiprocessing environment you don't have shared memory, so you need to share these objects somehow. Python supports this via shared objects: there is a "main process" which really holds the object, and the other processes have to send all reads and updates through some mechanism (AFAIK the object is pickled and then sent via a pipe or queue), which slows everything down.
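Before restructuring anything, it's worth confirming that argument serialization really is the bottleneck. A quick, hypothetical check (assuming data and cache are your dicts of numpy arrays):

import pickle
import time

start = time.time()
payload = pickle.dumps((data, cache), protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.time() - start
# roughly this cost is paid on every dispatch to a worker process
print("pickled size: %.1f MB, time: %.2f s" % (len(payload) / 1e6, elapsed))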

I see some options for you:

  • transform your code so it uses partitions: I'm not familiar with random forests, but I guess every process starts with data as an initial data set and then tries to find an "optimum". If you could have process 1 find all "type A" optima and process 2 find all "type B" optima, and then let every process write its findings to disk in a file like rf_process_x.txt, you don't need a shared memory state at all
  • transform your code so it uses queues (see the last example in the Python multiprocessing docs, and the sketch after this list): if partitioning doesn't work, then maybe you could:
    1. start up n worker processes
    2. have every process build up its data set by itself (so it is not in shared memory)
    3. in the main process, put "jobs" into the task_queue, e.g. "find a random forest with this specific set of parameters". Each worker takes a job from the task_queue, computes it and puts its result on a result_queue. This is only fast if the tasks and results are small, as these objects need to be pickled and sent over a pipe from the parent process to the worker process.
  • use joblib's memmapping: joblib supports dumping the object to disk and then giving every process memory-mapped access to the same file (see the sketch after this list)
  • if your operation is not CPU bound (e.g. it does heavy disk or network I/O) you could move to multithreading. That way you really do have shared memory. But as far as I can see your work is CPU bound, so you will run into the GIL (in CPython, only one thread executes Python bytecode at a time)
  • you may find other ways of speeding up the random forest itself, e.g. this SO answer mentions a few
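Here is a minimal, self-contained sketch of the queue pattern from the second option (build_tree is a hypothetical stand-in for your real per-tree training function):

import multiprocessing as mp
import numpy as np

def build_tree(data, seed):
    # stand-in for the real tree training; what matters is that the
    # job (a seed) and the result (a small value) are cheap to pickle
    rng = np.random.RandomState(seed)
    return data[rng.randint(len(data), size=100)].mean()

def worker(task_queue, result_queue):
    data = np.random.rand(100000, 10)  # each worker builds its own data set
    for seed in iter(task_queue.get, 'STOP'):
        result_queue.put((seed, build_tree(data, seed)))

if __name__ == '__main__':
    n_workers, n_trees = 4, 16
    task_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for seed in range(n_trees):  # each job is just a small seed, so there
        task_queue.put(seed)     # is almost no pickling overhead per dispatch
    results = [result_queue.get() for _ in range(n_trees)]
    for _ in workers:
        task_queue.put('STOP')   # sentinel shuts each worker down
    for w in workers:
        w.join()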
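And a minimal sketch of the memmapping option (joblib.dump/joblib.load with mmap_mode are part of joblib's public API; tree_stat is a hypothetical stand-in for the per-tree work):

import numpy as np
from joblib import Parallel, delayed, dump, load

def tree_stat(shared_data, i_t):
    # stand-in for per-tree training; reads the read-only memory-mapped array
    return shared_data[i_t::10].mean()

if __name__ == '__main__':
    data = np.random.rand(int(1e7))
    dump(data, 'data.joblib')                         # write the big array once
    shared_data = load('data.joblib', mmap_mode='r')  # map it, don't copy it
    results = Parallel(n_jobs=4)(
        delayed(tree_stat)(shared_data, i_t) for i_t in range(10))

As far as I know, newer joblib versions can also memmap large numpy array arguments automatically (see the max_nbytes argument of Parallel).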
hansaplast