Production-grade Machine Learning pipelines keep their CPU utilisation up near the ceiling, almost 24 / 7 / 365. Check both the CPU% and also the other resources' state figures across the whole node.
"What is my mistake?"

Having read your profile was a stunning moment, Sir:

"I am a computer scientist specializing on algorithms and data analysis by training, and a generalist by nature. My skill set combines a strong scientific background with experience in software architecture and development, especially on solutions for the analysis of big data. I offer consulting and development services and I am looking for challenging projects in the area of data science."
The problem IS deeply determined by respecting the elementary Computer Science + algorithmic rules.
The problem IS NOT demanding any strong scientific background, but common sense.
The problem IS NOT any especially Big Data, but requires one to smell how the things actually work.

Facts
or
Emotions? ... that's The Question! ( The tragedy of Hamlet, Prince of Denmark )
May I be honest? Let's prefer FACTS, always:
Step #1:
Never hire, or straight fire, each and every Consultant who does not respect facts ( the answer referred to above did not suggest anything, the less did it grant any promises ). Ignoring facts might be a "successful sin" in the PR / MARCOM / Advertisement / media businesses ( in case The Customer tolerates such dishonesty and/or a manipulative habit ), but not in scientifically fair quantitative domains. This is unforgivable.
Step #2:
Never hire, or straight fire, each and every Consultant who claims experience in software architecture, especially on solutions for ... big data, but pays zero attention to the accumulated lump sum of all the add-on overhead costs that each of the respective elements of the system architecture is going to introduce, once the processing starts to go distributed across some pool of hardware and software resources. This is unforgivable.
Step #3:
Never hire, or straight fire, each and every Consultant who turns passive-aggressive once the facts do not fit her/his wishes, and who starts to accuse another knowledgeable person, who has already delivered a helping hand, to rather "improve ( their ) communication skills", instead of learning from the mistake(s). Sure, skill may help to express the obvious mistakes in some other way, yet the gigantic mistakes will remain gigantic mistakes, and each and every scientist, being fair to her/his scientific title, should NEVER resort to attacks on a helping colleague, but rather start searching for the root cause of the mistakes, one after the other. This ---
@sascha ... May I suggest you take little a break from stackoverflow to cool off, work a little on your interpersonal communication skills
--- was nothing but a straight, intellectually unacceptable, nasty foul against @sascha.
Next, the toys
The architecture, resources and process-scheduling facts that matter:
The imperative form of the syntax-constructor ignites an immense amount of activities to start:
joblib.Parallel( n_jobs = <N> )( joblib.delayed( <aFunction> )
                                               ( <anOrderedSetOfFunParameters> )
                                 for           ( <anOrderedSetOfIteratorParams> )
                                 in              <anIterator>
                                 )
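For a very first touch, a minimal, concrete instance of this template may look like this ( the payload function an_illustrative_FUN() and its parameters are illustrative only, not a part of the joblib API ):

import joblib                                       # pip install joblib

def an_illustrative_FUN( aParam ):                  # a trivial, illustrative payload
    return aParam * aParam

aListOfRESULTs = joblib.Parallel( n_jobs = 2        # 2 worker-processes get spawned
                                  )( joblib.delayed( an_illustrative_FUN )( aParam )
                                     for aParam in range( 8 )
                                     )
print( aListOfRESULTs )                             # [0, 1, 4, 9, 16, 25, 36, 49]
                                                    # ( on Windows, wrap the call in
                                                    #   an if __name__ == '__main__': guard )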
To at least guess what happens, a scientifically fair approach would be to test several representative cases, benchmark their actual execution, collect quantitatively supported facts, and draw a hypothesis on a model of behaviour and on its principal dependencies on the CPU_core-count, on the RAM-size, on the <aFunction>-complexity and the resources-allocation envelopes, etc.
Test-case A:
def a_NOP_FUN( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is indeed to do nothing at all,
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    pass

##############################################################
### A NAIVE TEST BENCH
##############################################################
import joblib                                  # the tool under test
from zmq import Stopwatch; aClk = Stopwatch()  # a [us]-resolution clock

JOBS_TO_SPAWN =  4                   # TUNE:  1,  2,  4,   5,  10, ..
RUNS_TO_RUN   = 10                   # TUNE: 10, 20, 50, 100, 200, 500, 1000, ..

try:
    aClk.start()
    joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                     )( joblib.delayed( a_NOP_FUN )
                                      ( aSoFunPAR )
                        for          (  aSoFunPAR )
                        in       range( RUNS_TO_RUN )
                        )
except:
    pass
finally:
    try:
        _ = aClk.stop()
    except:
        _ = -1
        pass
print( "CLK:: {0:_>24d} [us] @{1: >3d} run{2: >5d} RUNS".format( _,
                                                                 JOBS_TO_SPAWN,
                                                                 RUNS_TO_RUN
                                                                 )
       )
Having collected representatively enough data on this NOP-case, over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, one generates at least some first-hand experience of the actual system costs of launching the intrinsically empty-processes' overhead workloads, related to the imperatively instructed joblib.Parallel(...)( joblib.delayed(...) )-syntax constructor, spawning into the system-scheduler just a few joblib-managed a_NOP_FUN() instances.
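A brute-force sweep over such a 2D-landscape may look like this minimal sketch ( re-using the a_NOP_FUN() from above; the grid values are illustrative, not prescriptive ):

import joblib
from zmq import Stopwatch

aClk = Stopwatch()

for JOBS_TO_SPAWN in ( 1, 2, 4, 8 ):             # an illustrative grid
    for RUNS_TO_RUN in ( 10, 50, 100, 500 ):     # an illustrative grid
        aClk.start()
        joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                         )( joblib.delayed( a_NOP_FUN )( aSoFunPAR )
                            for aSoFunPAR in range( RUNS_TO_RUN )
                            )
        aUS = aClk.stop()
        print( "CLK:: {0:_>24d} [us] @{1: >3d} jobs{2: >5d} RUNS".format(
                aUS, JOBS_TO_SPAWN, RUNS_TO_RUN ) )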
Let's also agree that all the real-world problems, the Machine Learning models included, are way more complex tools than the just-tested a_NOP_FUN(), while in both cases you have to pay the already benchmarked overhead costs ( even if they were paid for getting literally zero product ).
Thus a scientifically fair, rigorous work will follow from this simplest-ever case, which already shows the benchmarked costs of all the associated setup-overheads, the smallest-ever joblib.Parallel() penalty sine qua non, forward into a direction where the real-world algorithms live - best with next adding some larger and larger "payload"-sizes into the testing loop:
Test-case B:
def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    a MEM-allocation,
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 100                 # here, feel free to be as keen as needed,
                                    # yet 1000**4 * 8 [B] ~ 8 [TB], so tune the
                                    # ceiling so as to still fit into your RAM
    aMemALLOC = np.zeros( ( SIZE1D, #   so as to set
                            SIZE1D, #   realistic ceilings,
                            SIZE1D, #   as how big the "Big Data"
                            SIZE1D  #   may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    return aMemALLOC[2:3,3,4,5]
Again, collect representatively enough quantitative data about the costs of the actual remote-process MEM-allocations, by running a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR() over some reasonably wide landscape of the SIZE1D-scaling, and again over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, so as to touch a new dimension of the performance scaling, under an extended black-box PROCESS_under_TEST experimentation inside the joblib.Parallel() tool, leaving its magics yet un-opened.
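A sketch of such a SIZE1D-scaling sweep ( an illustrative deviation from the code above: here the so-far never consumed parameter is re-purposed to carry the SIZE1D, so that a single worker-function can serve the whole landscape ):

import joblib
from zmq import Stopwatch

def a_MEM_ALLOCATOR_FUN( aSize1D ):                 # an illustrative variant,
    import numpy as np                              # the parameter now carries SIZE1D
    return np.zeros( ( aSize1D, aSize1D, aSize1D, aSize1D ),
                     dtype = np.float64,
                     order = 'F'
                     )[2:3,3,4,5]

aClk = Stopwatch()

for SIZE1D in ( 10, 25, 50 ):                       # ~ 80 [kB] .. ~ 50 [MB] per process
    for JOBS_TO_SPAWN in ( 1, 2, 4 ):
        aClk.start()
        joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                         )( joblib.delayed( a_MEM_ALLOCATOR_FUN )( SIZE1D )
                            for _ in range( 10 )    # RUNS_TO_RUN fixed at 10 here
                            )
        aUS = aClk.stop()
        print( "CLK:: {0:_>24d} [us] @ SIZE1D = {1:>3d}, n_jobs = {2:>2d}".format(
                aUS, SIZE1D, JOBS_TO_SPAWN ) )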
Test-case C:
def a_NOP_FUN_WITH_SOME_MEM_DATAFLOW( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    a MEM-allocation plus some Data MOVs,
    so as to be able to benchmark
    all the process-instantiation + MEM OPs
    add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 100                 # here, feel free to be as keen as needed,
                                    # yet 1000**4 * 8 [B] ~ 8 [TB], so tune the
                                    # ceiling so as to still fit into your RAM
    aMemALLOC = np.ones( ( SIZE1D,  #   so as to set
                           SIZE1D,  #   realistic ceilings,
                           SIZE1D,  #   as how big the "Big Data"
                           SIZE1D   #   may indeed grow into
                           ),
                         dtype = np.float64,
                         order = 'F'
                         )          # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:] *= 0.1234567             # .MOV - an element-wise scaling
    aMemALLOC[:,3,4,:] += aMemALLOC[4,5,:,:]    # .MOV - cross-slice additions
    aMemALLOC[2,:,4,:] += aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:] += aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5] += aMemALLOC[4,:,:,7]

    return aMemALLOC[2:3,3,4,5]
Bang! The Architecture-related issues start to show up:

One may soon notice that not only does the static sizing matter, but also the MEM-transport BANDWIDTH ( hardware-hardwired ) starts to cause problems, as moving data from/to the CPU into/from the MEM costs well ~ 100 .. 300 [ns], way more than any smart shuffling of a few bytes "inside" the { CPU_core_private | CPU_core_shared | CPU_die_shared }-cache hierarchy-architecture alone ( and any non-local NUMA-transfer exhibits the same order of magnitude of add-on pain ).
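A quick, low-cost way to feel this bandwidth-wall is a naive sketch ( not a rigorous memory benchmark; the smallest sizes are dominated by the numpy call overhead ) that times one and the same element-wise OP on growing array sizes and watches the per-element cost jump, once the data stops fitting into the cache hierarchy:

import numpy as np
from zmq import Stopwatch

aClk = Stopwatch()

for aSizeEXP in range( 10, 27, 2 ):          # 2**10 .. 2**26 float64-s ( 8 [kB] .. 512 [MB] )
    aVec = np.ones( 1 << aSizeEXP, dtype = np.float64 )
    aClk.start()
    aVec *= 1.0000001                        # a memory-bound, element-wise OP
    aUS  = aClk.stop()
    print( "{0:>12d} elements ~ {1:>10d} [us] ~ {2:>8.3f} [ns/element]".format(
            aVec.size, aUS, 1E3 * aUS / aVec.size ) )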

All the above Test-cases have not asked for much effort from the CPU yet

So let's start to burn the oil!

If all the above was fine for starting to smell how the things under the hood actually work, this will now grow to become ugly and dirty.
Test-case D:
def a_CPU_1_CORE_BURNER_FUN( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    add some CPU-load
    to a MEM-allocation plus some Data MOVs,
    so as to be able to benchmark
    all the process-instantiation + MEM OPs
    add-on overhead costs.
    """
    import math                     # math.factorial() wants a plain int argument
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 100                 # here, feel free to be as keen as needed,
                                    # yet 1000**4 * 8 [B] ~ 8 [TB], so tune the
                                    # ceiling so as to still fit into your RAM
    aMemALLOC = np.ones( ( SIZE1D,  #   so as to set
                           SIZE1D,  #   realistic ceilings,
                           SIZE1D,  #   as how big the "Big Data"
                           SIZE1D   #   may indeed grow into
                           ),
                         dtype = np.float64,
                         order = 'F'
                         )          # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:] *= 0.1234567
    aMemALLOC[:,3,4,:] += aMemALLOC[4,5,:,:]
    aMemALLOC[2,:,4,:] += aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:] += aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5] += aMemALLOC[4,:,:,7]
                                    # a pure CPU-burner: ~ 2000+ big-int factorials;
                                    # the // floor-division keeps the big-ints exact
    aMemALLOC[:,:,:,:] += ( [ math.factorial( x + int( aMemALLOC[-1,-1,-1,-1] ) )
                              for x in range( 1005 )
                              ][-1]
                         // [ math.factorial( y + int( aMemALLOC[ 1, 1, 1, 1] ) )
                              for y in range( 1000 )
                              ][-1]
                            )
    return aMemALLOC[2:3,3,4,5]

Still nothing extraordinary, compared to the common grade of payloads in the domain of the Machine Learning many-D-spaces, where all the dimensions of the { aMlModelSPACE, aSetOfHyperParameterSPACE, aDataSET }-state-space impact the scope of the processing required ( some having O( N ), some other O( N.logN ) complexity ), and where, if well engineered-in, more than just one CPU_core almost immediately gets harnessed, even on a single "job" being run.
An indeed nasty smell starts once naive ( read: resources-usage un-coordinated ) CPU-load mixtures get down the road, and once the mixes of the task-related CPU-loads start to fight with the naive ( read: resources-usage un-coordinated ) O/S-scheduled processes for the common ( resorted to just a naive shared-use policy ) resources - i.e. the MEM ( introducing SWAPs as HELL ), the CPU ( introducing cache-misses and MEM re-fetches, yes, with the SWAPs' penalties added ), not speaking about paying any of the ~ 15+ [ms] latency-fees, if one forgets and lets a process touch a fileIO-device ( a 5 (!)-orders-of-magnitude slower + shared + pure-[SERIAL], by nature, device ). No prayers help here ( SSDs included, being just a few orders of magnitude less painful, but still a hell to share, plus running such a device incredibly fast into its wear + tear grave ).
What happens, if all the spawned processes do not fit into the physical RAM?

Virtual-memory paging and swapping start to literally deteriorate the rest of the so-far somehow "just"-by-coincidence-( read: weakly-co-ordinated )-[CONCURRENTLY]-scheduled processing ( read: further decreasing the individual PROCESS-under-TEST performance ).

Things may soon wreak havoc, if not under due control & supervision.
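A cheap sanity pre-check may prevent entering the swapping-hell at all ( a sketch, assuming each of the spawned processes will allocate about the same block as the PROCESS-under-TEST above does; the names and the 10% safety margin are illustrative ):

import psutil
import numpy as np

SIZE1D        = 100                # the same illustrative scaling as above
JOBS_TO_SPAWN = 4

aPerProcALLOC = SIZE1D ** 4 * np.dtype( np.float64 ).itemsize   # [B] per process
aProjectedUSE = JOBS_TO_SPAWN * aPerProcALLOC                   # [B] in total
aFreePhysRAM  = psutil.virtual_memory().available               # [B] right now

if aProjectedUSE > 0.9 * aFreePhysRAM:                          # a naive 10% safety margin
    print( "WARNING: ~{0:.1f} GB projected, but just ~{1:.1f} GB available -> expect SWAPs".format(
            aProjectedUSE / 1E9, aFreePhysRAM / 1E9 ) )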
Again - facts matter: a light-weight resources-monitor class may help. Its aResRECORDER.show_usage_since0() method returns:
ResCONSUMED[T0+ 166036.311 ( 0.000000 )]
           user= 2475.15
           nice=    0.36
         iowait=    0.29
            irq=    0.00
        softirq=    8.32
 stolen_from_VM=   26.95
guest_VM_served=    0.00
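A minimal sketch of such a class ( an illustrative construction, built on psutil.cpu_times(); psutil's steal / guest fields stand behind the stolen_from_VM / guest_VM_served labels above ):

import time
import psutil                       # pip install psutil

class ResRECORDER:
    """ a light-weight recorder: snapshot the CPU-times at T0,
        report the deltas consumed since T0 """
    def __init__( self ):
        self.t0   = time.monotonic()
        self.cpu0 = psutil.cpu_times()

    def show_usage_since0( self ):
        cpu = psutil.cpu_times()
        print( "ResCONSUMED[T0+{0:>11.3f}]".format( time.monotonic() - self.t0 ) )
        for aField in ( "user", "nice", "iowait", "irq", "softirq", "steal", "guest" ):
            try:                    # not all fields exist on all O/S platforms
                print( "{0:>15s}= {1:>8.2f}".format(
                        aField, getattr( cpu,       aField )
                              - getattr( self.cpu0, aField ) ) )
            except AttributeError:
                pass

Instantiating aResRECORDER = ResRECORDER() right before the joblib.Parallel() call and calling aResRECORDER.show_usage_since0() right after it brackets the whole PROCESS-under-TEST with facts.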
Similarly, a bit richer-constructed resources-monitor may report a wider O/S context, so as to see where additional resource stealing / contention / race-conditions deteriorate the actually achieved process-flow:
>>> import os, psutil
>>> psutil.Process( os.getpid()
                    ).memory_full_info()   # a per-process view
          ( rss       =     9428992,
            vms       =   158584832,
            shared    =     3297280,
            text      =     2322432,
            lib       =           0,
            data      =     5877760,
            dirty     =           0
            )
>>> psutil.virtual_memory()                # a system-wide view
          ( total     = 25111490560,
            available = 24661327872,
            percent   =         1.8,
            used      =  1569603584,
            free      = 23541886976,
            active    =   579739648,
            inactive  =   588615680,
            buffers   =           0,
            cached    =  1119440896
            )
>>> psutil.swap_memory()                   # a system-wide view
          ( total     =  8455712768,
            used      =   967577600,
            free      =  7488135168,
            percent   =        11.4,
            sin       = 500625227776,
            sout      = 370585448448
            )
Wed Oct 19 03:26:06 2017
   166.445 ___VMS______________Virtual Memory Size MB
    10.406 ___RES____Resident Set Size non-swapped MB
     2.215 ___TRS________Code in Text Resident Set MB
    14.738 ___DRS________________Data Resident Set MB
     3.305 ___SHR_______________Potentially Shared MB
     0.000 ___LIB_______________Shared Memory Size MB
         0 x_________________Number of dirty pages
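For the record, a one-shot report like the one above takes just a few lines ( a sketch, assuming a Linux platform, where psutil's memory_full_info() exposes all the fields shown; the function name is illustrative ):

import os, time
import psutil

def report_process_mem_footprint():
    """ print a per-process memory footprint report, in MB """
    aMFI = psutil.Process( os.getpid() ).memory_full_info()
    print( time.ctime() )
    print( "{0:>10.3f} ___VMS______________Virtual Memory Size MB".format( aMFI.vms    / 2**20 ) )
    print( "{0:>10.3f} ___RES____Resident Set Size non-swapped MB".format( aMFI.rss    / 2**20 ) )
    print( "{0:>10.3f} ___TRS________Code in Text Resident Set MB".format( aMFI.text   / 2**20 ) )
    print( "{0:>10.3f} ___DRS________________Data Resident Set MB".format( aMFI.data   / 2**20 ) )
    print( "{0:>10.3f} ___SHR_______________Potentially Shared MB".format( aMFI.shared / 2**20 ) )
    print( "{0:>10.3f} ___LIB_______________Shared Memory Size MB".format( aMFI.lib    / 2**20 ) )
    print( "{0:>10d} x_________________Number of dirty pages".format( aMFI.dirty ) )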
Last, but not least: why can one easily pay more than one earns in return?

Besides the gradually built records of evidence of how the real-world system-deployment add-on overheads accumulate the costs, the recently re-formulated Amdahl's Law, extended so as to cover both the add-on overhead-costs and the "process-atomicity" of the further-indivisible parts' sizing, defines a maximum add-on costs threshold that might be reasonably paid, if some distributed processing is to provide any >= 1.00 speedup of the computing process at all.
Dis-obeying the explicit logic of the re-formulated Amdahl's Law causes a process to proceed worse than if it had been processed in a pure-[SERIAL] process-schedule ( and sometimes the results of poor design and/or operations practices may look as if it were a case when a joblib.Parallel()( joblib.delayed(...) ) method "blocks the process" ).