Q :
"... much slower than usual serial code with loops.
Could anyone help with this issue?"
A :
Yes, we can help - but first, let's agree on what the issue actually is.
TL;DR - long, but important ...

We will work using reproducible & evidence-based argumentation.
WORK PLAN - where is the PERFORMANCE gained or lost :
- measure the pure-[SERIAL] code as-is
- measure the pure-[SERIAL] code after its performance was tuned up
- measure the pure-[SERIAL] code once called using multiprocessing
1) the as-is code took 22,704,803 [us] ~ 22.7 sec ( ~ 5.375 sec on your localhost )
>>> from zmq import Stopwatch; aClk = Stopwatch()             # a [us]-granular clock
>>> results = []                                              # a collector for per-row counts
>>>
>>> aClk.start(); _ = [ results.append( howmany_within_range( row,
...                                                           minimum = 4,
...                                                           maximum = 8 )
...                                     ) for row in data ]; aClk.stop()
22704803 [us]
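( For completeness - a minimal sketch of what the serial baseline is assumed to look like here. The names howmany_within_range and data come from the question's tutorial-style code, while the sizes are an assumption, so adjust them to match your actual inputs : )

import numpy as np                                            # used here only to generate test data

np.random.seed( 0 )
data = np.random.randint( 0, 10,                              # small integer values
                          size = [ int( 1E7 ), 5 ]            # ~ 1E7 rows, 5 columns ( an assumption )
                          ).tolist()                          # back to a plain Python list-of-lists

def howmany_within_range( row, minimum, maximum ):
    """Count how many numbers in a row fall inside [ minimum, maximum ]."""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count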
2) the improved code took 1,877,015 [us] ~ 1.88 sec of numpy-code, i.e. ~ 12x faster, and so it will also be on your localhost
The Python Interpreter has always been known to be slow on looping. We can do better than that. The 1E7-large, outer, list-iterator-based loop is a very frequent school-book example of an anti-pattern, and the row-wise internal loop inside howmany_within_range() is twice as bad an anti-pattern ( besides paying 1E7-times the call-signature processing costs ( passing data + decoding parameters ), the row-wise for-loop iterator is, again, a slow-motion syntax constructor, here repeated 1E7-times - nothing to enjoy or celebrate; perhaps the book's author was enthusiastic about efficiency-devastating coding practices, in which case, and only in that case, a performance-devastator badge ought to be awarded for such a few-SLOC performance anti-pattern ).
If we take the time to understand what the as-is code actually calculates, we can, right here, improve the performance ~ 12x just by not losing any time on non-productive steps.
>>> import numpy as np
>>> from zmq import Stopwatch; aClk = Stopwatch()             # a [us]-granular clock
>>>
>>> arr = np.asarray( data )                                   # the same data, now as a 2D ndarray
>>> aClk.start(); _ = np.where( ( arr >= 4 ) * ( arr <= 8 ),  # if condition met
...                             1,                             # set cell 1
...                             0                              # else    0
...                             ).sum( axis = 1                # .sum() row-wise
...                                    ); aClk.stop()
1877015 [us]
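( If in doubt whether the vectorised one-liner computes the same results as the original loops do, a quick sanity check - using the data / arr / howmany_within_range names sketched above, which are assumptions, so adapt them to your code - may look like this : )

>>> fast = np.where( ( arr >= 4 ) * ( arr <= 8 ), 1, 0 ).sum( axis = 1 )
>>> slow = [ howmany_within_range( row, minimum = 4, maximum = 8 ) for row in data ]
>>> bool( np.array_equal( fast, np.asarray( slow ) ) )        # ought to be True
True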
3) measure the pure-[SERIAL] code once called using multiprocessing
If one has successfully understood step 2), then step 3) is many times worse still, due to another factor: by "upgrading" the already awfully bad mix of triple-inefficient, iterator-driven looping seen above to a next, many orders of magnitude less efficient, level of inefficiency, we now have to pay, 1E7-times, the ( perhaps unexpectedly ) expensive costs of passing a set of parameters ( here at least small in RAM-size & easy to process, in serialisation-complexity terms ) towards the "remote"-worker processes, each and every time a call gets dispatched to do so.
1E7-times ... (!)
That happens at a cost of a sum of add-on [TimeDOMAIN] and [SpaceDOMAIN] costs of:
- SER: a pickle.dumps()-SER-ialiser RAM-allocate / RAM-copy / CPU-process step
- XFER: acquiring and using the O/S tools for process-to-process communication ( a pipe on a POSIX-compliant O/S, or even spending extra costs on building & using a protocol-stack-enriched, TCP-emulated pipe on other O/S-es )
- DES: a pickle.loads()-DES-erialiser RAM-allocate / RAM-copy / CPU-process step, to decode the set of parameters received from XFER inside the "remote"-process
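( A hedged, minimal way to see the SER and DES parts of these per-call costs on your own machine - a sketch only; the real multiprocessing backend adds the XFER, scheduling and process-management costs on top of this : )

>>> import pickle
>>> from zmq import Stopwatch; aClk = Stopwatch()             # a [us]-granular clock
>>>
>>> row = data[0]                                              # one per-call set of parameters
>>> aClk.start(); blob = pickle.dumps( ( row, 4, 8 ) ); aClk.stop()   # SER cost per call [us]
>>> aClk.start(); _    = pickle.loads( blob );          aClk.stop()   # DES cost per call [us]

Whatever the two numbers are on your localhost, their sum gets paid 1E7-times ( plus the XFER costs on top ).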
The costs of doing it this way are those that make the above ~ 5 to ~ 22 seconds grow into the observed minutes of slowed-down processing ( awfully inefficient given all the non-productive costs added, not only these SER / XFER / DES ones - for the process-instantiation add-on overhead costs you may read the raw details in the figure above, or read the full story ).
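( A back-of-the-envelope sketch of why this escalates from seconds into minutes - the per-call figures below are placeholders, so plug in the numbers actually measured on your localhost : )

>>> N_calls           = int( 1E7 )                             # calls dispatched to the workers
>>> per_call_overhead = 10 + 5 + 10                            # [us] SER + XFER + DES ( placeholder values )
>>> N_calls * per_call_overhead / 1E6                          # add-on, non-productive time [s]
250.0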
Last but not least, if that many [GB] got copied ( at a one-stop add-on cost ), the inefficiency problem does not stop there - see the Virtual Memory [MB] details above. On systems without that much physical RAM ( such footprints easily grow into [TB]-scale ), the O/S starts a so-called RAM-to-DISK swapping, so as to emulate as if there were that much RAM, now moving [GB]-blocks from the ~ 300 [ns] RAM-storage into a way slower ~ 10,000,000 [ns] DISK-storage, all through the bottleneck of a few physical RAM memory-I/O channels. Imagine a Formula 1 racing circuit, all cars running above 200 mph, suddenly having to cross the Potomac river using but a pair of steam-engine ferry-boats ... puff-puff-puff ... going at their own pace there and back, there and back, each time carrying no more than a few racing cars ( DATA ) - that slow are data transfers during the RAM-to-DISK swapping emulations.
It does not outright crash the O/S - fine - but there is indeed no big racing once this starts in the middle of the work ... this is why it is often called RAM-thrashing. So each new multiprocessing-spawned process moves you closer to this irreversible performance disaster, being a full copy of the Python Interpreter process. Even if some folks keep saying that a fork-backend can avoid so large a copy, actually it cannot - many O/S-es cannot fork at all, and most others report insecure or even self-deadlocking side-effects unless the spawn-backend is used, for indeed spawning a full, as-stateful-as-possible copy of the main Python Interpreter process ( plus, add to this that some problems with even this top-down, full-copy statefulness ambition of the Python Interpreter process still remain as of 2022-Q2, so rather be even more careful about this ).
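( If the multiprocessing route is nevertheless taken, the start-method can at least be requested explicitly - a minimal sketch, using only the standard multiprocessing API; the howmany_within_range() and data names are the ones assumed above and must live at module level of an importable script, so that the spawn-backend can re-import them in each worker : )

import multiprocessing as mp

if __name__ == '__main__':
    ctx = mp.get_context( 'spawn' )                            # explicitly request the spawn-backend
    with ctx.Pool( processes = 4 ) as pool:                    # each worker is a full, new Interpreter process
        # every task below still pays the per-call SER / XFER / DES costs described above
        results = pool.starmap( howmany_within_range,
                                [ ( row, 4, 8 ) for row in data ] )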