
I coded a neural network that runs really slowly, so I was hoping to speed it up a little by parallelizing a certain loop.

I am not sure about the implementation, how the GIL works, and whether it's relevant for me.

The code looks like this:

class net():
    def train(self, inputs):
        ...
        # batch and learn_rate are set up in the elided code
        for tset in batch:
            self.backprop(tset, learn_rate)

    def backprop(self, tset, learn_rate):
        ...
        # n is the current neuron, set up in the elided code
        for c, w in zip(n.inconnections, range(len(n.inconnections))):
            c.current_error += n.current_error * n.inconnectionweights[w]
            n.icweightupdates[w] += -learn_rate * n.current_error * c.activation
        n.biasupdate += -learn_rate * n.current_error

The loop in train() is the one I want to parallelize, since batch contains a set of 20 training samples that could be processed independently.

– Eumel

2 Answers


While "AGILE" evangelisation gang beats their drums louder and louder,

Never start coding before you know it makes sense to do so:

Why?

You can indeed spend an infinite amount of time coding a thing that made no sense to start with, whereas a reasonable amount of time spent on due systems engineering will help you determine which steps are reasonable and which are not.

So, let's start from common ground: what is your motivation for parallelisation -- yes, PERFORMANCE.

If we both agree on this, let's review this new domain of expertise.

It is not hard to sketch code which is awfully bad at this.

It is increasingly important to be able both to understand and to design code that can harness contemporary hardware infrastructures to their maximum possible performance.

Parallelisation in general is intended for many other reasons, not just performance -- your case, as proposed above, is in the end actually a "just"-[CONCURRENT] type of process scheduling. Not true-[PARALLEL]? Sure -- unless it were run on a machine with more than 21 CPU cores, HyperThreading off, and all operating-system processes stopped for the time being, until all 20 examples were processed, there and back, through all the loops, until the global minimiser converged.

And imagine: you try to run just 20 machine-learning examples, while real-world DataSETs have many thousands ( well beyond hundreds of thousands in my problem domain ) of examples to process, so you will never reach such an extreme of true-[PARALLEL] process scheduling this way.

Best start by understanding Amdahl's Law in its full context first ( an overhead-naive formulation does freshmen experimenters a bad service here -- better to first master, in full detail, also the Criticism section of the updated post, on both the overheads and the resources-bound limits, before voting for "parallelisation at any cost" -- even if half a dozen "wannabe gurus" advise you to do so and many PR-motivated media shout at you to go parallel ).
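
For a first gut feeling, a minimal sketch of an overhead-aware Amdahl formulation ( the function and its parameter names are mine, for illustration only ):

def amdahl_speedup( p, N, overhead = 0.0 ):
    # overhead-aware Amdahl's Law:
    #   p        ... fraction of the original runtime that can be parallelised
    #   N        ... number of workers
    #   overhead ... setup + termination costs, as a fraction of the original runtime
    return 1.0 / ( ( 1.0 - p ) + p / N + overhead )

print( amdahl_speedup( 0.95, 4 ) )                  # ~3.48x with zero overhead
print( amdahl_speedup( 0.95, 4, overhead = 0.30 ) ) # ~1.70x once spawning costs bite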

Next, read this about the details and differences that may come just from using better or more appropriate tools ( once the principal Speedup ceilings of Amdahl's Law are understood, a speedup of +512x will shock you and set the gut feeling for what does and does not make sense ). This is relevant for every performance-bottleneck review and re-engineering. Most Neural Networks spend immense amounts of time ( not due to poor code performance, but because of immense DataSET sizes ) re-running the Feed-Forward + Back-Propagation phases, where vectorised code is harder to design than plain python code -- but this is where performance can be gained, not lost.
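
As a hypothetical micro-benchmark of that claim ( the vector size of 1000 is arbitrary, chosen only for illustration ):

import timeit
import numpy as np

w = np.random.rand( 1000 )   # a made-up weight vector
x = np.random.rand( 1000 )   # a made-up activation vector

def plain_python_dot():
    total = 0.0
    for i in range( len( w ) ):
        total += w[i] * x[i]
    return total

print( timeit.timeit( plain_python_dot,       number = 1000 ) )
print( timeit.timeit( lambda: np.dot( w, x ), number = 1000 ) )  # typically orders of magnitude faster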

If python implementations are to show their possible speeds, rather use the smart vectorisation tools from numpy, and design your code to incur minimum overheads by systematically using views, instead of repeatedly losing performance on memory allocations when passing copies of dense matrices around the Neural Network.
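
A minimal sketch of the view-vs-copy distinction ( the 1000 x 1000 shape is arbitrary ):

import numpy as np

W = np.random.rand( 1000, 1000 )   # a dense weight matrix

row_view = W[0, :]                 # a view: zero new allocations, shares W's buffer
row_copy = W[0, :].copy()          # a copy: a fresh allocation on every call

row_view[0] = 42.0                 # writing through the view changes W itself
print( W[0, 0] )                   # -> 42.0
print( row_view.base is W )        # -> True  ( borrows W's memory )
print( row_copy.base is None )     # -> True  ( owns its own memory )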

There are so many cases where just-coding delivers awfully bad processing performance, whatever craftsmanship may have been poured into the code, whereas a bit of mathematics shows a smart re-engineering of the computing process, and process performance suddenly jumps a few orders of magnitude ( yes, 10x, 100x faster just by using the human brain and critical thinking ).

Yes, it is hard to design fast code, but nobody has promised you a free dinner, have they?


Last, but not least:

never let all [CONCURRENT] tasks do exactly the same job, much less repeat it over and over:

for c, w in zip( n.inconnections, range( len( n.inconnections ) ) ):
    ...

This syntax is easy to code, but it introduces a re-calculated zip() into each and every "wanted-to-get-accelerated" task. No. Indeed a bad idea, if performance is still in mind.
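
If the loop itself must stay, one micro-refactoring ( a sketch reusing the names from the question; it does not cure the scheduling overheads discussed above ) avoids rebuilding the zip( ..., range( len( ... ) ) ) pairing:

# reads the same data, but never re-constructs zip( ..., range( len( ... ) ) )
for w, c in enumerate( n.inconnections ):
    c.current_error += n.current_error * n.inconnectionweights[w]
    n.icweightupdates[w] += -learn_rate * n.current_error * c.activation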

– user3666197
  • the primary objective was to have very understandable code, more as a proof of concept. Wanting to parallelize it was just due to me looking at the task manager and thinking "gee, it's only using 20% of the CPU, maybe it would run a little faster in parallel". Nonetheless, great answer and good linking; I hope it gets some more votes so it's easier for other people to find. – Eumel Nov 09 '17 at 08:56

Python threads will not make this faster, because of the GIL. Instead, you could try processes:

import functools
import multiprocessing

class net():
    def train(self, inputs):
        ...
        # a lambda cannot be pickled for multiprocessing, so bind the
        # extra argument with functools.partial instead
        with multiprocessing.Pool() as p:
            biasupdates = p.map(functools.partial(self.backprop, learn_rate=learn_rate), batch)
        n.biasupdate += sum(biasupdates)

    def backprop(self, tset, learn_rate):
        ...
        for c, w in zip(n.inconnections, range(len(n.inconnections))):
            c.current_error += n.current_error * n.inconnectionweights[w]
            n.icweightupdates[w] += -learn_rate * n.current_error * c.activation
        return -learn_rate * n.current_error
– Valentin Lorentz
  • why exactly did you put biasupdates into the outer loop? I'm also directly saving the updates (bias and weight) into variables of my Neurons, so what do I need the return for? – Eumel Nov 08 '17 at 15:10
  • With all due respect, Valentin, you have not told Eumel the whole story -- the add-on costs you recommend paying for spawning a `multiprocessing.Pool()` are **the biggest sin** a Computer-Science-aware professional may commit to an interested student. While your code may remain syntactically correct ( and keeps its promise to escape the GIL-stepped duck-duck-go dancing ), it is principally flawed **1)** by adding "immense" add-on setup/termination processing overheads + **2)** by repeating an "immense" portion of the work inside the parallelised blocks of code. You might want to revise the post. – user3666197 Nov 08 '17 at 15:44
  • @Eumel I did this because backprop is run in a separate process, so it cannot directly change variables in the original process. The return is to send the value to the original process. – Valentin Lorentz Nov 08 '17 at 16:08
  • @user3666197 I know, but it is worth paying the overhead if backprop takes a long time to run. I merely made a suggestion; the OP can test both versions and keep the fastest one. – Valentin Lorentz Nov 08 '17 at 16:09
  • Valentin, the NN-backprop **never** "*takes a long time to run*". If it did, the whole Neural Network artillery would become useless, being *almost* un-trainable in any near future. The very opposite is true. Backprop is the most performance-polished piece of NN software. Proposing to "delegate" a freshly instantiated, single-shot process just to run a single backprop step is the worst thing one can do for performance -- you pay all the process-related overhead costs + all the memory-fetch costs, but never, NEVER, reuse a cache-line, as you terminate the process right afterwards. – user3666197 Nov 09 '17 at 09:56