
My algorithm contains some nested loops, and the whole thing takes far too long to run because of them. I am quite new to Python (as you may tell from my unprofessional code below :( ), and I hope someone can guide me to a way to speed up my code!

The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis uses a dynamically sized window: the window grows from 11*11 up to 31*31 until it contains enough valid values for the next round of calculation, for example like below:

import numpy as np

def ContextualWindows (arrb4,arrb5,pfire):
####arrb4,arrb5,pfire are 31*31 sampling windows from large 1500*6400 numpy array
    i=5
    while i in range (5,16):
        arrb4back=arrb4[15-i:16+i,15-i:16+i]
        ## only output the array data when it is 'large' enough 
        ## to have enough good quality data to do calculation       
        if np.ma.count(arrb4back)>=min(10,0.25*i*i):
            arrb5back=arrb5[15-i:16+i,15-i:16+i]
            pfireback=pfire[15-i:16+i,15-i:16+i]
            canfire=0
            i=20
        else: 
            i=i+1

###unknown pixel: background condition could not be characterized
    if i!=20: 
        canfire=1
        arrb5back=arrb5
        pfireback=pfire
        arrb4back=arrb4
    return (arrb4back,arrb5back,pfireback,canfire)

Then this dynamic window is fed into the next round of tests, for example:

b4backave=np.mean(arrb4Windows)
b4backdev=np.std(arrb4Windows)
if b4>b4backave+3.5*b4backdev:
    firetest=True
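
Roughly, the per-pixel flow looks like this (a simplified sketch only; the edge handling and names like arrB4 and firemask are illustrative, not my exact code):

# Illustrative sketch of the per-pixel contextual test, not the actual code.
# arrB4, arrB5, pFire stand for the full 1500*6400 (masked) input arrays.
rows, cols = arrB4.shape
firemask = np.zeros((rows, cols), dtype=bool)

for r in range(15, rows - 15):        # keep 15 pixels away from the edges
    for c in range(15, cols - 15):
        b4 = arrB4[r, c]
        # cut the 31*31 sampling windows around the candidate pixel
        win4 = arrB4[r-15:r+16, c-15:c+16]
        win5 = arrB5[r-15:r+16, c-15:c+16]
        winp = pFire[r-15:r+16, c-15:c+16]
        arrb4Windows, arrb5Windows, pfireWindows, canfire = ContextualWindows(win4, win5, winp)
        if canfire:                   # background could not be characterized
            continue
        b4backave = np.mean(arrb4Windows)
        b4backdev = np.std(arrb4Windows)
        if b4 > b4backave + 3.5*b4backdev:
            firemask[r, c] = True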

Running the whole code on my multiple 1500*6400 numpy arrays takes over half an hour, or even longer. Does anyone have an idea how to deal with this? Even a general pointer to which part I should put my effort into would be greatly helpful!

Many thanks!

  • Might be better for [codereview](http://codereview.stackexchange.com/)? – hd1 Apr 05 '15 at 16:24
  • Not a change to the complexity, but I would use `while 5 <= i < 16`. – Padraic Cunningham Apr 05 '15 at 16:40
  • You could also change it to a for loop with a break, with an else clause for i = 20, and then reverse the test... There are no nested loops, just lots of slicing; numpy is supposed to create views, so it shouldn't be that slow. – AChampion Apr 05 '15 at 16:44
  • Very hard to answer the question in the current state. Can you add sample input for the function and any imports to your code? Make it so we can copy, paste and run your code without needing to guess the format of your data. Try pasting it into a file yourself, and if it does not run then edit it until it does, and then update the code in the question to match. See [this question](http://stackoverflow.com/q/19268937/553404) for a data-as-code example, [these R guidelines](http://stackoverflow.com/a/5963610/553404) and [this SO help page](http://stackoverflow.com/help/mcve). – YXD Apr 05 '15 at 16:46
  • `codereview` doesn't get enough traffic. `numpy` speedup questions like this are quite common here on `SO`. – hpaulj Apr 05 '15 at 17:28
  • I'd focus on speeding up the inside of the `while` loop. Iterating 10 times isn't much of a problem if each iteration step is fast. But if this is a sliding window problem, there is a `strided` method that may be able to 'vectorize' it (see the sketch after this list). – hpaulj Apr 05 '15 at 17:38
  • @hpaulj that was true 3 years ago. 45K visitors/day is quite a bit of traffic in StackLand. – Mathieu Guindon Apr 05 '15 at 18:37
  • But `numpy` traffic is still weak, 134 questions tagged v 16,000 here. Same for followers. – hpaulj Apr 05 '15 at 19:08
  • A plus for moving this to `codereview` is that the standards for the question are tighter. A non-runnable excerpt like this would not be tolerated. – hpaulj Apr 05 '15 at 19:33
  • SO is in its own league. You can't compare a site that has dozens of millions of users against any other. But a lot of answerers on Code Review don't care what the language is... especially in the [performance] tag. You have a point about the *example code* though, but if everyone thought "I'll post on SO, CR is too small", then CR wouldn't have graduated like it did. CR doubled in size over the past 18 months... and I bet there would be more votes on CR if the question was on-topic and answered there. – Mathieu Guindon Apr 05 '15 at 19:36
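
To illustrate the 'strided' / vectorized sliding-window idea mentioned in the comments, here is a rough sketch using numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20+). It assumes a fixed 31*31 window, whereas the real algorithm grows the window adaptively, so it is only an approximation of the idea:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Small stand-in array: for the full 1500*6400 array the std computation below
# materializes multi-GB intermediates, so it would have to be done in chunks
# (or replaced by the cumsum trick shown in the second answer).
arrB4 = np.random.rand(150, 200)

win = sliding_window_view(arrB4, (31, 31))       # shape (120, 170, 31, 31), a view
b4backave = win.mean(axis=(-2, -1))              # background mean of every window
b4backdev = win.std(axis=(-2, -1))               # background std of every window

centre = arrB4[15:-15, 15:-15]                   # pixel at the centre of each window
firetest = centre > b4backave + 3.5*b4backdev    # boolean mask of candidate fire pixels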

2 Answers


Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:

def ContextualWindows(arrb4, arrb5, pfire):
    ''' arrb4,arrb5,pfire are 31*31 sampling windows from 
        large 1500*6400 numpy array '''

    for i in range (5, 16):
        lo = 15 - i  # 10..0
        hi = 16 + i  # 21..31
        #  only output the array data when it is 'large' enough
        #  to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi], pfire[lo:hi, lo:hi], 0)
    else:    #  unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)

For clarity I've used style guidelines from PEP 8 (extended comments, number of comment chars, spaces around operators etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and this happens only once per function call. The else clause will be executed only if the for-loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.

user1016274
  • The end points aren't fixed. He starts at 5, but can quit before 16. A for-loop with a break should work, but I don't think it will make much difference in speed. – hpaulj Apr 05 '15 at 20:09
  • My suggestion contains a `break` as well, in the `if...then` statement, only that it returns from the function altogether. It wouldn't work without one, as the OP wants to quit prematurely when the condition is met. So the loop may run to its end or may quit before that, just as in the original code. In my opinion the flow lends itself to a `for` loop with a `break` more than to a `while` loop. Avoiding copying parts of the arrays will probably have more effect than this, granted. – user1016274 Apr 05 '15 at 20:17
  • In my tests, 1 iteration takes 50us, all 10 take 500us. Breaking early if possible is a good thing. `for` with `break` is clearer syntax, but not much different in speed. – hpaulj Apr 05 '15 at 20:19

I've run some time tests with ContextualWindows and variants. One i step takes about 50 us, all ten about 500 us.

This simple iteration takes about the same time:

[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]

The iteration mechanism and the 'copying' of arrays are minor parts of the time. Where possible, numpy is making views, not copies.
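
As a quick illustrative check (not part of the original timings), numpy can confirm that basic slicing returns a view rather than a copy:

import numpy as np

arrb4 = np.arange(31*31).reshape(31, 31)
window = arrb4[10:21, 10:21]              # basic slicing, like arrb4[15-i:16+i, ...] with i=5
print(np.shares_memory(arrb4, window))    # True: the slice is a view, no data was copied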

I'd focus on either minimizing the number of these count steps, or speeding up the count.


Comparing times for various operations on these windows:

First, the time for 1 step:

In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop

Now for the 10 steps:

In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop

In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop

In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop

Simply indexing does not take much time, but testing for conditions takes substantially more.

cumsum is sometimes used to speed up sums over sliding windows. Instead of taking the sum (or mean) over each window, you calculate the cumsum and then use the difference between the front and the end of the window.

Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:

In [164]: %%timeit
   .....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
   .....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
   .....: 
10000 loops, best of 3: 43.1 us per loop

This is almost 10x faster than the (nearly) equivalent sum. Values don't quite match, but the timings suggest that this may be worth refining.
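
For completeness, here is one way the corner differences could be made exact, using a zero-padded summed-area table with all four corners (a sketch under the assumption of a plain, unmasked array; it was not part of the timing above):

import numpy as np

def window_sums(arr, i):
    ''' Exact sums over all centred (2*i+1) x (2*i+1) windows of a 2-D array,
        via a zero-padded 2-D cumulative sum (summed-area table). '''
    c = np.cumsum(np.cumsum(arr, 0), 1)
    c = np.pad(c, ((1, 0), (1, 0)), mode='constant')   # prepend a zero row/column
    n = 2*i + 1                                         # window width
    # sum of arr[r:r+n, s:s+n] = c[r+n, s+n] - c[r, s+n] - c[r+n, s] + c[r, s]
    return c[n:, n:] - c[:-n, n:] - c[n:, :-n] + c[:-n, :-n]

# quick check against a direct sum over the i=5 window of a 31*31 block
arrb4 = np.random.rand(31, 31)
i = 5
direct = arrb4[15-i:16+i, 15-i:16+i].sum()
print(np.allclose(direct, window_sums(arrb4, i)[15-i, 15-i]))   # True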

hpaulj