Performance issue in python with nested loop

Question

I was able to improve a code written in python a lot with numpy because of the dot product. Now I still have one part of the code which is still very slow. I still don't understand multithreading and if this could help here. In my opinion this should be possible here. Do you have a nice idea what to do here?

for x1 in range(a**l):
    for x2 in range(a**l):
        for x3 in range(a**l):
            f11 = 0
            cv1 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
            cv2 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
            if cv1 == 0:
                f11 += 1
            if cv2 == 0:
                f11 += 1
            re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                        numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
            f11 *= 1.0*(1-r)/2
            re[x1][x2][x3] += f11

A triply nested loop with each level running `a ** l` times? Unless `a` is `1`, or `a` and `l` are very small, this is a ridiculously large amount of work to do, `O((a**l)**3)`. For `a` of 3, `l` of 10, you'd be looking at 205 _trillion_ executions of the inner loop work; even if you did nothing in the body of the inner loop, that's a lot, and you're not doing nothing. — ShadowRanger, May 06 '16 at 21:11
Stop guessing and [*try this*](http://stackoverflow.com/a/4299378/23771). — Mike Dunlavey, May 06 '16 at 21:34
@HighwayJohn: Usually 2? Or always 2? Because like I said, it makes a big difference. `a` of 2 and `l` of 10 is "only" a billion executions of the inner loop, but if `a` might be any larger value, the scaling factor is enormous. Even just generating and discarding the numbers the fastest possible way (`deque(itertools.product(range(2**10), repeat=3), 0)`) takes over 10 seconds on my machine. For `a == 3` (47.5 bits of work), you'd expect it to take about 192,000x longer, about 22 days. Don't even think about `a == 4`. — ShadowRanger, May 06 '16 at 21:52
So far, and that will probably not change it is always 2 luckily. If it would be larger I would of course make l smaller. Its a simulation, thus I can decide it. Do you have any idea how to improve my code? — HighwayJohn, May 06 '16 at 22:15
What's the shape of `ws`? how do you initial `re`? Why are you using `re[x1][x2][x3]` syntax instead of `re[x1, x2, x3]`? Is `ws[x1]` is a scalar or array? — hpaulj, May 07 '16 at 03:05
That is how I initial re: `re = numpy.zeros((a**l, a**l, a**l))`. `ws = wp[:, 3:l+3]` looks like this and wp looks like `wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1 wp = numpy.hstack([wp.sum(1,keepdims=True), wp])` — HighwayJohn, May 07 '16 at 11:39

JoshAdel · Accepted Answer · 2016-05-08T14:07:28.067

I'm attempted to re-create the conditions that the question was interested in, but first a smaller test case to illustrate a strategy. First the author's original implementation:

import numpy as np
import numba as nb
import numpy

def func(re, ws, a, l, r):

    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0
                cv1 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
                cv2 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                            numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
                f11 *= 1.0*(1-r)/2
                re[x1][x2][x3] += f11

Now with a simple translation to Numba, which is really well suited to these types of deeply nested looping problems when you're dealing with numpy arrays and numerical calculations:

@nb.njit
def func2(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = np.sum(np.abs(ws[x1] - ws[x2]))
                cv2 = np.sum(np.abs(ws[x1] - ws[x3]))

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = np.prod(np.abs(2*ws[x1]+ws[x2]+ws[x3] -  2)) - f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

and then with some further optimizations to get rid of temporary array creation:

@nb.njit
def func3(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = 0.0
                cv2 = 0.0
                for i in range(ws.shape[1]):
                    cv1 += np.abs(ws[x1,i] - ws[x2,i])
                    cv2 += np.abs(ws[x1,i] - ws[x3,i])

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = 1.0
                for i in range(ws.shape[1]):
                    y *= np.abs(2.0*ws[x1,i] + ws[x2,i] + ws[x3,i] - 2)
                y -= f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

So some simple test data:

a = 2
l = 5
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))

and now let's check that all three functions produce the same result:

re = numpy.zeros((a**l, a**l, a**l))
func(re, ws, a, l, r)

re2 = numpy.zeros((a**l, a**l, a**l))
func2(re2, ws, a, l, r)

re3 = numpy.zeros((a**l, a**l, a**l))
func3(re3, ws, a, l, r)

print np.allclose(re, re2)  # True
print np.allclose(re, re3)  # True

And some initial timings using the jupyter notebook %timeit magic:

%timeit func(re, ws, a, l, r)
%timeit func2(re2, ws, a, l, r)
%timeit func3(re3, ws, a, l, r)

1 loop, best of 3: 404 ms per loop
100 loops, best of 3: 14.2 ms per loop
1000 loops, best of 3: 605 µs per loop

func2 is ~28x times faster than the original implementation. func3 is ~680x faster. Note that I'm running on a Macbook laptop with an i7 processor, 16 GB of RAM and using Numba 0.25.0.

Ok, so now let's time the a=2 l=10 case that everyone is wringing their hands about:

a = 2
l = 10
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))
print 'setup complete'

%timeit -n 1 -r 1 func3(re, ws, a, l, r)

# setup complete
# 1 loop, best of 1: 45.4 s per loop

So this took 45 seconds on my machine single threaded, which seems reasonable if you aren't then doing this one calculation too many times.

Wow thanks a lot!! I will try this as fast out as possible :) — HighwayJohn, May 08 '16 at 15:50
What if you are dealing with json in a nested for-loop, where you need to get values of some keys and put them into a dataframe ? What is the best approach for that ? — ShashankAC, Feb 24 '20 at 06:36
@ShashankAC that's a largely unrelated problem that should be posed in a new question. — JoshAdel, Feb 24 '20 at 20:06
Alright, I asked a separate question. https://stackoverflow.com/questions/60387457/how-to-optimize-a-nested-for-loop-looping-over-json-data-to-extract-values-of-c — ShashankAC, Feb 25 '20 at 04:20

score 1 · Answer 2 · answered May 07 '16 at 09:05

1

Before worrying about multiprocessing, try using what plain numpy offers.

First, make sure ws are numpy arrays, not lists of some such. Then

cv1 = numpy.ndarray.sum(numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
if cv1 == 0:
   f11 += 1

becomes f11 = np.nonzero(ws[x1] == ws[x2]).

Do the same for the rest of the code, and you'll be able to see more structure: np.product is just * and so on.

Then, re[x1][x2][x3] is not how you normally index numpy arrays, use re[x1, x2, x3]. This alone would save you quite a bit of time and memory allocations.

Once this is done, look if you can actually vectorize the expressions, i.e. use numpy all-array operations instead of plain python loops (it's likely that you can, but it's too hard to see with the current state of the code snippet).

answered May 07 '16 at 09:05

ev-br

24,968
9
65
78

Why is it a different if I write `re[x1][x2][x3]` instead of `re[x1, x2, x3]` in terms of memory allocation? – HighwayJohn May 07 '16 at 14:00
I tried your code and it doesn't work:`Expected 'Union[ndarray, Iterable]', got bool instead`. – HighwayJohn May 07 '16 at 14:10
1

`re[x1][x2][x3]` is equivalent to `re1=re[x1]; re2=re1[x2]; return re2[x3]`, even though you don't see those temporary variables `re1` and `re2`. You do this in the inner loop. – Han-Kwang Nienhuys May 07 '16 at 16:49
Okay I changed this but it didn't really speed it up – HighwayJohn May 07 '16 at 18:48
ws is a `l *a**l` numpy.nsdarray. I think that is also why your code don't work. You can see in the comments above how it is created – HighwayJohn May 07 '16 at 18:52

Performance issue in python with nested loop

2 Answers2

Linked