0

I was able to improve a code written in python a lot with numpy because of the dot product. Now I still have one part of the code which is still very slow. I still don't understand multithreading and if this could help here. In my opinion this should be possible here. Do you have a nice idea what to do here?

for x1 in range(a**l):
    for x2 in range(a**l):
        for x3 in range(a**l):
            f11 = 0
            cv1 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
            cv2 = numpy.ndarray.sum(
            numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
            if cv1 == 0:
                f11 += 1
            if cv2 == 0:
                f11 += 1
            re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                        numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
            f11 *= 1.0*(1-r)/2
            re[x1][x2][x3] += f11
Rahul K P
  • 15,740
  • 4
  • 35
  • 52
HighwayJohn
  • 881
  • 1
  • 9
  • 22
  • 3
    A triply nested loop with each level running `a ** l` times? Unless `a` is `1`, or `a` and `l` are very small, this is a ridiculously large amount of work to do, `O((a**l)**3)`. For `a` of 3, `l` of 10, you'd be looking at 205 _trillion_ executions of the inner loop work; even if you did nothing in the body of the inner loop, that's a lot, and you're not doing nothing. – ShadowRanger May 06 '16 at 21:11
  • Stop guessing and [*try this*](http://stackoverflow.com/a/4299378/23771). – Mike Dunlavey May 06 '16 at 21:34
  • a is usually 2 and l up to 10... – HighwayJohn May 06 '16 at 21:40
  • 1
    @HighwayJohn: Usually 2? Or always 2? Because like I said, it makes a big difference. `a` of 2 and `l` of 10 is "only" a billion executions of the inner loop, but if `a` might be any larger value, the scaling factor is enormous. Even just generating and discarding the numbers the fastest possible way (`deque(itertools.product(range(2**10), repeat=3), 0)`) takes over 10 seconds on my machine. For `a == 3` (47.5 bits of work), you'd expect it to take about 192,000x longer, about 22 days. Don't even think about `a == 4`. – ShadowRanger May 06 '16 at 21:52
  • So far, and that will probably not change it is always 2 luckily. If it would be larger I would of course make l smaller. Its a simulation, thus I can decide it. Do you have any idea how to improve my code? – HighwayJohn May 06 '16 at 22:15
  • What's the shape of `ws`? how do you initial `re`? Why are you using `re[x1][x2][x3]` syntax instead of `re[x1, x2, x3]`? Is `ws[x1]` is a scalar or array? – hpaulj May 07 '16 at 03:05
  • That is how I initial re: `re = numpy.zeros((a**l, a**l, a**l))`. `ws = wp[:, 3:l+3]` looks like this and wp looks like `wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1 wp = numpy.hstack([wp.sum(1,keepdims=True), wp])` – HighwayJohn May 07 '16 at 11:39

2 Answers2

2

I'm attempted to re-create the conditions that the question was interested in, but first a smaller test case to illustrate a strategy. First the author's original implementation:

import numpy as np
import numba as nb
import numpy

def func(re, ws, a, l, r):

    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0
                cv1 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
                cv2 = numpy.ndarray.sum(
                numpy.absolute(numpy.subtract(ws[x1], ws[x3])))
                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                re[x1][x2][x3] = 1.0*r/(a**l-2)*(numpy.product(numpy.absolute(
                            numpy.subtract((2*ws[x1]+ws[x2]+ws[x3]), 2)))-f11)
                f11 *= 1.0*(1-r)/2
                re[x1][x2][x3] += f11

Now with a simple translation to Numba, which is really well suited to these types of deeply nested looping problems when you're dealing with numpy arrays and numerical calculations:

@nb.njit
def func2(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = np.sum(np.abs(ws[x1] - ws[x2]))
                cv2 = np.sum(np.abs(ws[x1] - ws[x3]))

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = np.prod(np.abs(2*ws[x1]+ws[x2]+ws[x3] -  2)) - f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

and then with some further optimizations to get rid of temporary array creation:

@nb.njit
def func3(re, ws, a, l, r):
    for x1 in range(a**l):
        for x2 in range(a**l):
            for x3 in range(a**l):
                f11 = 0.0
                cv1 = 0.0
                cv2 = 0.0
                for i in range(ws.shape[1]):
                    cv1 += np.abs(ws[x1,i] - ws[x2,i])
                    cv2 += np.abs(ws[x1,i] - ws[x3,i])

                if cv1 == 0:
                    f11 += 1
                if cv2 == 0:
                    f11 += 1
                y = 1.0
                for i in range(ws.shape[1]):
                    y *= np.abs(2.0*ws[x1,i] + ws[x2,i] + ws[x3,i] - 2)
                y -= f11
                re[x1,x2,x3] = 1.0*r/(a**l-2)*y
                f11 *= 1.0*(1-r)/2
                re[x1,x2,x3] += f11

So some simple test data:

a = 2
l = 5
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))

and now let's check that all three functions produce the same result:

re = numpy.zeros((a**l, a**l, a**l))
func(re, ws, a, l, r)

re2 = numpy.zeros((a**l, a**l, a**l))
func2(re2, ws, a, l, r)

re3 = numpy.zeros((a**l, a**l, a**l))
func3(re3, ws, a, l, r)

print np.allclose(re, re2)  # True
print np.allclose(re, re3)  # True

And some initial timings using the jupyter notebook %timeit magic:

%timeit func(re, ws, a, l, r)
%timeit func2(re2, ws, a, l, r)
%timeit func3(re3, ws, a, l, r)

1 loop, best of 3: 404 ms per loop
100 loops, best of 3: 14.2 ms per loop
1000 loops, best of 3: 605 µs per loop

func2 is ~28x times faster than the original implementation. func3 is ~680x faster. Note that I'm running on a Macbook laptop with an i7 processor, 16 GB of RAM and using Numba 0.25.0.

Ok, so now let's time the a=2 l=10 case that everyone is wringing their hands about:

a = 2
l = 10
r = 0.2
wp = (numpy.arange(2**l)[:,None] >> numpy.arange(l)[::-1]) & 1
wp = numpy.hstack([wp.sum(1,keepdims=True), wp])
ws = wp[:, 3:l+3]
re = numpy.zeros((a**l, a**l, a**l))
print 'setup complete'

%timeit -n 1 -r 1 func3(re, ws, a, l, r)

# setup complete
# 1 loop, best of 1: 45.4 s per loop

So this took 45 seconds on my machine single threaded, which seems reasonable if you aren't then doing this one calculation too many times.

JoshAdel
  • 66,734
  • 27
  • 141
  • 140
  • Wow thanks a lot!! I will try this as fast out as possible :) – HighwayJohn May 08 '16 at 15:50
  • What if you are dealing with json in a nested for-loop, where you need to get values of some keys and put them into a dataframe ? What is the best approach for that ? – ShashankAC Feb 24 '20 at 06:36
  • @ShashankAC that's a largely unrelated problem that should be posed in a new question. – JoshAdel Feb 24 '20 at 20:06
  • Alright, I asked a separate question. https://stackoverflow.com/questions/60387457/how-to-optimize-a-nested-for-loop-looping-over-json-data-to-extract-values-of-c – ShashankAC Feb 25 '20 at 04:20
1

Before worrying about multiprocessing, try using what plain numpy offers.

First, make sure ws are numpy arrays, not lists of some such. Then

cv1 = numpy.ndarray.sum(numpy.absolute(numpy.subtract(ws[x1], ws[x2])))
if cv1 == 0:
   f11 += 1

becomes f11 = np.nonzero(ws[x1] == ws[x2]).

Do the same for the rest of the code, and you'll be able to see more structure: np.product is just * and so on.

Then, re[x1][x2][x3] is not how you normally index numpy arrays, use re[x1, x2, x3]. This alone would save you quite a bit of time and memory allocations.

Once this is done, look if you can actually vectorize the expressions, i.e. use numpy all-array operations instead of plain python loops (it's likely that you can, but it's too hard to see with the current state of the code snippet).

ev-br
  • 24,968
  • 9
  • 65
  • 78