
I'm working on a large code base and need to speed up a specific part of it. I've created the MWE shown below:

import numpy as np
import time

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N).tolist()

# Lists that contain all the data.
list1 = [random_data(10) for _ in range(1000)]
list2 = [random_data(1000), random_data(1000)]

# Start taking the time.
tik = time.time()

list4 = []
# Loop through all elements in list1.
for elem in list1:

    list3 = []
    # Loop through elements in list2.
    for elem2 in zip(*list2):

        A = np.exp(-0.5*((elem[0]-elem2[0])/elem[3])**2)
        B = np.exp(-0.5*((elem[1]-elem2[1])/elem[3])**2)
        list3.append(A*B)

    # Sum elements in list3 and append result to list4.
    sum_list3 = sum(list3) if sum(list3)>0. else 1e-06
    list4.append(sum_list3)

# Print the elapsed time.
print time.time()-tik

The weird format of `list1` and `list2` is because that's how this block of code receives them.

The obvious part where most of the time is spent is the repeated calculation of the A and B terms.

Is there any way I could speed up this block of code without parallelizing it (I tried that before and it gave me a lot of trouble)? I'm open to using any package: numpy, scipy, etc.


Add

This is the result of applying abarnert's optimizations and also Jaime's advice to use only one exponentiation. The optimized function is on average ~60x faster on my system.

import numpy as np
import timeit

def random_data(N):
    return np.random.uniform(0., 10., N).tolist()

# Lists that contain all the data.
list1 = [random_data(10) for _ in range(1000)]
list2 = [random_data(1000), random_data(1000)]

array1 = np.array(list1)
array2 = np.array(zip(*list2))


# Old non-optimized function.
def func1():
    list4 = []
    # Process all elements in list1.
    for elem in list1:
        # Process all elements in list2.
        list3 = []
        for elem2 in zip(*list2):
            A = np.exp(-0.5*((elem[0]-elem2[0])/elem[3])**2)
            B = np.exp(-0.5*((elem[1]-elem2[1])/elem[3])**2)
            list3.append(A*B)
        # Sum elements in list3 and append result to list4.
        sum_list3 = sum(list3) if sum(list3)>0. else 1e-06
        list4.append(sum_list3)

# New optimized function.
def func2():
    list4 = []
    # Process all elements in list1.
    for elem in array1:

        # Broadcast over elements in array2.
        A = -0.5*((elem[0]-array2[:,0])/elem[3])**2
        B = -0.5*((elem[1]-array2[:,1])/elem[3])**2
        array3 = np.exp(A+B)

        # Sum elements in array3 and append result to list4.
        sum_list3 = max(array3.sum(), 1e-06)
        list4.append(sum_list3)


# Get time for both functions.
func1_time = timeit.timeit(func1, number=10)
func2_time = timeit.timeit(func2, number=10)

# Print how many times faster func2 is than func1.
print func1_time/func2_time
Gabriel
  • Why is this code all list-based when you have a NumPy dependency right at the top? – user2357112 Jan 21 '14 at 21:59
  • No particular reason, that's just how this code receives `list1` and `list2` from another piece of code. About `list3` and `list4`, that's the best way I could figure out how to fill them. They can all be converted to numpy arrays if you think that would make a difference. – Gabriel Jan 21 '14 at 22:05
  • @Gabriel: Of course it would make a difference. That's the whole point of using `numpy`—if you can broadcast a computation over an array, you replace a Python loop with a C loop, and remove all the boxing/unboxing around each arithmetic computation, meaning your code typically gets anywhere from 4-400x faster. – abarnert Jan 21 '14 at 22:09
  • +1 for posting a good MWE. A great example of putting in the work on asking a question and getting a fantastic answer as a result. Bookmarked so I can link to it in the future. – YXD Jan 22 '14 at 00:19
  • One last side note: You shouldn't try to time things yourself using `time.time`. The [`timeit`](http://docs.python.org/3/library/timeit.html) module (or, if you're using IPython, the `%timeit` magic statement) makes sure to pick the right timer, takes care of a bunch of issues you wouldn't have even thought up, lets you repeat the tests and summarize them properly, and makes things simpler to boot. (When your code is taking 100x longer than you expected, it's usually not that big a deal, but it's worth getting in the habit of always reaching for `timeit`.) – abarnert Jan 22 '14 at 01:12
  • You are absolutely correct, so far I've never used that module but I'll try to apply it now to calculate how many times faster the optimized function is. I'll add this to the question. – Gabriel Jan 22 '14 at 01:20

1 Answer


You want to gradually convert this over from using lists and loops to using arrays and broadcasting, grabbing the easiest and/or most time-critical parts first until it's fast enough.

The first step is to stop doing that `zip(*list2)` over and over (especially if this is Python 2.x). While we're at it, we might as well store the result in an array, and do the same with `list1`; you can still iterate over them for now. So:

array1 = np.array(list1)
array2 = np.array(zip(*list2))
# …
for elem in array1:
    # …
    for elem2 in array2:

This won't speed things up much—on my machine, it takes us from 14.1 seconds to 12.9—but it gives us somewhere to start working.

You should also remove the double calculation of `sum(list3)`:

sum_list3 = sum(list3)
sum_list3 = sum_list3 if sum_list3>0. else 1e-06

Meanwhile, it's a bit odd that you want value <= 0 to go to 1e-6, but 0 < value < 1e-6 to be left alone. Is that really intentional? If not, you can fix that, and simplify the code at the same time, by just doing this:

sum_list3 = max(array3.sum(), 1e-06)
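To make the semantic difference concrete, here is a tiny illustration with a hypothetical sum of `1e-12` (a made-up value, chosen just to hit the edge case) where the two versions disagree:

```python
s = 1e-12

# Original conditional: only a sum of exactly zero (or below) is replaced,
# so a tiny positive value passes through untouched.
old = s if s > 0. else 1e-06   # stays 1e-12

# max() version: anything below the floor is pulled up to it.
new = max(s, 1e-06)            # becomes 1e-06
```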

Now, let's broadcast the A and B calculations:

# Broadcast over elements in array2.
A = np.exp(-0.5*((elem[0]-array2[:,0])/elem[3])**2)
B = np.exp(-0.5*((elem[1]-array2[:,1])/elem[3])**2)
array3 = A*B

# Sum elements in array3 and append result to list4.
sum_list3 = max(array3.sum(), 1e-06)

list4.append(sum_list3)

And this gets us down from 12.9 seconds to 0.12. You could go a step further by also broadcasting over array1, and replacing list4 with a pre-allocated array, and so forth, but this is probably already fast enough.
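For reference, broadcasting over `array1` as well might look something like the sketch below. It is not the answer's code, just one possible final step: the generated data and its shapes are assumptions matching the MWE above, and the `[:, None]` indexing adds an axis so that every row of `array1` is paired with every row of `array2` in a single pair of array expressions.

```python
import numpy as np

# Stand-in data with the same shapes as the MWE: array1 is (1000, 10),
# array2 is (1000, 2). The seed is arbitrary.
rng = np.random.RandomState(0)
array1 = rng.uniform(0., 10., (1000, 10))
array2 = rng.uniform(0., 10., (1000, 2))

# array1[:, k][:, None] has shape (1000, 1) and array2[:, k] has shape
# (1000,), so each difference broadcasts to a (1000, 1000) matrix of
# exponents: one row per element of array1, one column per row of array2.
A = -0.5 * ((array1[:, 0][:, None] - array2[:, 0]) / array1[:, 3][:, None]) ** 2
B = -0.5 * ((array1[:, 1][:, None] - array2[:, 1]) / array1[:, 3][:, None]) ** 2

# One exp call over the whole matrix, one sum per row, and the 1e-06 floor
# applied elementwise; array4 replaces list4 from the loop version.
array4 = np.maximum(np.exp(A + B).sum(axis=1), 1e-06)
```

This also follows Jaime's advice from the comments, since `np.exp` is called once on `A + B` instead of twice.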

abarnert
  • Saulo: Clever improvement in your edit (using `max(array3.sum(), 1e-06)`), but it changes the semantics a bit. With the OP's code, say, 1e-12 will be left alone; with yours, it will be changed to 1e-6. That could be a bug in the OP's code rather than intended functionality, but I don't want to just assume that. I added an explanation into the answer. – abarnert Jan 21 '14 at 22:38
  • Exponentiation is expensive: instead of taking `np.exp` twice and then multiplying, add the two values together and then take `np.exp`. – Jaime Jan 21 '14 at 23:12
  • @Jaime: We've already got a 99% speedup, and I doubt he needs a further 10%. (I estimated that quickly—skipping the `exp` cuts the time to .79x, so doing one instead of two is probably about .9x.) But I'll bet you could get a whole lot more improvement by moving it out a level (broadcast `exp` over all the rows at once, then `sum` all the rows, instead of doing it once per row), so I think if the existing improvement isn't enough, broadcasting over `array1` and `array4` should be the first step, then look to optimizing the math within them. – abarnert Jan 21 '14 at 23:19
  • Amazing answer! I'll check it out in depth as soon as I get home. As for the `1e-06` number, I'm just trying to avoid `0` values in `sum_list3` because they cause troubles later on, so I just replace any instance where the sum adds up to zero to a small value. The value `1e-06` was chosen at random, it has no particular significance. – Gabriel Jan 21 '14 at 23:33
  • If you vectorize the full thing that gains you about an extra 20%, which is, I have to agree, negligible over the first 99%. But this is one of my pet peeves, like not taking the square root of a distance if having it squared is enough for what you are doing, I just couldn't resist, sorry for the noise... – Jaime Jan 21 '14 at 23:35
  • @Jaime: It's not noise; it's worth having additional optimizations in the comments—in case the OP needs them, and also just for the OP to learn from. I was just explaining why I didn't think it needed to go into the answer, it's probably good enough as a comment. – abarnert Jan 21 '14 at 23:52
  • @Jaime how could I vectorize the block of code? I've never done that before. Just for the record: I'm after *every drop of optimization* I can squeeze out of python. I have to run this block of code several thousand times so any gain no matter how small will be multiplied in my actual code. – Gabriel Jan 22 '14 at 00:18
  • @Gabriel: People often say that without thinking, but consider this: Is it worth 2 hours of work to speed up your code for, say, a gain of 180 microseconds per run? Do that 10000 times, and you've saved at best a total of negative 3598.2 seconds—and you've increased the chances of it being incorrect, and the amount of time it will take you to debug it (especially if you find the new version harder to understand). – abarnert Jan 22 '14 at 00:36
  • @abarnert yes, that makes sense. I guess it would only matter if the speed up is enough to counter the time spent, but I can't know unless Jaime tells me how its done :) By the way, please take a look at the question, I've updated it with some follow up to your answer, it's not working as expected (actually it's working exactly backwards) – Gabriel Jan 22 '14 at 00:40
  • Please ignore the above comment. The speed up shows when I apply the _whole_ code you posted, not just the first part. – Gabriel Jan 22 '14 at 00:56
  • @Gabriel: The key thing here is trying to figure out how I guessed what would be worth converting from loops to broadcasting, how I converted it, why it works, and why it's faster. If you know all that, you can optimize things efficiently whenever you need to. (Of course you'll still need to google or ask at SO for help sometimes—at least _I_ certainly need to… but you know what I mean.) So if part of it isn't clear, ask for clarifications; that'll probably be worth more in the long run (and maybe even the short run) than asking for further speedups to this code. – abarnert Jan 22 '14 at 01:09