Improve performance of function without parallelization

Question

Some weeks ago I posted a question (Speed up nested for loop with elements exponentiation) which got a very good answer by abarnert. This question is related to that one since it makes use of the performance improvements suggested by said user.

I need to improve the performance of a function that involves calculating three factors and then applying an exponential on them.

Here's a MWE of my code:

import numpy as np
import timeit

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)

# Data lists.
array1 = np.array([random_data(4) for _ in range(1000)])
array2 = np.array([random_data(3) for _ in range(2000)])

# Function.
def func():
    # Empty list that holds all values obtained in for loop.    
    lst = []
    for elem in array1:
        # Avoid numeric errors if one of these values is 0.            
        e_1, e_2 = max(elem[0], 1e-10), max(elem[1], 1e-10)
        # Obtain three parameters.
        A = 1./(e_1*e_2)
        B = -0.5*((elem[2]-array2[:,0])/e_1)**2
        C = -0.5*((elem[3]-array2[:,1])/e_2)**2
        # Apply exponential.
        value = A*np.exp(B+C)
        # Store value in list.
        lst.append(value)

    return lst

# time function.
func_time = timeit.timeit(func, number=100)
print func_time

Is it possible to speed up func without having to recurr to parallelization?

You return from your function after the first iteration, is that intended? — Bas Swinckels, Feb 17 '14 at 20:34
Gah, no sorry bad indent. I'll fix it right now. Thanks for the heads up! — Gabriel, Feb 17 '14 at 20:36
Don't think it will make much of a difference here, but prefer `xrange` over `range` where possible. — PaF, Feb 17 '14 at 20:39
@Bas yep, fixed now. I go a bit overboard simplifying my code sometimes, sorry for not checking this more thoroughly. — Gabriel, Feb 17 '14 at 20:39
You probably also want to move the `lst = []` inside the function. Like this, you keep on adding to the same list, which will keep on growing every time you do timeit. — Bas Swinckels, Feb 17 '14 at 20:47
This article should give you some good ideas: http://ianozsvald.com/HighPerformancePythonfromTrainingatEuroPython2011_v0.2.pdf — jaho, Feb 17 '14 at 20:58

Claudiu · Accepted Answer · 2014-02-17T22:54:16.357

Here's what I have so far. My approach is to do as much of the math as possible across numpy arrays.

Optimizations:

Calculate As within numpy
Re-factor calculation of B and C by splitting them into factors, some of which can be computed within numpy

Code:

def optfunc():
    e0 = array1[:, 0]
    e1 = array1[:, 1]
    e2 = array1[:, 2]
    e3 = array1[:, 3]

    ar0 = array2[:, 0]
    ar1 = array2[:, 1]

    As = 1./(e0 * e1)
    Bfactors = -0.5 * (1 / e0**2)
    Cfactors = -0.5 * (1 / e1**2)

    lst = []
    for i, elem in enumerate(array1):
        B = ((elem[2] - ar0) ** 2) * Bfactors[i]
        C = ((elem[3] - ar1) ** 2) * Cfactors[i]

        value = As[i]*np.exp(B+C)

        lst.append(value)

    return lst

print np.allclose(optfunc(), func())

# time function.
func_time = timeit.timeit(func, number=10)
opt_func_time = timeit.timeit(optfunc, number=10)
print "%.3fs --> %.3fs" % (func_time, opt_func_time)

Result:

True
0.759s --> 0.485s

At this point I'm stuck. I managed to do it entirely without python for loops, but it is slower than the above version for a reason I do not yet understand:

def optfunc():
    x = array1
    y = array2

    x0 = x[:, 0]
    x1 = x[:, 1]
    x2 = x[:, 2]
    x3 = x[:, 3]

    y0 = y[:, 0]
    y1 = y[:, 1]

    A = 1./(x0 * x1)
    Bfactors = -0.5 * (1 / x0**2)
    Cfactors = -0.5 * (1 / x1**2)

    B = (np.transpose([x2]) - y0)**2 * np.transpose([Bfactors])
    C = (np.transpose([x3]) - y1)**2 * np.transpose([Cfactors])

    return np.transpose([A]) * np.exp(B + C)

Result:

True
0.780s --> 0.558s

However note that the latter gets you an np.array whereas the former only gets you a Python list... this might account for the difference but I'm not sure.

Thank you very much @Claudiu, I've gotten to a point where every bit of optimization matters so your answer will definitely help me. I've removed two factors from your answer that weren't being used. Cheers. — Gabriel, Feb 18 '14 at 12:42
Minor comment: in my question I defined `e_1` with `max(elem[0], 1e-10)` (same for `e_2`) to avoid numerical errors if either one was zero. What will happen in this case with your answer? — Gabriel, Feb 18 '14 at 12:49
@Gabriel: Notice my output printed `np.allclose(optfunc(), func())`, and got `True` - meaning that both our answers were close within a certain tolerance. I'm not sure what answer you expect if one of those is zero, since you're dividing by zero and then getting infinity.. but anyway if you want the exact same behavior, just do `e0 = array1[:, 0]; e0[e0 < 1e-10] = 1e-10` to set all elements below `1e-10` to `1e-10` — Claudiu, Feb 18 '14 at 17:50

Improve performance of function without parallelization

1 Answers1

Linked