
I have the following code, which involves summing over a number of nested for loops.

What are some ways I can speed up the execution of this code? I am interested in more than just one method of speeding it up; I'd really like to see a range of methods, e.g. pure Python, NumPy, SciPy, Cython, etc.

That way, for the similar but (much) more complicated code I have to write, I can choose a speed-up option that gives a good trade-off of execution speed vs. complexity of implementation. Anything to save me from having to write C++ code, which would cause me to lose the will to live.

def f(a, b, c, d):
    return a + b + c + d

# 5-point Gauss-Legendre nodes and weights, mapped to the interval [0, 1]
x = [0.04691008, 0.23076534, 0.5,        0.76923466, 0.95308992]
w = [0.11846344, 0.23931434, 0.28444444, 0.23931434, 0.11846344]
numQuadNodes = 5

def tensorGauss(func):
    total = 0  # 'total' rather than 'sum', to avoid shadowing the built-in
    for i in range(numQuadNodes):
        for j in range(numQuadNodes):
            for k in range(numQuadNodes):
                for l in range(numQuadNodes):
                    total += w[i]*w[j]*w[k]*w[l]*func(x[l], x[k], x[j], x[i])

    return total

print(tensorGauss(f))
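
For reference, this nested sum is the 4D tensor-product Gauss-Legendre rule on $[0,1]^4$ (the x values are the 5-point Gauss-Legendre nodes mapped to $[0,1]$, and the w values the corresponding weights):

$$\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 f(a,b,c,d)\,da\,db\,dc\,dd \;\approx\; \sum_{i,j,k,l=0}^{n-1} w_i\, w_j\, w_k\, w_l\, f(x_l, x_k, x_j, x_i), \qquad n = \text{numQuadNodes}.$$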

Edit - More realistic code: As you can see, tensorGauss is already much faster than nquad (0.07 s vs. 20.86 s on my machine), but I would really like some ways to make tensorGauss faster still, as I will have to compute a ton of tensorGauss evaluations!

import numpy as np
import numpy.linalg as LA
from scipy.integrate import nquad
import time

##################################################
# Triangle vertices
##################################################
v_a_1 = np.array([[4,0,0]]).T
v_a_2 = np.array([[5,1,0]]).T
v_a_3 = np.array([[4,2,0]]).T

v_b_1 = np.array([[4,0,0]]).T
v_b_2 = np.array([[5,-1,0]]).T
v_b_3 = np.array([[4,-2,0]]).T

##################################################
# g_tau
##################################################
def g_tau():
    J_tau = v_a_2-v_a_1
    J_tau = np.append(J_tau, v_a_3-v_a_2,axis=1)
    G = np.dot(J_tau.T,J_tau)
    return np.sqrt(LA.det(G))

##################################################
# g_t
##################################################
def g_t():
    J_t = v_b_2-v_b_1
    J_t = np.append(J_t, v_b_3-v_b_2,axis=1)
    G = np.dot(J_t.T,J_t)
    return np.sqrt(LA.det(G))

##################################################
# chi_tau
##################################################
def chi_tau(x):
    return v_a_1 + (v_a_2-v_a_1)*x[0] + (v_a_3-v_a_2)*x[1]

##################################################
# chi_t
##################################################
def chi_t(y):
    return v_b_1 + (v_b_2-v_b_1)*y[0] + (v_b_3-v_b_2)*y[1]

##################################################
# k_
##################################################
def k_(x,y):
    return LA.norm(x+y)

##################################################
# k
##################################################
def k(x,y):
    return k_(chi_tau(x),chi_t(y))*g_tau()*g_t()

##################################################
# tensorGauss
##################################################
x = [0.04691008, 0.23076534, 0.5,        0.76923466, 0.95308992]
w = [0.11846344, 0.23931434, 0.28444444, 0.23931434, 0.11846344]
numQuadNodes = 5

def f(z, y, x, w):
    # Note: the parameters x and w shadow the global node/weight lists
    # inside this function only.
    a_1_1 = z
    a_1_2 = z * w
    a_2_1 = z * x
    a_2_2 = z * x * y

    a_1 = np.array([a_1_1, a_1_2]).T
    a_2 = np.array([a_2_1, a_2_2]).T
    res = k(a_1, a_2)

    a_1_1 = z * x
    a_1_2 = z * x * y
    a_2_1 = z
    a_2_2 = z * w

    a_1 = np.array([a_1_1, a_1_2]).T
    a_2 = np.array([a_2_1, a_2_2]).T
    res += k(a_1, a_2)

    a_1_1 = z * y
    a_1_2 = z * w
    a_2_1 = z * x
    a_2_2 = z

    a_1 = np.array([a_1_1, a_1_2]).T
    a_2 = np.array([a_2_1, a_2_2]).T
    res += k(a_1, a_2)

    return res

def tensorGauss(func):
    # The loop variable k shadows the function k() only inside this scope;
    # func still resolves k() globally, so the kernel is unaffected.
    total = 0
    for i in range(numQuadNodes):
        for j in range(numQuadNodes):
            for k in range(numQuadNodes):
                for l in range(numQuadNodes):
                    total += w[i]*w[j]*w[k]*w[l]*func(x[l], x[k], x[j], x[i])

    return total

start=time.time()
tensorGauss_res = tensorGauss(f)
end=time.time()
tensorGauss_time = end-start


start=time.time()
nquad_res, err = nquad(f, [[0,1], [0,1], [0,1], [0,1]])
end=time.time()
nquad_time = end-start

print(f'tensor-gauss: {tensorGauss_res}')
print(f'nquad:        {nquad_res}')
print('\n')
print(f'tensor-gauss time: {tensorGauss_time}')
print(f'nquad time:        {nquad_time}')
– ManUtdBloke

Comments:
    Is it intentional that there are a lot of duplicate cases in there? Permutations of `i`, `j`, `k`, `l` all yield the same partial result. Is `f` costly in your "real" code, or traversing the data? – MisterMiyagi Feb 26 '19 at 15:34
  • I agree. If you can tell us what exactly you want to calculate in tensorGauss(), we may be able to help you. – Youcef Benyettou Feb 26 '19 at 15:37
  • Looks quite similar to this: https://stackoverflow.com/a/48811580/4045774 Is the number of integration points always 5? – max9111 Feb 26 '19 at 15:54
  • @MisterMiyagi I defined the function f in the above code as simply as possible just for the sake of example. The real f's I have to deal with are much more awkward. I will edit in some new code to show the type of f's I am dealing with. – ManUtdBloke Feb 26 '19 at 16:01
  • @max9111 The number of integration points will not always be 5; it could be anywhere from about 2 to 50 in practice, I imagine. – ManUtdBloke Feb 26 '19 at 16:02
  • @YoucefBenyettou I have edited in a true example of what the f function will generally look like. – ManUtdBloke Feb 26 '19 at 16:10
    It looks like all your code is eligible for Cython. You may want to give Cython's parallelism a try: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html – MisterMiyagi Feb 26 '19 at 16:16
  • So in a real-world example the v_a_1 and v_b_1 arrays are of shape (3,n), where n is quite big, and you want to do the integration on all triangles? – max9111 Feb 26 '19 at 16:40
  • @max9111 Yes, however, the integrands (not the dummy one above) are singular, so they need to be regularized. When they are regularized, I end up integrating on the domain (0,1)^4 as in the above post. – ManUtdBloke Feb 27 '19 at 08:07

1 Answer


I rewrote your tensorGauss() function as below:

def tensorGauss(func):
    # Build 4D grids of the weights and nodes, then evaluate the whole sum
    # in one vectorized expression. (Note: this calls func, not the global f,
    # which the original posting accidentally did.)
    w_gen = np.meshgrid(w, w, w, w, indexing='ij')
    x_gen = np.meshgrid(x, x, x, x, indexing='ij')
    total = np.sum(w_gen[0] * w_gen[1] * w_gen[2] * w_gen[3] *
                   func(x_gen[3], x_gen[2], x_gen[1], x_gen[0]))

    return total

and it printed a result of 2.0, as opposed to the value of 1.9999999999999971 printed by the simplified tensorGauss that you posted (the one that uses the simplified f() function).
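
A small note on the design choice, under the same vectorization assumption: the product of the four meshgrid weight arrays is just the 4-fold outer product of w, so the weight tensor can equivalently be built with np.einsum. A sketch that should be equivalent to the code above (the names w_arr, x_arr, W and X are just for illustration):

import numpy as np

w_arr = np.asarray(w)
x_arr = np.asarray(x)

# W[i,j,k,l] = w[i]*w[j]*w[k]*w[l]: the 4-fold outer product of the weights
W = np.einsum('i,j,k,l->ijkl', w_arr, w_arr, w_arr, w_arr)
X = np.meshgrid(x_arr, x_arr, x_arr, x_arr, indexing='ij')
total = np.sum(W * f(X[3], X[2], X[1], X[0]))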

However, some disclaimers:

  1. Whether this would even run without throwing errors depends on your real code (the stuff that's got names like g_tau, etc.), because this solution assumes that your f is a vectorized function that will work element-wise if arrays are passed instead of scalars. I can see that this assumption holds for your dummy f(), but I don't know whether it will also hold for your real f(); if it doesn't, see the Numba sketch below.
  2. Whether you get any performance benefit from this solution is also best checked against your real code and your real data size, rather than the dummy f.
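
And since your question asks for a range of approaches: if your real f() turns out not to vectorize, a JIT compiler such as Numba lets you keep the plain nested loops. Below is a minimal sketch of the simplified case only (assuming numba is installed; the names f_simple and tensorGauss_numba are just for illustration, and whether your real k()/chi_tau() helpers would compile under @njit is something you'd have to check):

import numpy as np
from numba import njit

# 5-point Gauss-Legendre nodes and weights on [0, 1], as in the question
x_n = np.array([0.04691008, 0.23076534, 0.5, 0.76923466, 0.95308992])
w_n = np.array([0.11846344, 0.23931434, 0.28444444, 0.23931434, 0.11846344])

@njit
def f_simple(a, b, c, d):
    return a + b + c + d

@njit
def tensorGauss_numba(x, w):
    # Same nested-loop sum as the original, compiled to machine code
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    total += w[i] * w[j] * w[k] * w[l] * f_simple(x[l], x[k], x[j], x[i])
    return total

print(tensorGauss_numba(x_n, w_n))  # ~2.0 for the simplified f

The first call pays a one-off compilation cost; repeated calls run the compiled loop, which matters here since you plan to evaluate tensorGauss many times.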
– fountainhead