
I would like to speed up the following code related to spherical modes. It is a simplification of my actual code (I didn't want to oversimplify it, because that can lead to solutions that are not valid for my actual problem):

import numpy as np
import time
import math

def function_call(npp,nmax):
    matrix_a = np.random.rand(npp)
    matrix_b = np.random.rand(npp)
    a=np.random.rand()
    F = np.zeros((2*npp, 2*nmax*(nmax+2)), dtype=np.complex128)
    npa=np.arange(npp)
    for n in range(1,nmax+1,1):
        a_n = np.sqrt(1 / (2 * np.pi * n * (n + 1)))
        for m in range(-n,n+1,1):        
            b_m = (-1)**((np.abs(m) + m) / 2)
            p_mn = int(1 / 2 * (np.abs(m) + n + 1 / 2 * (1 - (-1)**(np.abs(m) + n))))
            alpha_mn = np.sqrt(((2 * n + 1) / 2) * math.factorial(n - np.abs(m)) / math.factorial(n + np.abs(m)))
            A_mn = np.zeros(npp)
            B_mn = np.zeros(npp)
            for p in range(p_mn,n+1,1):
                Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p)/math.factorial(2 * p - np.abs(m) - n)
                A_mn = A_mn + Cai_pmn * (np.cos(matrix_a))**(2 * p - np.abs(m) - n)
                B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (np.cos(matrix_a))**(np.abs(2 * p - np.abs(m) - n - 1))
            A_mn = A_mn / (2**n * math.factorial(n))
            B_mn = B_mn / (2**n * math.factorial(n))
            S_mn = alpha_mn * m * A_mn * np.sin(matrix_a)**np.abs(np.abs(m) - 1)
            D_mn = alpha_mn * (np.abs(m) * A_mn * np.cos(matrix_a) * (np.sin(matrix_a))**(np.abs(np.abs(m) - 1)) - B_mn * (np.sin(matrix_a))**(np.abs(m) + 1))
            h1 = 1j**(n+1)*np.exp(-1j*a)/(a)
            h2 = 1j**(n)*np.exp(-1j*a)/(a)
            F_s1_theta = 1j * a_n * b_m * h1 * (S_mn * np.exp(1j * m * matrix_b))
            F_s1_phi =       -a_n * b_m * h1 * (D_mn * np.exp(1j * m * matrix_b))
            F_s2_theta =      a_n * b_m * h2 * (D_mn * np.exp(1j * m * matrix_b))
            F_s2_phi =   1j * a_n * b_m * h2 * (S_mn * np.exp(1j * m * matrix_b))
            j = 2 * (n * (n + 1) + m - 1)
            F[2 * npa, j] = F_s1_theta[npa]
            F[2 * npa+1  , j] = F_s1_phi[npa]
            j = 2 * (n * (n + 1) + m - 1) + 1
            F[2 * npa, j] = F_s2_theta[npa]
            F[2 * npa+1, j] = F_s2_phi[npa]               
prev_time_ep = time.time()
npp = 500
nmax = 80
function_call(npp, nmax)
print("       --- %s seconds ---" % (time.time() - prev_time_ep))

The first option I tried was to vectorize it (that took me some time, because it is not obvious). However, memory consumption grows rapidly, and it is not efficient.

I have also tried Numba, and I actually managed to reduce the previous time by a factor of 4 (see the sketch below), but I was looking for larger improvements if possible.

I have also read that multiprocessing or Cython could be good options. Maybe there is a way to vectorize it without memory usage growing rapidly?
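For reference, this is the direction my Numba attempt took: a minimal, loop-only sketch of the inner computation (the helper names are mine, and the hand-rolled factorial returns a float because values such as (2n)! for n = 80 overflow int64):

import numpy as np
from numba import njit

@njit(cache=True)
def nb_factorial(k):
    # math.factorial is not supported in nopython mode, so compute
    # the factorial with a plain loop, in float64.
    out = 1.0
    for i in range(2, k + 1):
        out *= i
    return out

@njit(cache=True)
def legendre_sums(cos_a, n, m, p_mn):
    # Loop-only version of the inner p-loop: builds A_mn and B_mn
    # for one (n, m) pair, mirroring the original code.
    npp = cos_a.shape[0]
    A_mn = np.zeros(npp)
    B_mn = np.zeros(npp)
    am = abs(m)
    for p in range(p_mn, n + 1):
        Cai_pmn = (nb_factorial(n) * (-1.0)**(n + p)
                   / (nb_factorial(p) * nb_factorial(n - p))
                   * nb_factorial(2 * p) / nb_factorial(2 * p - am - n))
        e = 2 * p - am - n
        for i in range(npp):
            A_mn[i] += Cai_pmn * cos_a[i]**e
            B_mn[i] += e * Cai_pmn * cos_a[i]**abs(e - 1)
    norm = 2.0**n * nb_factorial(n)
    return A_mn / norm, B_mn / norm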

Chinez
  • There are a lot of optimization possibilities. 1) You are calculating the very same results multiple times, e.g. `np.cos(matrix_a)` and some others. 2) In Numba you have to implement factorial yourself: https://stackoverflow.com/a/53320252/4045774. 3) You also have some vectorized commands which are hard to optimize for Numba or Cython. Have you also written a simple solution (only loops, no vectorized commands)? That would be the easiest starting point for optimization and a Numba/Cython implementation. – max9111 Dec 30 '20 at 12:35

3 Answers


I've worked a little on your code; here is the benchmark. The bottleneck is in the factorial computation.

================== PerfTool ==================
task                     |aver(s) |sum(s)  |count   |std     
main loop                |   0.134|  10.712|      80|   0.101
  +-second loop          |   0.134|  10.712|      80|   0.101
    +-A                  |   0.000|   0.245|    6560|   0.000
    +-B                  |   0.001|   5.648|    6560|   0.001
    +-C                  |   0.000|   0.541|    6560|   0.000
    +-D                  |   0.000|   1.505|    6560|   0.000
    +-E                  |   0.000|   1.769|    6560|   0.000
    +-F                  |   0.000|   0.867|    6560|   0.000
mx creation              |   0.000|   0.000|       1|   0.000
preparation              |   0.000|   0.000|       1|   0.000

overall                  |    0.03|   10.71|   39522|-       

The B and C sentinels are:

with PerfTool('B'):
    for p in range(p_mn, n+1, 1):
        Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p) / math.factorial(2 * p - np.abs(m) - n)
        A_mn = A_mn + Cai_pmn * (np.cos(matrix_a))**(2 * p - np.abs(m) - n)
        B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (np.cos(matrix_a))**(np.abs(2 * p - np.abs(m) - n - 1))

with PerfTool('C'):
    A_mn = A_mn / (2**n * math.factorial(n))
    B_mn = B_mn / (2**n * math.factorial(n))

As you can see, most of the time is spent in B, so I've added a kind of cache:

from scipy.special import factorial  # presumably scipy's vectorized factorial, which works on arrays

rng = np.arange(1, nmax+1, 1)
cache = dict(zip(rng, factorial(rng)))

def get_factorial(w, cache=cache):
    if w not in cache:
        cache[w] = math.factorial(w)
    return cache[w]

Use this instead of math.factorial; it avoids recalculating the same values.
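A slightly more idiomatic alternative for the cache (a sketch; the linked code uses the dict above) is the standard library's functools.lru_cache:

from functools import lru_cache
import math

@lru_cache(maxsize=None)
def get_factorial(w):
    # Memoized factorial: each distinct w is computed exactly once.
    return math.factorial(w)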

Finally, B was refactored as B_vec, and here lies the root of the evil! I've marked that code as B_vec_slow; those two lines take up most of the time.

with PerfTool('B_vec'):
    prng = np.arange(p_mn, n+1)
    Cai_pmn_vec = get_factorial(n) * ((-1)**(n + prng)) / (factorial(prng) * factorial(n - prng)) * factorial(2 * prng) / factorial(2 * prng - np.abs(m) - n)
    with PerfTool('B_vec_slow'):
        A_mn_vec = Cai_pmn_vec * np.power(cos_matrix_a[:, np.newaxis], 2 * prng - np.abs(m) - n)
        B_mn_vec = (2 * prng - np.abs(m) - n) * Cai_pmn_vec * np.power(cos_matrix_a[:, np.newaxis], np.abs(2 * prng - np.abs(m) - n - 1))
    A_mn = np.sum(A_mn_vec, axis=1)
    B_mn = np.sum(B_mn_vec, axis=1)

This is the result:

================== PerfTool ==================
task                     |aver(s) |sum(s)  |count   |std     
main loop                |   0.072|   5.736|      80|   0.052
  +-second loop          |   0.072|   5.735|      80|   0.052
    +-A                  |   0.000|   0.194|    6560|   0.000
    +-B_vec              |   0.001|   3.490|    6560|   0.000
      +-B_vec_slow       |   0.000|   2.987|    6560|   0.000
    +-C                  |   0.000|   0.126|    6560|   0.000
    +-D                  |   0.000|   0.536|    6560|   0.000
    +-E                  |   0.000|   0.768|    6560|   0.000
    +-F                  |   0.000|   0.522|    6560|   0.000
preparation              |   0.000|   0.000|       1|   0.000
mx creation              |   0.000|   0.000|       1|   0.000

overall                  |    0.01|    5.74|   46082|-  

If you can work on these two lines, you can expect the whole computation to run in 2-3 seconds.
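One possible attack on those two lines (a sketch of an idea, not code from the linked pile, and the function name is made up): the exponents 2 * p - np.abs(m) - n grow by 2 as p runs from p_mn to n, so each column of the power matrix is the previous column times cos(matrix_a)**2, and the broadcast np.power call can be replaced by a cheap recurrence:

import numpy as np

def powers_by_recurrence(cos_a, e0, count):
    # Column k holds cos_a**(e0 + 2*k). Each column is the previous
    # one multiplied by cos_a**2, which avoids evaluating np.power
    # over the full (npp, count) exponent matrix.
    out = np.empty((cos_a.size, count))
    out[:, 0] = cos_a**e0
    step = cos_a * cos_a
    for k in range(1, count):
        out[:, k] = out[:, k - 1] * step
    return out

With e0 = 2 * p_mn - np.abs(m) - n and count = n - p_mn + 1 this reproduces the power matrix of A_mn_vec; B_mn_vec works the same way, apart from the np.abs(... - 1) in the exponent of its first column.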

Here is the optimized code: https://www.codepile.net/pile/8oDyGp6Q

Glauco
  • Very interesting! I still get a slightly faster solution with Numba, but this is very promising. Three comments: 1) I have seen that the gamma function is available as a ufunc (array programming) and could be used instead of the factorial. Do you think it could be of any help? 2) In your code I get some divide-by-zero warnings at `/ (2**n * fact(n))`; I guess the values grow larger than can be handled for very large n. I am wondering why this does not happen in my code. 3) In perf_tool I have to quit the console if I want to run it twice. Is this normal? – user536696 Dec 30 '20 at 10:24
  • 1) Yes, all numpy functions can work on arrays. 2) Maybe; probably something can be solved using different dtypes (I haven't tested the results of my code, so it is intended to indicate the correct way). 3) You can use PerfTool.empty() at the beginning to reset the counters. Good luck! – Glauco Dec 30 '20 at 13:56
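On point 2) above: one way to sidestep the overflow in the factorial ratios is to work in log space with scipy.special.gammaln, which is a ufunc. A sketch of the idea (not code from either post):

import numpy as np
from scipy.special import gammaln

def alpha_mn_stable(n, m):
    # alpha_mn = sqrt(((2n+1)/2) * (n-|m|)! / (n+|m|)!), evaluated in
    # log space; gammaln(k + 1) == log(k!), so large n cannot overflow.
    log_alpha = 0.5 * (np.log((2 * n + 1) / 2)
                       + gammaln(n - abs(m) + 1)
                       - gammaln(n + abs(m) + 1))
    return np.exp(log_alpha)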

I'm not sure which loops are faster than others in Python, but as far as I can see in your code, there are some duplicated function calls (same arguments).

For instance:

A_mn = A_mn / (2**n * math.factorial(n))
B_mn = B_mn / (2**n * math.factorial(n)) 

Both assignments share the same denominator. Calculate the factorial first, save it in a variable, and use that as the denominator.

Also, the np.exp() function is called multiple times with the same arguments:

np.exp(-1j*a)/(a)
np.exp(1j * m * matrix_b)

NumPy's exponential function can be a little slow.

The same applies to all the np.sin(matrix_a) / np.cos(matrix_a) calls.

These changes won't give you a huge improvement, but it's something:

def function_call_2(npp,nmax):
    matrix_a = np.random.rand(npp)
    matrix_b = np.random.rand(npp)
    a=np.random.rand()
    F = np.zeros((2*npp, 2*nmax*(nmax+2)), dtype=np.complex128)
    npa=np.arange(npp)

    # Auxiliary variables
    sin_matrix_a = np.sin(matrix_a)
    cos_matrix_a = np.cos(matrix_a)

    for n in range(1,nmax+1,1):

        # Auxiliary variables
        denominator = (2**n * math.factorial(n))

        a_n = np.sqrt(1 / (2 * np.pi * n * (n + 1)))
        for m in range(-n,n+1,1):        
            b_m = (-1)**((np.abs(m) + m) / 2)
            p_mn = int(1 / 2 * (np.abs(m) + n + 1 / 2 * (1 - (-1)**(np.abs(m) + n))))
            alpha_mn = np.sqrt(((2 * n + 1) / 2) * math.factorial(n - np.abs(m)) / math.factorial(n + np.abs(m)))
            A_mn = np.zeros(npp)
            B_mn = np.zeros(npp)
            for p in range(p_mn,n+1,1):
                Cai_pmn = math.factorial(n) * ((-1)**(n + p)) / (math.factorial(p) * math.factorial(n - p)) * math.factorial(2 * p)/math.factorial(2 * p - np.abs(m) - n)
                A_mn = A_mn + Cai_pmn * (cos_matrix_a)**(2 * p - np.abs(m) - n)
                B_mn = B_mn + (2 * p - np.abs(m) - n) * Cai_pmn * (cos_matrix_a)**(np.abs(2 * p - np.abs(m) - n - 1))
            A_mn = A_mn / denominator
            B_mn = B_mn / denominator
            S_mn = alpha_mn * m * A_mn * sin_matrix_a**np.abs(np.abs(m) - 1)
            D_mn = alpha_mn * (np.abs(m) * A_mn * cos_matrix_a * (sin_matrix_a)**(np.abs(np.abs(m) - 1)) - B_mn * (sin_matrix_a)**(np.abs(m) + 1))

            # Auxiliary variables
            np_exp_1 = np.exp(-1j*a)/(a)
            np_exp_2 = np.exp(1j * m * matrix_b)

            h1 = 1j**(n+1)*np_exp_1
            h2 = 1j**(n)*np_exp_1

            F_s1_theta = 1j * a_n * b_m * h1 * (S_mn * np_exp_2)
            F_s1_phi =       -a_n * b_m * h1 * (D_mn * np_exp_2)
            F_s2_theta =      a_n * b_m * h2 * (D_mn * np_exp_2)
            F_s2_phi =   1j * a_n * b_m * h2 * (S_mn * np_exp_2)
            j = 2 * (n * (n + 1) + m - 1)
            F[2 * npa, j] = F_s1_theta[npa]
            F[2 * npa+1  , j] = F_s1_phi[npa]
            j = 2 * (n * (n + 1) + m - 1) + 1
            F[2 * npa, j] = F_s2_theta[npa]
            F[2 * npa+1, j] = F_s2_phi[npa] 

I ran a simple test and got an execution roughly 3 seconds faster:

prev_time_ep = time.time()
npp = 500
nmax = 80
function_call(npp, nmax)
print("--- %s seconds ---" % (time.time() - prev_time_ep))

prev_time_ep = time.time()
function_call_2(npp, nmax)
print("--- %s seconds ---" % (time.time() - prev_time_ep))

Output:

--- 16.99950885772705 seconds ---
--- 14.231853580474854 seconds ---
  • Yes, there are some repeated function calls consuming resources; that is a good point. However, I was looking for a method that reduces the time by orders of magnitude, if possible. – user536696 Dec 29 '20 at 10:24

First of all, your code has O(n^3) complexity, which in itself is not a big deal.

A lot of this code can be done using array programming, which will speed it up a lot, especially with numpy.

I suggest using a profiler to find the poorly performing code and then rewriting it in vectorial form, outside the loops.
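For example, a minimal profiling run with the standard library's cProfile (assuming function_call from the question is in scope):

import cProfile
import pstats

# Profile one call and list the ten most expensive entries by
# cumulative time, to see where vectorization pays off first.
cProfile.run("function_call(500, 80)", "function_call.prof")
pstats.Stats("function_call.prof").sort_stats("cumulative").print_stats(10)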

I wrote a tool, perf_tools, that is useful for this kind of work. It can guide you toward a performance-driven solution.

Glauco
  • Using array programming is what I meant by vectorizing. However, it becomes very memory-consuming when I compute it all at once without loops, especially the A_mn and B_mn computation. Maybe I vectorized it in an inefficient way. – user536696 Dec 29 '20 at 10:26