
I have code that runs very quickly in MATLAB (within a few seconds). I had to port it to Python to make it more open-source friendly. The code contains the dreaded slow nested 'for' loop with a cascaded update, and the Python version takes about 45 minutes to run the same computation. I have not been able to speed up these loops by rearranging the calculations, and introducing multiprocessing has not helped either, as I only have a Ryzen 5 laptop. I am posting the snippet where the code spends most of its time. Can anyone please suggest an optimization technique to speed this up?

Python code:

from scipy import signal
from matplotlib import *
from numpy import zeros, ones, multiply, array, arange
import time

fs = 16000
npoints = 16000
Ntem = 30

# create a log-sine-sweep
f0 = 10                       # sweep start frequency
f1 = fs/2                     # sweep end frequency
t1 = arange(npoints)/fs       # sample times
stimulus = signal.chirp(t1, f0, t1[-1], f1, method='logarithmic', phi=-90)


W0 = zeros((Ntem, npoints))    
W1 = zeros((Ntem, npoints))    
BM = zeros((Ntem, npoints))    
BM_d = zeros((Ntem, npoints))

a0rf = array([-0.721112,-0.532501,-0.335426,-0.143986,0.0333392,0.192211,0.331123,
0.45037,0.551292,0.635753,0.705805,0.763481,0.810679,0.849101,0.880237,0.905366,
0.925569,0.941754,0.954673,0.964947,0.973084,0.979502,0.984539,0.988471,0.99152,
0.993866,0.995654,0.997002,0.998002,0.998729])

c0rf = array([0.692818,0.84643,0.942067,0.98958,0.999444,0.981354,0.943588,0.892842,
0.834312,0.771893,0.708406,0.64583,0.585491,0.528231,0.474535,0.424633,0.378578,
0.336301,0.297656,0.262447,0.230451,0.201436,0.175166,0.151413,0.129957,0.110592,
0.0931255,0.0773789,0.0631885,0.0504047])

gf = array([0.879889,0.813389,0.75372,0.703499,0.662554,0.629685,0.603486,0.582661,
0.566118,0.552971,0.542517,0.534197,0.527574,0.522299,0.518099,0.514757,0.512099,
0.509988,0.508315,0.50699,0.505945,0.505123,0.504478,0.503976,0.503586,0.503286,
0.503056,0.502883,0.502754,0.50266])

ghf = array([0.692818,0.84643,0.942067,0.98958,0.999444,0.981354,0.943588,0.892842,
0.834312,0.771893,0.708406,0.64583,0.585491,0.528231,0.474535,0.424633,0.378578,
0.336301,0.297656,0.262447,0.230451,0.201436,0.175166,0.151413,0.129957,0.110592,
0.0931255,0.0773789,0.0631885,0.0504047])

start_time = time.time()

for t in range(npoints-1):
    for s in range(Ntem):
        if (s == 0):
            stim = stimulus[t]      # first section is driven by the input signal
        else:
            stim = BM[s-1, t]       # later sections are driven by the previous section's output
        W0[s, t] = stim + a0rf[s]*W0[s, t-1] - c0rf[s]*W1[s, t-1]
        W1[s, t] = a0rf[s]*W1[s, t-1] + c0rf[s]*W0[s, t-1]
        BM[s, t] = gf[s]*stim + ghf[s]*W1[s, t]
        BM_d[s, t] = BM[s, t] - BM[s, t-1]

elapsed_time = time.time() - start_time

print('time cost = ',elapsed_time)

This code takes 2.8142993450164795 secs

MATLAB code:

fs = 16000.0      ;            % sample frequency
npoints = 16000    ;        % stimulus length
Ntem = 30;

f0 = 10         ;              % sweep start frequency
f1 = fs/2       ;              % sweep end frequency
t1 = 1:npoints ;
t1= t1./fs   ;    % sample times
gain = 0.1;
stimulus = gain*chirp(t1, f0, t1(npoints), f1, 'logarithmic', -90);
stimulus = (stimulus); 

W0 = zeros(Ntem,npoints);                         
W1 = zeros(Ntem,npoints);                          
BM = zeros(Ntem,npoints);                    
BM_d = zeros(Ntem,npoints); 
stim=0;

a0rf = [-0.721112 -0.532501 -0.335426 -0.143986 0.0333392 0.192211 0.331123 ...
0.45037 0.551292 0.635753 0.705805 0.763481 0.810679 0.849101 0.880237 ...
0.905366 0.925569 0.941754 0.954673 0.964947 0.973084 0.979502 0.984539 ...
0.988471 0.99152 0.993866 0.995654 0.997002 0.998002 0.998729];

c0rf = [0.692818 0.84643 0.942067 0.98958 0.999444 0.981354 0.943588 0.892842 ...
0.834312 0.771893 0.708406 0.64583 0.585491 0.528231 0.474535 0.424633 0.378578 ...
0.336301 0.297656 0.262447 0.230451 0.201436 0.175166 0.151413 0.129957 0.110592 ...
0.0931255 0.0773789 0.0631885 0.0504047];

gf = [0.879889 0.813389 0.75372 0.703499 0.662554 0.629685 0.603486 0.582661 ...
0.566118 0.552971 0.542517 0.534197 0.527574 0.522299 0.518099 0.514757 0.512099 ...
0.509988 0.508315 0.50699 0.505945 0.505123 0.504478 0.503976 0.503586 0.503286 ...
0.503056 0.502883 0.502754 0.50266];

ghf = [0.692818 0.84643 0.942067 0.98958 0.999444 0.981354 0.943588 0.892842 ...
0.834312 0.771893 0.708406 0.64583 0.585491 0.528231 0.474535 0.424633 0.378578 ...
0.336301 0.297656 0.262447 0.230451 0.201436 0.175166 0.151413 0.129957 0.110592 ...
0.0931255 0.0773789 0.0631885 0.0504047];

tic
for t=2:(npoints)
    for s=1:(Ntem)
        if (s==1)
            stim = stimulus(t);
        else
            stim = BM(s-1,t);
        end
        W0(s,t) = stim + a0rf(s)*W0(s,t-1) - c0rf(s)*W1(s,t-1);
        W1(s,t) = a0rf(s)*W1(s,t-1) + c0rf(s)*W0(s,t-1);
        BM(s,t) = gf(s)*stim + ghf(s)*W1(s,t);
        BM_d(s,t) = BM(s,t) - BM(s,t-1);
    end
    
end
toc

This code takes 0.091413 secs

In the snippets above, the stimulus is a single fixed input. In my original code there is an array of 160 stimulus inputs, which multiplies the runtime of both versions and compounds the slowdown of the Python code even further.
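
To make the scaling concrete, the outer loop over the stimuli has roughly the shape sketched below. This is only an illustrative sketch, not my actual code: `stimuli` is random placeholder data, the per-stimulus output storage in my real code is omitted, and the arrays from the snippet above are reused.

# Illustrative sketch only: 160 stimuli x 16000 samples x 30 sections
# is roughly 77 million inner-loop iterations in total.
from numpy.random import randn

nstim = 160
stimuli = randn(nstim, npoints)        # placeholder for the 160 real inputs

for k in range(nstim):                 # extra outer loop over the stimulus inputs
    stimulus_k = stimuli[k]
    for t in range(npoints-1):         # same nested loops as in the snippet above
        for s in range(Ntem):
            stim = stimulus_k[t] if s == 0 else BM[s-1, t]
            W0[s, t] = stim + a0rf[s]*W0[s, t-1] - c0rf[s]*W1[s, t-1]
            W1[s, t] = a0rf[s]*W1[s, t-1] + c0rf[s]*W0[s, t-1]
            BM[s, t] = gf[s]*stim + ghf[s]*W1[s, t]
            BM_d[s, t] = BM[s, t] - BM[s, t-1]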

  • I am not really sure why this exact code runs so much faster in MATLAB. It should be about the same. Nested for loops scale horribly. Can you figure out a method to remove one of these? – Justin Oberle Apr 28 '21 at 11:44
  • The above code has one more loop of 160 iterations, where the stimulus input changes every time. That makes it 160×16000×30 iterations, which runs in about 20 seconds in MATLAB but takes around 45 minutes in Python. I am unable to remove any code here; this is actually a more efficient version of the original code. – Abhishek Nair Apr 28 '21 at 13:36
  • Can you post the complete reproducible code? Meaning, everything I need to run it. Your comment suggests 3 for loops, 2 nested, but this code shows 2 for loops, 1 nested. – Justin Oberle Apr 28 '21 at 13:57
  • Hi Justin, I have added both reproducible codes here. – Abhishek Nair Apr 29 '21 at 04:15
  • Are you sure the two versions run the same? In the Python code, the `else` does not end until the last line of the for loop, whereas in the MATLAB case the `else` ends immediately after the next line. – YiFei Apr 29 '21 at 04:20
  • Sorry about the indentation. I have updated the code now to make it the same for both. – Abhishek Nair Apr 29 '21 at 04:26
  • Is the loop range for t correct? It looks like it will cause negative indexes when t==0. – Ture Pålsson Apr 29 '21 at 05:16
  • That won't matter: index -1 in Python takes the last array value, which for the first iteration is 0, so it doesn't affect the output. We could start from 1, just as the MATLAB version starts from 2. – Abhishek Nair Apr 29 '21 at 06:30
  • Can you place a counter in each for loop in the MATLAB code and also in the Python code, just to verify both are running the same number of loops? If this is true, then the issue is not related to this. I think you should check this before anything else, though. Create a counter1 for the first for loop and a counter2 for the second, and += 1 on each iteration. Do this for both codes. – Justin Oberle Apr 29 '21 at 12:44
  • @JustinOberle I have done the counting for both versions of the code. They count the same number of loops: total inner loop count = 479971, outer loop count = 16000. In fact, the outputs produced by both sets of code are the same, so there is no doubt that these codes execute the same way. – Abhishek Nair May 02 '21 at 08:02
  • You have 4 computationally heavy statements in each loop. Each one takes about 0.8 seconds on my computer, totaling about 3.2 seconds for the completed nested loops to run. The loops are identical in how many times they run, but the way these computations/insertions are done on a NumPy array is much slower than the way MATLAB does it. I suggest reading the answer at this link. It explains what is going on under the hood for NumPy and how to vectorize your equation instead of using a loop. https://stackoverflow.com/questions/18516605/difference-on-performance-between-numpy-and-matlab – Justin Oberle May 03 '21 at 13:14
  • [This](https://stackoverflow.com/questions/4407984/is-it-possible-to-vectorize-recursive-calculation-of-a-numpy-array-where-each-el) does not look very encouraging. :( – Ture Pålsson May 03 '21 at 18:23

1 Answer


You need to use optimized NumPy vectorization. Below is an example that shows how much it improves runtime. Your equations use t-1 instead of t, which makes them difficult to vectorize with this method, so the example below uses t simply to show how much faster it is. I will let you figure out how to adapt it to your exact equations, since I do not know the details behind them. For a more detailed explanation of why this is faster, see this link: Difference on performance between numpy and matlab

from scipy import signal
from matplotlib import *
from numpy import zeros, ones, multiply, array, arange, roll
import time

fs = 16000
npoints = 16000
Ntem = 30

# create a log-sine-sweep
f0 = 10                       # sweep start frequency
f1 = fs/2                     # sweep end frequency
t1 = arange(npoints)/fs       # sample times
stimulus = signal.chirp(t1, f0, t1[-1], f1, method='logarithmic', phi=-90)


W0 = zeros((Ntem, npoints))  
W1 = zeros((Ntem, npoints))    
BM = zeros((Ntem, npoints))    
BM_d = zeros((Ntem, npoints))

a0rf = array([-0.721112,-0.532501,-0.335426,-0.143986,0.0333392,0.192211,0.331123,
0.45037,0.551292,0.635753,0.705805,0.763481,0.810679,0.849101,0.880237,0.905366,
0.925569,0.941754,0.954673,0.964947,0.973084,0.979502,0.984539,0.988471,0.99152,
0.993866,0.995654,0.997002,0.998002,0.998729])

c0rf = array([0.692818,0.84643,0.942067,0.98958,0.999444,0.981354,0.943588,0.892842,
0.834312,0.771893,0.708406,0.64583,0.585491,0.528231,0.474535,0.424633,0.378578,
0.336301,0.297656,0.262447,0.230451,0.201436,0.175166,0.151413,0.129957,0.110592,
0.0931255,0.0773789,0.0631885,0.0504047])

gf = array([0.879889,0.813389,0.75372,0.703499,0.662554,0.629685,0.603486,0.582661,
0.566118,0.552971,0.542517,0.534197,0.527574,0.522299,0.518099,0.514757,0.512099,
0.509988,0.508315,0.50699,0.505945,0.505123,0.504478,0.503976,0.503586,0.503286,
0.503056,0.502883,0.502754,0.50266])

ghf = array([0.692818,0.84643,0.942067,0.98958,0.999444,0.981354,0.943588,0.892842,
0.834312,0.771893,0.708406,0.64583,0.585491,0.528231,0.474535,0.424633,0.378578,
0.336301,0.297656,0.262447,0.230451,0.201436,0.175166,0.151413,0.129957,0.110592,
0.0931255,0.0773789,0.0631885,0.0504047])

W0test = zeros((Ntem, npoints))    
W1test = zeros((Ntem, npoints))    

start_time1 = time.time()
prevW0 = 0
prevW1 = 0
for t in range(1, npoints-1):
    for s in range(Ntem):
        if (s == 0):
            stim = stimulus[t]
        else:
            stim = BM[s-1, t]
        W0[s, t] = stim + a0rf[s]*W0[s, t] - c0rf[s]*W1[s, t]
        W1[s, t] = a0rf[s]*W1[s, t] + c0rf[s]*W0[s, t]
elapsed_time1 = time.time() - start_time1

# Similar computation using numpy vectorization (note: uses t instead of t-1,
# so it is not the same algorithm -- see the explanation above)
start_time2 = time.time()
for s in range(Ntem):
    W0test[s] = stimulus + a0rf[s]*W0test[s] - c0rf[s]*W1test[s]
    W1test[s] = a0rf[s]*W1test[s] + c0rf[s]*W0test[s]

elapsed_time2 = time.time() - start_time2

print('time cost for FOR loop = ', elapsed_time1)
print('time cost for optimal numpy vectorization = ', elapsed_time2)

This produces the following console output:

time cost for FOR loop =  2.5937564373016357
time cost for optimal numpy vectorization =  0.005982160568237305
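
If the t-1 recursion cannot be removed, another route worth trying is to compile the unchanged nested loop with Numba. The sketch below is only an illustration, assuming the numba package is installed; run_cascade is just an illustrative name, and it reuses the stimulus and coefficient arrays defined above.

# Sketch: JIT-compile the original recursion with Numba (assumes `pip install numba`).
import numpy as np
from numba import njit

@njit(cache=True)
def run_cascade(stimulus, a0rf, c0rf, gf, ghf):
    Ntem = a0rf.shape[0]
    npoints = stimulus.shape[0]
    W0 = np.zeros((Ntem, npoints))
    W1 = np.zeros((Ntem, npoints))
    BM = np.zeros((Ntem, npoints))
    BM_d = np.zeros((Ntem, npoints))
    for t in range(1, npoints):               # matches the MATLAB loop t = 2:npoints
        for s in range(Ntem):
            stim = stimulus[t] if s == 0 else BM[s-1, t]
            W0[s, t] = stim + a0rf[s]*W0[s, t-1] - c0rf[s]*W1[s, t-1]
            W1[s, t] = a0rf[s]*W1[s, t-1] + c0rf[s]*W0[s, t-1]
            BM[s, t] = gf[s]*stim + ghf[s]*W1[s, t]
            BM_d[s, t] = BM[s, t] - BM[s, t-1]
    return W0, W1, BM, BM_d

W0n, W1n, BMn, BM_dn = run_cascade(stimulus, a0rf, c0rf, gf, ghf)

The first call includes the compilation time; repeated calls (for example, once per stimulus) should then run the exact recursion far faster than the plain Python loop.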
answered by Justin Oberle
  • Running this code would yield different outputs for vectorization and the normal code. This changes the functionality altogether. – Abhishek Nair May 10 '21 at 03:44
  • Yes, I know. I mentioned it is not the same algorithm as your code. It is instead an example of how much faster this method is. I leave it to you to adjust your code to match this new method. The reason I leave it to you, is because I do not have any background on your algorithm. I think you can figure it out though. Good luck. – Justin Oberle May 10 '21 at 14:20