
This is my current code.

It works fine when the number of elements in the second "for loop" is low (around 10k) and takes only a few seconds, but when the number of elements in the second "for loop" is high (around 40k) it takes around 60 seconds or more: why?

For example: sometimes when the 2nd for loop has 28k elements it takes less time to execute it than when it has 7k elements. I don't understand why the time isn't linearly dependent on the number of operations.

Also, as a general rule, the longer the code runs, the bigger the loop times become.

To recap, the execution times usually follow these rules:

  • when operations < 10k , time < 5 seconds
  • when 10k < operations < 40k , 10s < time < 30s (seems random)
  • when operations > 40k , time ~ 60 seconds


import numpy as np
import time
import gc
from collections import deque
import random

center_freq = 20000000 
smpl_time = 0.03749995312*pow(10,-6) 

mat_contents = []

for i in range(10):
    mat_contents.append(np.ones((3600, random.randint(3000,30000))))

tempo = np.empty([0,0])

for i in range(3600):
    tempo = np.append(tempo, center_freq*smpl_time*i)

ILO = np.cos(tempo) 

check = 0

for element in mat_contents:
    start_time = time.time()
    Icolumn = np.empty([0,0])
    Imatrix = deque([])
    gc.disable()

    for colonna in element.T:
        Imatrix.append(np.multiply(ILO, colonna))

    gc.enable()
    varI = np.var(Imatrix)
    tempImean = np.mean(Imatrix)

    print('\nSize: ', len(element.T))
    print("--- %s seconds ---" % (time.time() - start_time))
    print("---  ", check, ' ---')
    check += 1
Alessandro
    It might be due to the inefficient strided access pattern (see [this](https://stackoverflow.com/questions/71818324/how-does-the-cpu-cache-affect-the-performance-of-a-c-program)). That being said, we cannot *reproduce* the problem since the code is *incomplete*. Thus, this is hard to tell what is the source of the issue. Can you make a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – Jérôme Richard May 08 '22 at 14:21
  • @JérômeRichard try this code – Alessandro May 08 '22 at 15:36
  • Results are relatively linear on my machine (Windows, CPython 3.8, Numpy 1.20): https://pastebin.com/pC4MUR9g . The 7K case is relatively fast compared to others. I think this is dependent of the system allocator. But why do you want to use a loop for the multiplication while a vector-matrix multiplication should do the job far more efficiently? The current code create lot of temporary array and make very inefficient cache access. This will cause weird system effects to be apparent (eg. cache trashing, allocation policy, page faults, prefetching, etc.). Why did you disable the GC by the way? – Jérôme Richard May 08 '22 at 17:11

1 Answer


Your code as is killed my session, presumably because the array dimensions were too big for memory. Trying again with smaller numbers:

   In [1]: from collections import deque
   ...: import random
   ...: 
   ...: center_freq = 20000000
   ...: smpl_time = 0.03749995312*pow(10,-6)
   ...: 
   ...: mat_contents = []
   ...: 
   ...: for i in range(10):
   ...:     mat_contents.append(np.ones((36, random.randint(30,300))))
   ...: 
   ...: tempo = np.empty([0,0])
   ...: 
   ...: for i in range(3600):
   ...:     tempo = np.append(tempo, center_freq*smpl_time*i)
   ...: 
   ...: ILO = np.cos(tempo)
   ...: 
   ...: check = 0
In [2]: tempo.shape
Out[2]: (3600,)
In [3]: ILO
Out[3]: 
array([ 1.        ,  0.73168951,  0.07073907, ..., -0.63602366,
       -0.99137119, -0.81472814])
In [4]: len(mat_contents)
Out[4]: 10

A quick note: while the list append in your mat_contents loop is relatively fast, the np.append in the next loop is very slow, since it returns a new array and copies everything so far on every call. Don't use it that way.
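To sketch the difference: growing a Python list and converting once at the end is O(n) overall, while calling np.append in a loop copies the whole array each iteration, making it O(n²):

```python
import numpy as np

n = 3600

# O(n^2): np.append returns a new array each call, copying all prior elements
arr = np.empty(0)
for i in range(n):
    arr = np.append(arr, float(i))

# O(n): grow a Python list, convert to an array once at the end
lst = []
for i in range(n):
    lst.append(float(i))
arr2 = np.array(lst)

assert np.array_equal(arr, arr2)
```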

I don't know if I have time now to look at the next loop, but why are you using a deque there? Why not a list? I haven't worked with deque much, but I don't think it has any advantage over a list for this right-side append.

Correcting ILO for the reduced shape:

In [16]: tempo = np.empty([0,0])
    ...: for i in range(36):
    ...:     tempo = np.append(tempo, center_freq*smpl_time*i)
    ...: 
In [17]: tempo.shape
Out[17]: (36,)
In [18]: ILO = np.cos(tempo)

I suspect that loop can be replaced by one numpy expression, but I won't dig into that yet.
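For what it's worth, a minimal sketch of that one-expression replacement (using the reduced 36-sample shape): multiply a ramp of sample indices by the step, instead of appending one value at a time:

```python
import numpy as np

center_freq = 20000000
smpl_time = 0.03749995312e-6

# Loop version from the question (reduced to 36 samples)
tempo = np.empty(0)
for i in range(36):
    tempo = np.append(tempo, center_freq * smpl_time * i)

# Equivalent one-liner: a vectorized ramp of sample times
tempo_vec = center_freq * smpl_time * np.arange(36)

assert np.allclose(tempo, tempo_vec)
ILO = np.cos(tempo_vec)
```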

In [19]: %%timeit
    ...: for element in mat_contents:
    ...:     Icolumn = np.empty([0,0])
    ...:     Imatrix = deque([])
    ...:     for colonna in element.T:
    ...:         Imatrix.append(np.multiply(ILO, colonna))
    ...:     varI = np.var(Imatrix)
    ...:     tempImean = np.mean(Imatrix)
    ...: 
5.82 ms ± 6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Time with Imatrix = [] is the same, so using deque doesn't hurt.

mat_contents contains arrays with varying shape (2nd dimension):

In [22]: [x.shape for x in mat_contents]
Out[22]: 
[(36, 203),
 (36, 82),
 (36, 194),
 (36, 174),
 (36, 119),
 (36, 190),
 (36, 272),
 (36, 68),
 (36, 293),
 (36, 248)]

Now let's examine the slow loop that you are worried about:

In [23]: ILO.shape
Out[23]: (36,)
In [24]: element=mat_contents[0]
In [25]: element.T.shape
Out[25]: (203, 36)
In [26]: Imatrix =[]
    ...: for colonna in element.T:
    ...:         Imatrix.append(np.multiply(ILO, colonna))
    ...: arr = np.array(Imatrix)
In [27]: arr.shape
Out[27]: (203, 36)

To multiply a (36,) array and a (203,36) array we don't need to loop.

In [28]: np.allclose(ILO*element.T, arr)
Out[28]: True
In [29]: %%timeit
    ...: Imatrix =[]
    ...: for colonna in element.T:
    ...:         Imatrix.append(np.multiply(ILO, colonna))
    ...: arr = np.array(Imatrix)
    ...: 
    ...: 
405 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The whole-array operation is much faster:

In [30]: timeit ILO*element.T
19.4 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In your original question you loaded a .mat file. Are you trying to translate MATLAB code? numpy is like old-time MATLAB, where whole-matrix operations are much faster. numpy does not do jit compilation of loops.
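Putting the pieces together, here is a rough sketch of the fully vectorized per-element step (small stand-in data; the var/mean semantics are assumed to match the original deque-of-columns layout):

```python
import numpy as np
import random

# Small stand-ins for the question's data
ILO = np.cos(20000000 * 0.03749995312e-6 * np.arange(36))
mat_contents = [np.ones((36, random.randint(30, 300))) for _ in range(10)]

for element in mat_contents:
    # Broadcasting: (n, 36) * (36,) multiplies every row by ILO,
    # replacing the per-column loop and the deque entirely
    Imatrix = element.T * ILO
    varI = np.var(Imatrix)
    tempImean = np.mean(Imatrix)
```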

hpaulj
  • thank you, seems that the array*matrix multiplication solved most of the issues, but I'll implement your other suggestions. – Alessandro May 10 '22 at 14:19