I would like to speed up this code:

import numpy as np
import pandas as pd

a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
    u1 = 13 * up + x
    u.append(u1)
    up = u1
for y in np.nditer(downloss[14:]):
    d1 = 13 * down + y
    d.append(d1)
    down = d1

The data looks like this:

0     49.00
1     48.76
2     48.52
3     48.28
...
36785758    13.88
36785759    14.65
36785760    13.19

Name: Clsprc, Length: 36785759, dtype: float64

The for loop is too slow. What can I do to speed up this code? Can I vectorize the entire operation?

Wonjung Kim
lmdave
  • what is the size of `a['Clsprc']`? – Shiplu Mokaddim Jul 25 '15 at 04:19
  • Programmatically I would look for a way to parallelize the operation, such as a parallelized NumPy array implementation or another collection type (as is available with Scala collections). There are a number of ways to parallelize for loops; for example, see https://pythonhosted.org/joblib/parallel.html. Anaconda has an MKL optimizations package specifically for improving the performance of NumPy, SciPy, scikit-learn, and NumExpr (see https://store.continuum.io/cshop/mkl-optimizations/), however it's not free. –  Jul 25 '15 at 04:21
  • Is that really the right code? Those multiplications by 13 look very strange to me; your values will inf out in no time. I *think* you're trying to compute a rolling mean and you forgot the divisions, but since your code isn't documented, you don't explain what you're trying to do, and your implementation seems weird, it's hard to be sure. – DSM Jul 25 '15 at 04:49

2 Answers


It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case, then you may want to see this SO question. Meanwhile, here's a fast simple moving average using the cumsum() function, taken from the referenced link.

def moving_average(a, n=14):
    # Running total; the difference of totals n apart is each window's sum.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    # Divide each window sum by n to get the window mean.
    return ret[n - 1:] / n
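
If the missing division was in fact the intent (Wilder's smoothing, the averaging used in RSI, i.e. up = (13*up + gain) / 14), recent pandas versions can compute the whole recurrence in one vectorized call via ewm. A minimal sketch, assuming closep is the 'Clsprc' series from the question:

import pandas as pd

delta = closep.diff()
upgain = delta.clip(lower=0)        # positive changes, 0 elsewhere
downloss = (-delta).clip(lower=0)   # magnitudes of negative changes, 0 elsewhere

# adjust=False gives the plain recurrence y[t] = (1-alpha)*y[t-1] + alpha*x[t],
# which with alpha = 1/14 is exactly y[t] = (13*y[t-1] + x[t]) / 14.
avg_up = upgain.ewm(alpha=1/14, adjust=False).mean()
avg_down = downloss.ewm(alpha=1/14, adjust=False).mean()

Note this seeds the recurrence from the first value rather than from the mean of the first 14 values as the question's code does, so the earliest outputs differ slightly.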

If this is not the case, and you really want the function described, you can increase the iteration speed by using the external_loop flag in your iterator. From the numpy documentation:

The nditer will try to provide chunks that are as large as possible to the inner loop. By forcing ‘C’ and ‘F’ order, we get different external loop sizes. This mode is enabled by specifying an iterator flag.

Observe that with the default of keeping native memory order, the iterator is able to provide a single one-dimensional chunk, whereas when forcing Fortran order, it has to provide three chunks of two elements each.

for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
    # x now has x[0], x[1], x[2], x[3], x[4], x[5] elements.
Thane Plummer
  • For a 1d array, the external loop iter just hands out the whole array in one step. It is the same as `x=upgain[...]`. – hpaulj Jul 25 '15 at 06:15
  • How is "as large as possible" determined? Can this be used to process a mmap array by breaking it into chunks that fit in memory? – endolith Mar 21 '21 at 18:05

In simplified terms, I think this is what the loops are doing:

upgain = np.array([.1, .2, .3, .4])
u = []
up = 1
for x in upgain:
    u1 = 10*up + x
    u.append(u1)
    up = u1

producing:

[10.1, 101.2, 1012.3, 10123.4]

np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't offhand think of a usable way of combining these with compiled numpy functions (a sketch after this paragraph shows why the direct closed form doesn't scale). We could write a custom ufunc and use its accumulate. Or we could write it in Cython (or another C interface).
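
For illustration (not part of the original answer), that cumprod-plus-scaled-cumsum combination can be written out directly: the recurrence u[k] = c*u[k-1] + x[k] has the closed form u[k] = c**k * (c*u0 + sum of x[j]/c**j for j <= k). A sketch:

import numpy as np

def recur_closed_form(x, u0, c=10.0):
    # u[k] = c**k * (c*u0 + cumulative sum of x[j] / c**j)
    x = np.asarray(x, dtype=float)
    p = c ** np.arange(len(x))
    return p * (c * u0 + np.cumsum(x / p))

recur_closed_form([.1, .2, .3, .4], 1)
# -> array([   10.1,   101.2,  1012.3, 10123.4])

This reproduces the toy output above, but p overflows to inf (and x/p underflows to 0) after a few hundred elements, so it is useless at the question's scale.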

https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.


To use frompyfunc, define:

def foo(x, y): return 10*x + y

The loop application (above) would be

def loopfoo(upgain, u, u1):
    for x in upgain:
        u1 = foo(u1, x)
        u.append(u1)
    return u

The 'vectorized' version would be:

vfoo = np.frompyfunc(foo, 2, 1)  # 2 input args, 1 output
vfoo.accumulate(upgain, dtype=object).astype(float)

The dtype=object requirement was noted in the earlier SO answer and in https://github.com/numpy/numpy/issues/4155.

In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]

In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)

For this small list, loopfoo is faster (3 µs vs 21 µs).

For a 100-element array, e.g. biggain = np.linspace(.1, 1, 100), vfoo.accumulate is faster:

In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop

In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop

For an even larger biggain = np.linspace(.001, .01, 1000) (smaller numbers to avoid overflow), the 5x speed ratio remains.
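
As a closing aside, also not in the original answer: if the division by 14 was intended (as DSM's comment suggests), the recurrence u[n] = (13*u[n-1] + x[n]) / 14 is a first-order linear IIR filter, and scipy.signal.lfilter evaluates such recurrences in compiled code. A minimal sketch, assuming SciPy is available and up0 is the mean of the first 14 gains used to seed the loop:

import numpy as np
from scipy.signal import lfilter

def wilder_smooth(x, up0, n=14):
    # Implements y[k] = (1/n)*x[k] + ((n-1)/n)*y[k-1], seeded with up0.
    b = [1.0 / n]                          # weight on the new sample
    a = [1.0, -(n - 1.0) / n]              # weight on the previous output
    zi = np.array([(n - 1.0) / n * up0])   # initial filter state carries the seed
    y, _ = lfilter(b, a, x, zi=zi)
    return y

For example, u = wilder_smooth(upgain[14:], up) would replace the question's first loop. The literal, undivided recurrence corresponds to a = [1.0, -13.0], but as noted in the comments it overflows to inf almost immediately.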

hpaulj