
I need to speed up this loop, as it is very slow. But I don't know how to vectorize it, since the result of one value depends on the result of the previous value. Any suggestions?

import numpy as np

sig = np.random.randn(44100)
alpha = .9887
beta = .999

out = np.zeros_like(sig)

for n in range(1, len(sig)):
    if np.abs(sig[n]) >= out[n-1]:
        # rising input: blend the new magnitude into the running value
        out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
    else:
        # falling input: decay the running value
        out[n] = beta * out[n-1]
Christopher Brown

2 Answers


Numba's just-in-time compiler should deal with the indexing overhead you're facing pretty well, by compiling the function to native code during its first execution:

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:
:sig = np.random.randn(44100)
:alpha = .9887
:beta = .999
:
:def nonvectorized(sig):
:    out = np.zeros_like(sig)
:
:    for n in range(1, len(sig)):
:        if np.abs(sig[n]) >= out[n-1]:
:            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs( sig[n] )
:        else:
:            out[n] = beta * out[n-1]
:    return out
:--

In [2]: nonvectorized(sig)
Out[2]: 
array([ 0.        ,  0.01862503,  0.04124917, ...,  1.2979579 ,
        1.304247  ,  1.30294275])

In [3]: %timeit nonvectorized(sig)
10 loops, best of 3: 80.2 ms per loop

In [4]: from numba import jit

In [5]: vectorized = jit(nonvectorized)

In [6]: np.allclose(vectorized(sig), nonvectorized(sig))
Out[6]: True

In [7]: %timeit vectorized(sig)
1000 loops, best of 3: 249 µs per loop

EDIT: as suggested in a comment, adding jit benchmarks. jit(nonvectorized) creates a lightweight wrapper and is thus a cheap operation:

In [8]: %timeit jit(nonvectorized)
10000 loops, best of 3: 45.3 µs per loop

The function itself is compiled during the first execution (hence just-in-time), which takes a while, but probably not as much as you might expect:

In [9]: %timeit jit(nonvectorized)(sig)
10 loops, best of 3: 169 ms per loop
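
If you end up keeping Numba, the decorator form is equivalent to wrapping with jit() by hand, and it can also cache the compiled code on disk, so the one-off compilation cost above is only paid once rather than in every new session. A minimal sketch, assuming a reasonably recent Numba where the nopython and cache flags are available:

import numpy as np
from numba import jit

alpha = .9887
beta = .999

@jit(nopython=True, cache=True)   # compile to native code and persist it on disk
def vectorized(sig):
    out = np.zeros_like(sig)
    for n in range(1, len(sig)):
        if abs(sig[n]) >= out[n-1]:
            out[n] = alpha * out[n-1] + (1 - alpha) * abs(sig[n])
        else:
            out[n] = beta * out[n-1]
    return out

sig = np.random.randn(44100)
out = vectorized(sig)             # first call compiles; later calls run at native speed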
immerrr
  • Could you please also add the result of **`%timeit vectorized = jit(nonvectorized)`**? Thanks – user3666197 Sep 30 '14 at 21:36
  • I had tried installing numba previously, but it was just too much hassle (a specific version of llvm was needed and not available for my distro, and I had version conflicts once installed, despite the claim on the website that they should (implying could) be installed in parallel; llvmpy also had build errors, etc.). And I wasn't convinced it would help with this kind of problem, so I didn't really pursue it. But this answer encouraged me to get it installed. It does work nicely. But the other answer gives me almost the same speed-up with minor edits and no cumbersome dependency. But thanks! – Christopher Brown Oct 01 '14 at 19:43
  • @ChristopherBrown, the other answer gave a 3x speed-up; I observed 320x. Do you have different figures in your environment? As for the hassle with installation, I'd advise using miniconda. It's free and easy; I installed numba, for the first time too, in about 1.5 minutes after reading your question. – immerrr Oct 01 '14 at 20:25
  • @immerrr: I did not see the kind of speedups you observed. My informal numbers were: unoptimized: 8.2 s. Syntax optimized (user3666197 answer): 3.4 s. Numba optimized (your answer): 2.9 s. So for me, numba was the best for speed, but not by much. Interestingly, when I combined the two approaches, I got speeds closer to the syntax-only solution than to the numba-only solution (this suggests that the difference might be measurement error, although I ran each many times). I will take your word for it and look more closely at numba. Will report back when I have more data... – Christopher Brown Oct 02 '14 at 01:49
  • @ChristopherBrown, that's interesting... I do agree that there's nothing to vectorize due to the dependency on previous values, so the only option for big speedups is to get "closer to the metal", and closer-to-the-metal solutions are able to do much better than 3x. If these numbers persist on the latest version of numba, I'd go and report that at github. Cython is also an option, though: with a few type hints it is able to generate low-level code and go full speed, too: http://nbviewer.ipython.org/gist/immerrr/34c532963405eacbea76 – immerrr Oct 02 '14 at 10:22

Low vectorisation potential in "forward-dependent-loop" code

The majority of your "vectorisation" parallelism is out of the game once the dependency is analysed. (A JIT-compiler cannot vectorise "against" such a dependence barrier either.)

You may pre-calculate some re-used values in a vectorised manner, but there is no direct Python-syntax way (without an external JIT-compiler workaround) to arrange a forward-shifting-dependence loop as CPU vector-register-aligned co-parallel computation:

import numpy as np

from zmq import Stopwatch    # ok to use pyzmq 2.11 for [usec] .Stopwatch()
aStopWATCH =    Stopwatch()  # a performance measurement .Stopwatch() instance

sig    = np.abs(sig)         # self-destructive calc/assign avoids memalloc-OPs
out    = np.zeros_like(sig)  # output vector, as in the question
aConst = ( 1 - alpha )       # avoids many repetitive SUB(s) in the loop

for thisPtr in range( 1, len( sig ) ): # FORWARD-SHIFTING-DEPENDENCE LOOP:
    prevPtr = thisPtr - 1              # prevPtr->"previous" TimeSlice in out[] ( re-used 2 x len(sig) times )
    if sig[thisPtr] < out[prevPtr]:                                    # 1st re-use
       out[thisPtr] = out[prevPtr] * beta                              # 2nd
    else:
       out[thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] ) # 2nd

A good example of vectorised speed-up can be seen in cases where the calculation strategy can be parallelised/broadcast along a 1D, 2D or even 3D structure of a native numpy array. For a speedup of about 100x, see the RGBA 2D-matrix accelerated processing in Vectorised code for PNG picture processing (an OpenGL shader pipeline).
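
To make the contrast concrete, here is a toy sketch (not the PNG example from the link, just a hypothetical image) of the kind of computation that does broadcast well: every element is independent of every other, so a single numpy expression replaces a per-pixel Python loop:

import numpy as np

# a hypothetical 1080 x 1920 RGBA image, 4 channels per pixel
rgba   = np.random.randint( 0, 256, ( 1080, 1920, 4 ) ).astype( np.uint8 )

# one broadcast expression processes all pixels at once: no element's
# result depends on another element's result, unlike the loop above
dimmed = ( rgba.astype( np.float32 ) * 0.5 ).astype( np.uint8 )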

Performance still increased about 3x

Even this simple Python code revision has increased the speed by about 2.8x (right away, i.e. without installing an ad-hoc JIT-optimising compiler):

>>> def aForwardShiftingDependenceLOOP(): # proposed code-revision
...     aStopWATCH.start()                # ||||||||||||||||||.start
...     for thisPtr in range( 1, len( sig ) ):
...         #        |vvvvvvv|------------# FORWARD-SHIFTING-LOOP DEPENDENCE
...         prevPtr = thisPtr - 1  #|vvvvvvv|--STEP-SHIFTING avoids Numpy syntax
...         if ( sig[ thisPtr] < out[prevPtr] ):
...             out[  thisPtr] = out[prevPtr] * beta
...         else:
...             out[  thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )
...     usec = aStopWATCH.stop()          # ||||||||||||||||||.stop
...     print usec, " [usec]"

>>> aForwardShiftingDependenceLOOP()
57593  [usec]
>>> aForwardShiftingDependenceLOOP()
57879  [usec]
>>> aForwardShiftingDependenceLOOP()
58085  [usec]

>>> def anOriginalForLOOP():
...     aStopWATCH.start()
...     for n in range( 1, len( sig ) ):
...         if ( np.abs( sig[n] ) >= out[n-1] ):
...             out[n] = out[n-1] * alpha + ( 1 - alpha ) * np.abs( sig[n] )
...         else:
...             out[n] = out[n-1] * beta
...     usec = aStopWATCH.stop()
...     print usec, " [usec]"

>>> anOriginalForLOOP()
164907  [usec]
>>> anOriginalForLOOP()
165674  [usec]
>>> anOriginalForLOOP()
165154  [usec]
user3666197
  • The speed up I get from these simple changes is not quite as much as I see with numba, but it is very close. Thanks! – Christopher Brown Oct 01 '14 at 19:40
  • Another set of speed-up figures may be seen in cases where the vectorisation potential is at its maximum: "**...using the code went from 22.14 seconds to 0.1624 seconds!**" (examples for both Numpy >>> http://stackoverflow.com/a/26111541/3666197 and MATLAB, with a complex re-arrangement of the calculation approach >>> http://stackoverflow.com/a/26102440/3666197) – user3666197 Oct 02 '14 at 13:31