Q: How do I change my Python script so that it runs as fast as MATLAB?
As abarnet has already given you a lot of knowledgeable directions, let me add my two cents (and some quantitative results).
(Similarly, I hope you will forgive me for skipping the trivial for: loop itself & assuming a more complex computational task.)
- review the code for any possible algorithmic improvements, value re-use(s) and register/cache-friendly arrangements ( numpy.asfortranarray() et al. )
- use vectorised code-execution / loop-unrolling in numpy, wherever possible
- use an LLVM-based JIT-compiler like numba for the stable parts of your code
- use additional (JIT)-compiler tricks ( nogil = True, nopython = True ) only on the final grade of the code, so as to avoid a common premature-optimisation mistake ( a small sketch of these points follows right after this list )
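To make the first three bullets concrete, here is a minimal sketch ( the sum-of-squares task and all names in it are mine, purely illustrative, not from the original question ):

import numpy as np
from numba import jit

def loop_sum_of_squares( x ):        # plain CPython loop: interpreter overhead on every element
    s = 0.0
    for v in x:
        s += v * v
    return s

def vec_sum_of_squares( x ):         # vectorised: a single C-level pass over the array
    return np.dot( x, x )

@jit( nopython = True )              # numba: the same loop, LLVM-compiled down to machine code
def jit_sum_of_squares( x ):
    s = 0.0
    for i in range( x.shape[0] ):
        s += x[i] * x[i]
    return s

aVEC = np.asfortranarray( np.random.rand( 1000000 ) )   # cache-friendly layout ( trivial for 1D )
jit_sum_of_squares( aVEC )           # first call pays the one-off JIT-compile pass, re-use is fast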
The achievable speedups are indeed huge:

An initial code sample is taken from the FX arena ( where milliseconds, microseconds & (wasted) nanoseconds indeed do matter: check that for 50% of market events you have far less than 900 milliseconds to act, end-to-end, on a bi-directional transaction, not speaking about HFT ... ) for processing EMA(200,CLOSE)
- a non-trivial exponential moving average over the last 200 GBPUSD candles/bars in an array of about 5200+ rows:
import numba

#@jit                                 # 2015-06: @autojit deprecated
@numba.jit( 'f8[:](i8,f8[:])' )
def numba_EMA_fromPrice( N_period, aPriceVECTOR ):
    EMA = aPriceVECTOR.copy()
    alf = 2. / ( N_period + 1 )
    for aPTR in range( 1, EMA.shape[0] ):
        EMA[aPTR] = EMA[aPTR-1] + alf * ( aPriceVECTOR[aPTR] - EMA[aPTR-1] )
    return EMA
For this "classical" code, just the very numba
compilation step has made an improvement over the ordinary python/numpy code execution
21x down to about half a millisecond
# 541L
from about 11499 [us] ( yes, from about 11500 microseconds to just 541 [us] )
# classical numpy
# aClk.start();X[:,7] = EMA_fromPrice( 200, price_H4_CLOSE );aClk.stop()
# 11499L
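( For anyone who wants to reproduce such measurements: the aClk stopwatch is this answer's own helper and is not shown here. A minimal stand-in, assuming the standard timeit module and synthetic data in place of the real GBPUSD closes, could look like this: )

import numpy as np
import timeit

price_H4_CLOSE = np.random.rand( 5200 )      # synthetic stand-in for the ~5200 GBPUSD H4 closes
numba_EMA_fromPrice( 200, price_H4_CLOSE )   # warm-up call pays the one-off JIT-compile pass

t = timeit.timeit( lambda: numba_EMA_fromPrice( 200, price_H4_CLOSE ), number = 1000 )
print( "%6.0f [us] per call" % ( t * 1E6 / 1000 ) )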
But if you take more care with the algorithm, and re-design it so as to work smarter & more resource-efficiently, the results are even more fruitful:
@numba.jit
def numba_EMA_fromPrice_EFF_ALGO( N_period, aPriceVECTOR ):
    alfa = 2. / ( N_period + 1 )
    coef = ( 1 - alfa )
    EMA  = aPriceVECTOR * alfa
    EMA[1:] += EMA[0:-1] * coef
    return EMA
# aClk.start();numba_EMA_fromPrice_EFF_ALGO( 200, price_H4_CLOSE );aClk.stop()
# Out[112]: 160814L # JIT-compile-pass
# Out[113]: 331L # re-use 0.3 [ms] v/s 11.5 [ms] CPython
# Out[114]: 311L
# Out[115]: 324L
And the final polishing touch, for multi-CPU-core processing, brought a 46x acceleration, down to about a quarter of a millisecond:
# ___________vvvvv__________# !!! !!!
#@numba.jit( nogil = True ) # JIT w/o GIL-lock w/ multi-CORE ** WARNING: ThreadSafe / DataCoherency measures **
# aClk.start();numba_EMA_fromPrice_EFF_ALGO( 200, price_H4_CLOSE );aClk.stop()
# Out[126]: 149929L # JIT-compile-pass
# Out[127]: 284L # re-use 0.3 [ms] v/s 11.5 [ms] CPython
# Out[128]: 256L
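( Since the EMA recurrence itself is sequentially dependent, the nogil = True benefit really shows once GIL-free compiled kernels are run from several threads over independent data. A minimal sketch of that pattern, where the block_sum kernel and the 4-way split are mine and purely illustrative: )

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import numba

@numba.jit( nopython = True, nogil = True )       # GIL is released inside the compiled kernel
def block_sum( aVEC ):
    s = 0.0
    for i in range( aVEC.shape[0] ):
        s += aVEC[i]
    return s

data   = np.random.rand( 4000000 )
chunks = np.array_split( data, 4 )                # independent blocks -> thread-safe by construction

block_sum( chunks[0] )                            # warm-up: one-off JIT-compile pass

with ThreadPoolExecutor( max_workers = 4 ) as aPOOL:
    total = sum( aPOOL.map( block_sum, chunks ) ) # kernels run in parallel, GIL-free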
As a final bonus: faster is sometimes not the same as better.
Surprised?
No, there is nothing strange in this. Try to make MATLAB calculate SQRT( 2 ) to a precision of about 500,000,000 places behind the decimal point. There it goes.
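( In Python, such arbitrary-precision arithmetic is a few lines away. A minimal sketch with the standard decimal module; the 50-digit precision here is just a demo, the principle scales towards the 500,000,000 places above only at the cost of time & memory: )

from decimal import Decimal, getcontext

getcontext().prec = 50            # demo precision: number of significant digits
print( Decimal( 2 ).sqrt() )      # prints SQRT( 2 ) to 50 significant digits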
Nanoseconds do matter. Even more so here, where precision is the target.
Isn't that worth the time & effort? Sure it is.