
I have read this ( Is MATLAB faster than Python? ) and I find it has lots of ifs.

I have tried this little experiment on an old computer that still runs on Windows XP.

In MATLAB R2010b I have copied and pasted the following code in the Command Window:

tic
x = 0.23;
for i = 1:100000000
  x = 4 * x * (1 - x);
end
toc
x

The result was:

Elapsed time is 0.603583 seconds.

x =

    0.947347510922557

Then I saved a py file with the following script:

import time
t = time.time()
x = 0.23
for i in range(100000000): x = 4 * x * (1 - x)
elapsed = time.time() - t
print(elapsed)
print(x)

I pressed F5 and the result was

49.78125
0.9473475109225565

In MATLAB it took 0.60 seconds; in Python it took 49.78 seconds (an eternity!!).

So the question is: is there a simple way to make Python as fast as MATLAB?

Specifically: how do I change my py script so that it runs as fast as MATLAB?


UPDATE

I have tried the same experiment in PyPy (copying and pasting the same code as above): it did it in 1.0470001697540283 seconds on the same machine as before.

I repeated the experiments with 1e9 loops.

MATLAB results:

Elapsed time is 5.599789 seconds.
1.643573442831396e-004

PyPy results:

8.609999895095825
0.00016435734428313955

I have also tried with a normal while loop, with similar results:

t = time.time()
x = 0.23
i = 0
while (i < 1000000000):
    x = 4 * x * (1 - x)
    i += 1

elapsed = time.time() - t
print(elapsed)
print(x)

Results:

8.218999862670898
0.00016435734428313955

I am going to try NumPy in a little while.

rappr
    (1) Use NumPy arrays instead of loops. (2) Use PyPy instead of CPython. (3) Manually lift the computation outside the loop, since it's static, and then you can eliminate the loop. :) – abarnert May 27 '15 at 07:08
    Python2? If yes, first thing I'd do is to change range to xrange(). – Łukasz Rogalski May 27 '15 at 07:09
    Did you read the question you have linked? Because it talks about how to improve performance in Python… – poke May 27 '15 at 07:10
  • This question is a little pointless since you've not optimised the code. Unless you are a performance expert in both languages you are the wrong person to perform this comparison. – David Heffernan May 27 '15 at 07:18
    At least three people have now brought up `range`. First, this looks like Python 3 code (he's using Python 3 `print` syntax). Second, it takes milliseconds to allocate that list; optimizing that is the wrong target, unless he's actually running into space issues. – abarnert May 27 '15 at 07:23

3 Answers


First, using `time` is not a good way to benchmark code like this; the `timeit` module exists for exactly that. But let's ignore it for now.
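(For reference, a sketch of how `timeit` could be used instead of a single `time.time` delta; the loop count is reduced here so the demo runs quickly, and you would scale `n` and `number` back up for a real benchmark:)

```python
# Sketch: measuring with timeit instead of a one-shot time.time() delta.
import timeit

def logistic_loop(n=100_000):
    x = 0.23
    for _ in range(n):
        x = 4 * x * (1 - x)
    return x

# best-of-5 repeats gives a steadier number than a single wall-clock reading
best = min(timeit.repeat(logistic_loop, number=10, repeat=5)) / 10
print(f"{best:.6f} s per call")
```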


When you have code that does a lot of looping and repeating very similar work each time through the loop, PyPy's JIT will do a great job. When that code does the exact same thing every time, to constant values that can be lifted out of the loop, it'll do even better. CPython, on the other hand, has to execute multiple bytecodes for each loop iteration, so it will be slow. From a quick test on my machine, CPython 3.4.1 takes 24.2 seconds, but PyPy 2.4.0/3.2.5 takes 0.0059 seconds.

IronPython and Jython are also JIT-compiled (although using the more generic JVM and .NET JITs), so they tend to be faster than CPython for this kind of work as well.


You can also generally speed up work like this in CPython itself by using NumPy arrays and vector operations instead of Python lists and loops. For example, the following code takes 0.011 seconds:

import numpy as np
x = 0.23
i = np.arange(10000000)
i[:] = 4 * x * (1 - x)

Of course in that case, we're explicitly just computing the value once and copying it 10000000 times. But we can force it to actually compute over and over again, and it still takes only 0.12 seconds:

import numpy as np
x = 0.23
i = np.zeros((10000000,))
i = 4 * (x+i) * (1-(x+i))

Other options include writing part of the code in Cython (which compiles to a C extension for Python), and using Numba, which JIT-compiles code within CPython. For toy programs like this, neither may be appropriate—the time spent auto-generating and compiling C code may swamp the time saved by running C code instead of Python code if you're only trying to optimize a one-time 24-second process. But in real-life numerical programming, both are very useful. (And both play nicely with NumPy.)
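For illustration, a minimal sketch of what a Numba version of the original loop might look like (the try/except fallback is mine, so the snippet still runs where Numba isn't installed):

```python
try:
    from numba import njit      # JIT-compile the loop if Numba is available
except ImportError:
    njit = lambda f: f          # fallback: plain CPython, same results

@njit
def logistic(x, n):
    # the same recurrence as the original benchmark
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
    return x

print(logistic(0.23, 1000))
```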

And there are always new projects on the horizon as well.

abarnert
  • Thank you for your answer. I need some time to read your answers and comments, and check the alternatives. BTW, I am using Python 3. – rappr May 27 '15 at 08:24
  • @rappr: Also read the information in the answers to the question you linked in the first place. While some of it is out of date (e.g., for NumPy on Windows, you don't want to get a custom ATLAS build, you want an MKL build—and getting that is as easy as going to [Christoph Gohlke's repo](http://www.lfd.uci.edu/~gohlke/pythonlibs/)), the basic ideas are mostly still relevant. – abarnert May 27 '15 at 08:26
  • @abarnert I have tried `i = np.zeros((10000000,))` and then `i = 4 * (x+i) * (1-(x+i))` but it creates 10,000,000 times the same number (4 * 0.23 * .77). Why is that? – rappr May 28 '15 at 08:24
    @rappr: Because that's what you're asking it to do. If `x` is a scalar, and every element of `i` has the same value, and the expression doesn't reference anything but `x` and `i`, then of course every element in the result will be the same. You're computing the same value 10000000 times. If you want to compute 10000000 different values, then you need to start with 10000000 different values (e.g., if `i = arange(10000000)`, then after `i = 4 * (x+i) * (1-(x+i))`, you'll have 10000000 different values). – abarnert Jun 05 '15 at 06:07

A (somewhat educated) guess is that Python does not perform loop unrolling on your code while MATLAB does. This means the MATLAB code performs one large computation rather than many (!) smaller ones. This is a major reason for going with PyPy rather than CPython, as PyPy does perform loop unrolling.

If you're using Python 2.x, you should replace range with xrange, as range (in Python 2.x) builds the entire list in memory before iterating through it.
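(Worth noting: in Python 3 this point is moot, because range is already a lazy sequence, as this quick check illustrates:)

```python
import sys

r = range(10**8)            # Python 3: a lazy sequence object, no list is built
print(sys.getsizeof(r))     # a small constant, not hundreds of megabytes
print(r[10], len(r))        # indexing and len() still work without materialising
```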

EvenLisle
    Unless he's got very tiny RAM, the cost of creating that list is almost nothing compared to the cost of iterating it and doing the same calculation over and over. – abarnert May 27 '15 at 07:15
  • I know, that's why I put the loop unrolling argument first. But it's never a bad idea to use less memory than you have to. – EvenLisle May 27 '15 at 07:23
    Sure, but in any real-life program like this, I'd probably attack the problem by wasting a similar amount of memory using NumPy so I can vectorize instead of iterating; saving 20+ seconds of time vs. spending 80MB of memory is usually a no-brainer… – abarnert May 27 '15 at 07:26
  • No argument from me, merely pointing out that if OP has the freedom of choice between CPython and PyPy, in this instance he might prefer to go with PyPy rather than CPython and numpy (as per your answer, PyPy outperforms numpy). – EvenLisle May 27 '15 at 07:33
  • Definitely. That's why I explained PyPy first as well. Anything to help people get over the idea that PyPy is some experimental, not-production-ready thing. There are things it's not good for (gluing together a bunch of C libraries, and, sadly, using large parts of the SciPy stack), and sometimes you can't use it (because you're deploying some app on machines you can't install software on), and it's a little behind CPython in 3.x features—but when it's appropriate, definitely use it. – abarnert May 27 '15 at 07:36
  • The MATLAB link you included is not relevant - it's a link to MATLAB Coder, which is an add-on product to MATLAB that converts MATLAB code to C code. The content of the link describes how to control loop unrolling in the generated C code, not in the original MATLAB code. Modern versions of MATLAB are JIT-compiled and will vectorize some, but by no means all, for loops. – Sam Roberts May 27 '15 at 13:19

Q: how do I change my py script so that it runs as fast as MATLAB?

As abarnert has already given you a lot of knowledgeable directions, let me add my two cents ( and some quantitative results ).

( I hope you will forgive me for skipping the original for loop and assuming a slightly more complex computational task instead )

  • review the code for any possible algorithmic improvements, value re-use(s) and register/cache-friendly arrangements ( numpy.asfortranarray() et al )

  • use vectorised code-execution / loop-unrolling in numpy, wherever possible

  • use an LLVM-based JIT compiler like numba for the stable parts of your code

  • use additional (JIT)-compiler tricks ( nogil = True, nopython = True ) only on the final grade of the code, to avoid the common premature-optimisation mistake

The achievable speedups are indeed huge:

Where nanoseconds matter

The initial code sample is taken from the FX arena, where milliseconds, microseconds and (wasted) nanoseconds indeed do matter: for about 50% of market events you have far less than 900 milliseconds to act ( end-to-end, bi-directional transaction ), not speaking about HFT. The task is to process EMA(200,CLOSE), a non-trivial exponential moving average over the last 200 GBPUSD candles/bars, in an array of about 5200+ rows:

import numba
#@jit                                               # 2015-06 @autojit deprecated
@numba.jit('f8[:](i8,f8[:])')
def numba_EMA_fromPrice( N_period, aPriceVECTOR ):
    EMA = aPriceVECTOR.copy()
    alf = 2. / ( N_period + 1 )
    for aPTR in range( 1, EMA.shape[0] ):
        EMA[aPTR] = EMA[aPTR-1] + alf * ( aPriceVECTOR[aPTR] - EMA[aPTR-1] )
    return EMA

For this "classical" code, the very numba compilation step alone has made a 21x improvement over the ordinary python/numpy code execution, down to about half a millisecond:

from about 11499 [us] ( yes, from about 11500 microseconds to just 541 [us] )

#       classical numpy
# aClk.start();X[:,7] = EMA_fromPrice( 200, price_H4_CLOSE );aClk.stop()
# 11499L

#       numba.jit
#   541L
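( `aClk` above is the author's own stopwatch helper; a standard-library equivalent of the classical-numpy measurement might look like this sketch, with a random vector standing in for the real GBPUSD closes: )

```python
import timeit
import numpy as np

def EMA_fromPrice(N_period, aPriceVECTOR):
    # plain-Python reference implementation of the same EMA recurrence
    EMA = aPriceVECTOR.copy()
    alf = 2. / (N_period + 1)
    for aPTR in range(1, EMA.shape[0]):
        EMA[aPTR] = EMA[aPTR-1] + alf * (aPriceVECTOR[aPTR] - EMA[aPTR-1])
    return EMA

price_H4_CLOSE = 1.5 + 0.01 * np.random.rand(5200)   # stand-in price data
t = timeit.timeit(lambda: EMA_fromPrice(200, price_H4_CLOSE), number=10) / 10
print("%.0f us per call" % (t * 1e6))
```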

But, if you take more care with the algorithm and re-design it so as to work smarter and more resource-efficiently, the results are even more fruitful:

@numba.jit
def numba_EMA_fromPrice_EFF_ALGO( N_period, aPriceVECTOR ):
    alfa    = 2. / ( N_period + 1 )
    coef    = ( 1 - alfa )
    EMA     = aPriceVECTOR * alfa
    EMA[1:]+= EMA[0:-1]    * coef
    return EMA

#   aClk.start();numba_EMA_fromPrice_EFF_ALGO( 200, price_H4_CLOSE );aClk.stop()
#   Out[112]: 160814L                               # JIT-compile-pass
#   Out[113]:    331L                               # re-use 0.3 [ms] v/s 11.5 [ms] CPython
#   Out[114]:    311L
#   Out[115]:    324L

And the final polishing-touch for multi-CPU-core processing


46x speedup, down to about a quarter of a millisecond

# ___________vvvvv__________# !!!     !!! 
#@numba.jit( nogil = True ) # JIT w/o GIL-lock w/ multi-CORE ** WARNING: ThreadSafe / DataCoherency measures **
#   aClk.start();numba_EMA_fromPrice_EFF_ALGO( 200, price_H4_CLOSE );aClk.stop()
#   Out[126]: 149929L                               # JIT-compile-pass
#   Out[127]:    284L                               # re-use 0.3 [ms] v/s 11.5 [ms] CPython
#   Out[128]:    256L

As a final bonus: faster is sometimes not the same as better.

Surprised?

No, there is nothing strange in this. Try to make MATLAB calculate SQRT( 2 ) to a precision of about 500,000,000 decimal places. There it goes.

Nanoseconds do matter. All the more so here, where precision is the target.


Isn't that worth the time & effort? Sure, it is.

user3666197