Why Python is so slow for a simple for loop?

Question

We are making some kNN and SVD implementations in Python. Others picked Java. Our execution times are very different. I used cProfile to see where I make mistakes but everything is quite fine actually. Yes, I use numpy also. But I would like to ask simple question.

total = 0.0
for i in range(9999): # xrange is slower according 
    for j in range(1, 9999):            #to my test but more memory-friendly.
        total += (i / j)
print total

This snippet takes 31.40s on my computer.

Java version of this code takes 1 second or less on the same computer. Type checking is a main problem for this code, I suppose. But I should make lots of operation like this for my project and I think 9999*9999 is not so big number.

I think I am making mistakes because I know Python is used by lots of scientific projects. But why is this code so slow and how can I handle problems bigger than this?

Should I use a JIT compiler such as Psyco?

EDIT

I also say that this loop problem is only an example. The code is not as simple as like this and It may be tough to put into practice your improvements/code samples.

Another question is that can I implement lots of data mining & machine learning algorithms with numpy and scipy if I use it correctly?

you could also implement parts of your code in C or C++ using tools like Boost::Python — Fábio Diniz, Nov 11 '11 at 17:09
Any interpreted language is going to be 1-2 orders of magnitude slower than a compiled language (like Java). What makes it popular is factors other than performance. You don't always need high performance. If you do, but you also want python, you need to dip into C. — Mike Dunlavey, Nov 11 '11 at 18:33
Even though this is a OLD thread, I find my self looking over it a dozen times.. I had to put this to the test-bench.. Now using PyPy 1.9 this takes `350 milliseconds` on my very average computer. It takes `8 seconds` with Python 2.7. Of course this speed is reached with putting the code within a function, as that always speeds things up in python. I also used `from __future__ import division` as it's **much** faster. — b0bz, Apr 28 '13 at 14:38

score 47 · Accepted Answer · edited Mar 23 '21 at 13:31

Why is Java faster than Python on this example loops?

Novice Explanation: Think of a program like a freight train that lays its own train-track as it moves forward. Track must be laid before the train can move. The Java Freight train can send thousands of track-layers ahead of the train, all working in parallel laying track many miles in advance, wheras python can only send one laboror at a time, and can only lay track 10 feet in front of where the train is.

Java has strong types and that affords the compiler to use JIT features: (https://en.wikipedia.org/wiki/Just-in-time_compilation) which enable the CPU to fetch memory and execute instructions in the future in parallel, before the instruction is needed. Java can 'sort of' run the instructions in your for loop in parallel with itself. Python has no concrete types and so the nature of the work to be done has to be decided at every instruction. This causes your entire computer to stop and wait for all the memory in all of your variables to be re-scanned. Meaning loops in python are polynomial O(n^2) time, wheras Java loops can be, and often are linear time O(n), due to strong types.

I think I am making mistakes because I know Python is used by lots of scientific projects.

They're heavily using SciPy (NumPy being the most prominent component, but I've heard the ecosystem that developed around NumPy's API is even more important) which vastly speeds up all kinds operations these projects need. There's what you are doing wrong: You aren't writing your critical code in C. Python is great for developing in general, but well-placed extension modules are a vital optimization in its own right (at least when you're crunching numbers). Python is a really crappy language to implement tight inner loops in.

The default (and for the time being most popular and widely-supported) implementation is a simple bytecode interpreter. Even the simplest operations, like an integer division, can take hundreds of CPU cycles, multiple memory accesses (type checks being a popular example), several C function calls, etc. instead of a few (or even single, in the case of integer division) instruction. Moreover, the language is designed with many abstractions which add overhead. Your loop allocates 9999 objects on the heap if you use xrange - far more if you use range (99999999 integer minus around 256256 for small integers which are cached). Also, the xrange version calls a method on each iteration to advance - the range version would too if iteration over sequences hadn't been optimized specifically. It still takes a whole bytecode dispatch though, which is itself vastly complex (compared to an integer division, of course).

It would be interesting to see what a JIT (I'd recommend PyPy over Psyco, the latter isn't actively developed anymore and very limited in scope anyway - it might work well for this simple example though). After a tiny fraction of iterations, it should produce a nigh-optimal machine code loop augmented with a few guards - simple integer comparisions, jumping if they fail - to maintain correctness in case you got a string in that list. Java can do the same thing, only sooner (it doesn't have to trace first) and with fewer guards (at least if you use ints). That's why it's so much faster.

Why is Java faster than PyPy on this example loop? Can you provide a reason/citation or is this your gut feeling? As far as I know PyPy should be in range of GCC with SSE optimizations off. — fijal, Nov 21 '11 at 10:43
`CPU cycles`, `memory accesses`, `C function calls`, `abstractions` - is there a resource that lists the overhead costs/footprint of [basic] CPython operations? It would make a good reference for deciding what *not* to do in critical code with/in Python. — n611x007, Jan 29 '13 at 10:05
@naxa I'm not aware of any, and I doubt its value, at least if it gets this low-level, as it would be outdated *very* quickly. These figures change any time someone changes part of the interpreter or runtime library (which happens quite often, even between bugfix releases). In addition, all this extra information doesn't really help you any more than statements like "local variable lookup is several times faster than global variable lookup" (which are much more likely to remain valid). If two extra memory accesses are too much for you, you shouldn't be writing Python to begin with. — , Jan 29 '13 at 12:14
the difficulty of keeping such a list up-to-date may be a show-stopper; but aside from that, I suppose it would pose value in finding culprits for inefficiency, when trying to optimize critical parts of the code. Those not knowing the underlying implementation would often try to turn to a C module like numpy, only to realize later that things like for-looping on a numpy array is a bad idea for multiple reasons. That said, the best choice may be to read the source of the interpreter and/or a rule of thumb to leave all critical parts in C *only* (task not intermixed with python code). — n611x007, Jan 29 '13 at 13:24
@naxa Finding culprits of existing inefficiencies is usually better done by profiling (there are stack-sampling profilers if the overhead of `cProfile` is too much). And yes, having knowledge of the implementation is a good way to spot inefficient code before it's written... if that knowledge is applied wisely. Otherwise you waste time optimizing things that aren't bottle necks. — , Jan 29 '13 at 13:29

Joe Kington · Answer 2 · 2011-11-11T17:36:43.410

17

Because you mention scientific code, have a look at numpy. What you're doing has probably already been done (or rather, it uses LAPACK for things like SVD). When you hear about python being used for scientific code, people probably aren't referring to using it in the way you do in your example.

As a quick example:

(If you're using python3, your example would use float division. My example assumes you're using python2.x, and therefore integer division. If not, specify i = np.arange(9999, dtype=np.float), etc)

import numpy as np
i = np.arange(9999)
j = np.arange(1, 9999)
print np.divide.outer(i,j).sum()

To give some idea of timing... (I'll use floating point division here, instead of integer division as in your example):

import numpy as np

def f1(num):
    total = 0.0
    for i in range(num): 
        for j in range(1, num):
            total += (float(i) / j)
    return total

def f2(num):
    i = np.arange(num, dtype=np.float)
    j = np.arange(1, num, dtype=np.float)
    return np.divide.outer(i, j).sum()

def f3(num):
    """Less memory-hungry (and faster) version of f2."""
    total = 0.0
    j = np.arange(1, num, dtype=np.float)
    for i in xrange(num):
        total += (i / j).sum()
    return total

If we compare timings:

In [30]: %timeit f1(9999)
1 loops, best of 3: 27.2 s per loop

In [31]: %timeit f2(9999)
1 loops, best of 3: 1.46 s per loop

In [32]: %timeit f3(9999)
1 loops, best of 3: 915 ms per loop

edited Nov 11 '11 at 17:36

answered Nov 11 '11 at 17:17

Joe Kington

275,208
71
604
463

Thanks. Nice but If I make larger than 9999, for example 99999? It is not so flexible? – Baskaya Nov 11 '11 at 17:25
Make it as large as you like. Keep in mind that this particular example will generate an x by x array, though, so you'll eventually run into memory constraints. In that case, you can iterate over one axis (i.e. one loop instead of nested loops) and still use vectorized operations. – Joe Kington Nov 11 '11 at 17:29
I asked because Python says `ValueError: array is too big.` – Baskaya Nov 11 '11 at 17:34
I added an example of iterating over one axis instead of both. `f3` is much less memory-hungry, and should scale quite well. – Joe Kington Nov 11 '11 at 17:37
You should have gotten a `MemoryError`, not a `ValueError`... Are you using numpy or something else? (Python as an `array` class that's something different altogether. People often get it and `numpy.array` confused.) – Joe Kington Nov 11 '11 at 17:38
I am using numpy `Traceback (most recent call last): File "b.py", line 4, in print np.divide.outer(i,j).sum() ValueError: array is too big.` – Baskaya Nov 11 '11 at 18:12
What makes using xrange in `f3` faster than numpy functions in `f2`? I guess the `/` in `f3` is also a numpy operator (also an `__rdiv__` seems to be in effect) and `.sum` is from numpy, too. – n611x007 Jan 29 '13 at 10:17
2

@naxa - Sorry I didn't notice your comment earlier! The reason that `f2` is slower for large numbers is that it builds a `num` x `num` 2D array which eats up a lot of memory. `f3` only builds up a `num`-length 1D array and repeatedly makes a copy of it to do the division on. Memory-allocation for a huge array is actually the bottleneck here, so `f3` is faster, as it's easier to find contiguous space in memory for a small array. – Joe Kington Oct 21 '13 at 16:04
I love this straightforward example of using Scipy. Is there any book comes with the name like "Think Scipy" or "Think Numpy" you can suggest? Using loop is the normal fault for all people who bring their coding skill in C or Java to Python. Let Python call other (optimized) module to do the dirty jobs, that's the pythonic way :-) – Jim Raynor Apr 25 '14 at 13:44
2

@JimRaynor - There may a book along those lines, but I don't know of it. Scipy-lectures is an excellent overview of the entire ecosystem: http://scipy-lectures.github.io/ However, it's not quite what you had in mind. Another good option is to follow [Jamie's](http://stackoverflow.com/search?q=user:110026%20%5Bnumpy%5D), [unutbu's](http://stackoverflow.com/search?q=user%3A190597+[numpy]), [Sven's](http://stackoverflow.com/search?q=user:279627+[numpy]), or [DSM's](http://stackoverflow.com/search?q=user:487339+[numpy]) (etc) numpy answers. – Joe Kington Apr 25 '14 at 15:33
This is called vectorization - redefining your problem in terms of vector-vector, matrix-vector, or matrix-matrix operations and calling a linear algebra library to solve it. – anthonybell Apr 19 '18 at 17:10

Mason Ji Ming · Answer 3 · 2020-01-28T14:30:17.843

I think NumPy can be faster than CPython for loops (I didn't test in PyPy).

I want to start from Joe Kington's code because this answer used NumPy.

%timeit f3(9999)
704 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

by myself:

def f4(num):
    x=np.ones(num-1)
    y=np.arange(1,num)
    return np.sum(np.true_divide(x,y))*np.sum(y)

155 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In addition, High School Mathematics can simplify the problem to computer.

Problem= (1+2+...+(num-1)) * (1/1+1/2+...+1/(num-1))
1+2+...+(num-1)=np.sum(np.arange(1,num))=num*(num-1)/2
1/1+1/2+...+1/(num-1)=np.true_divide (1,y)=np.reciprocal(y.astype(np.float64))

Therefore,

def f5(num):
    return np.sum(np.reciprocal(np.arange(1, num).astype(np.float64))) * num*(num-1)/2
%timeit f5(9999)
106 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In addition, University Mathematics can simplify the problem to computer more.

1/1+1/2+...+1/(num-1)=np.log(num-1)+1/(2*num-2)+np.euler_gamma
(n>2)

np.euler_gamma: Euler-Mascheroni constant (0.57721566...)

Because of inaccuracy of Euler-Mascheroni constant in NumPy, You lose accuracy like 489223499.9991845 -> 489223500.0408554. If You can ignore 0.0000000085% inaccuracy, You can save more time.

def f6(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
%timeit f6(9999)
4.82 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Benefit of NumPy becomes larger with larger input.

%timeit f3(99999)
56.7 s ± 590 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f5(99999)
534 µs ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f5(99999999)
1.42 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.498947911958**416**e+16
%timeit f6(99999999)
4.88 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.498947911958**506**e+16
%timeit f6(9999999999999999999)
17.9 µs ± 921 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In special case, You can use numba (Unfortunately not always).

from numba import jit
@jit
def f7(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
# same code with f6(num)

%timeit f6(999999999999999)
5.63 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
f7(123) # compile f7(num)
%timeit f7(999999999999999)
331 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f7(9999)
286 ns ± 3.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, I recommend to use NumPy, mathematics and numba together.

score 7 · Answer 4 · edited Apr 26 '15 at 02:14

Python for loops are statically typed and interpreted. Not compiled. Java is faster because it has extra JIT acceleration features that Python does not have.

http://en.wikipedia.org/wiki/Just-in-time_compilation

To illustrate just how massive a difference Java JIT makes, look at this python program that takes about 5 minutes:

if __name__ =='__main__':
    total = 0.0
    i=1
    while i<=9999:
        j=1
        while j<=9999:
            total=1
            j+=1
        i+=1
    print total

While this fundamentally equivalent Java program takes about 23 milliseconds:

public class Main{
    public static void main(String args[]){
        float total = 0f; 

        long start_time = System.nanoTime();
        int i=1;

        while (i<=9999){
            int j=1;
            while(j<=9999){
                total+=1;
                j+=1;
            }
            i+=1;
        }
        long end_time = System.nanoTime();

        System.out.println("total: " + total);
        System.out.println("total milliseconds: " + 
           (end_time - start_time)/1000000);
    }
}

In terms of doing anything in a for loop, Java cleans python's clock by being between 1 and 1000 orders of magnitude faster.

Moral of the story: basic python for loops should be avoided at all costs if speedy performance is required. This could be because Guido van Rossum wants to encourage people to use multi-processor friendly constructs like array splicing, which operate faster than Java.

those while loops are not idiomatic python. Usung for _ in loops this code is almost 3 times faster using CPython and only a couple of seconds using pypy. — Kr0e, Aug 29 '13 at 11:17
"between 1 and 1000 orders of magnitude faster." .. so up to 1E1000 times faster? A loop-intensive operation that previously took one second, in Java takes only 1E-1000 seconds? That sounds quite attractive. — gwideman, Feb 13 '19 at 03:45

score 6 · Answer 5 · answered May 30 '14 at 15:29

6

The benefit of Python is that there is a lot more flexibility (e.g. classes are objects) compared to Java (where you only have this reflection mechanism)

What's not mentioned here is Cython. It allows to introduce typed variables and trans-compile your example to C/C++. Then it's much faster. I've also changed the bounds in the loop ...

from __future__ import division

cdef double total = 0.00
cdef int i, j
for i in range(9999):
    for j in range(1, 10000+i):
        total += (i / j)

from time import time
t = time()
print("total = %d" % total)
print("time = %f[s]" % (time() - t))

Followed by

$ cython loops.pyx
$ gcc -I/usr/include/python2.7 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing -O3 -o loops.so loops.c
$ python -c "import loops"

gives

total = 514219068
time = 0.000047[s]

answered May 30 '14 at 15:29

Harald Schilly

1,098
1
14
15

1

Are you sure it's 0.000047s? On my machine (i7 intel CPU), it's 0.42s. – qed Dec 07 '14 at 21:30
i tested on raspberry pi total = 514219068 time = 0.000365[s] – qwerty Jan 27 '16 at 11:56
8

It looks like you defined `t = time()` right before printing `time() - t`. You need to start your timer before the loop. Or is there something I misunderstand about how the file is being compiled? – Overflow289 Jul 08 '19 at 05:41

score 5 · Answer 6 · answered Nov 11 '11 at 17:06

5

This is a known phenomenon -- python code is dynamic and interpreted, java code is statically typed and compiled. No surprises there.

The reasons people give for preferring python are often:

smaller code base
less redundancy (more DRY)
cleaner code

However, if you use a library written in C (from python), the performance may be much better (compare: pickle to cpickle).

answered Nov 11 '11 at 17:06

Matt Fenwick

48,199
22
128
192

Yes I know that but Why so many people use it for scientific research? Just the library supports? It is enough? and how can decrease the execution time because even a little loop is that slow. I know Python code is like a pseudocode [very readable] but lots of scientific projects use bigger loop than this. I can write lots of readable, small, cleaner code but their execution times go infinity :) – Baskaya Nov 11 '11 at 17:10
@Thorn -- some people use it for the reasons given, other people may have different reasons. Code size is *very* important, as it's directly related to how long the project takes to complete. Performance is *not* the only or most important factor to many people. – Matt Fenwick Nov 11 '11 at 17:12
OK. I accept and respect your reasons totally. But please tell me, is the loop above big for a scientific project? – Baskaya Nov 11 '11 at 17:17
@Thorn -- it's not the size of the loop, it's the fact that it's a nested loop, so it does `total += i / j` (10 ** 4) * (10 ** 4) = 10 ** 8 times! Is that a lot? Maybe -- how many protein structures are deposited in the PDB? 10 ** 4? How many atoms does a protein have -- 10 ** 3 or 10 ** 4? How big is the human genome -- 10 ** 9? – Matt Fenwick Nov 11 '11 at 17:19
1

@Thorn: It's big, though not too outlandish for bigger scientific computing projects. That's why people who do such projects identify critical loops and push them into C code. The use Python because (1) it works very well for the other 80% of their code and (2) has neat scientific libraries that either have already solved their specific problem or are easy enough to glue together with Python to do what they need (NumPy). The result may well be faster than what they'd have written themselves (NumPy for instance is backed by vast amounts of hand-written and -optimized C and FORTRAN code). – Nov 11 '11 at 17:28
@MattFenwick Yes, it is big for related projects. But for example if you use kNN on Netflix or 10M Movielens Data, and you want to evaluate the accuracy of your Recommender System's recommendations by using 'Precision & Recall' approach, you see this loops are not so big. – Baskaya Nov 11 '11 at 17:32
even Jave is interpreted by Jvm – vighnesh153 Sep 20 '19 at 06:19

kindall · Answer 7 · 2011-11-11T17:32:56.713

You will find that list comprehensions or generator expressions are significantly faster. For example:

total = sum(i / j for j in xrange(1, 9999) for i in xrange(9999))

This executes in ~11 seconds on my machine vs. ~26 for your original code. Still an order of magnitude slower than the Java, but that's more in line with what you'd expect.

Your original code can, by the way, be sped up slightly by initializing total to 0 rather than 0.0 to use integer rather than floating-point addition. Your divisions all have integer results, so there is no point in summing the results to a float.

On my machine, Psyco actually slows down the generator expressions to about the same speed as your original loop (which it does not accelerate at all).

score 5 · Answer 8 · answered Nov 24 '11 at 23:16

Using kindall's list comprehension

total = sum(i / j for j in xrange(1, 9999) for i in xrange(9999))

is 10.2 seconds and using pypy 1.7 it is 2.5 seconds. It is funny because pypy speeds up original version to 2.5 seconds also. So for pypy list comprehensions would be premature optimization ;). Good job pypy!

score 3 · Answer 9 · answered Jul 11 '17 at 14:32

3

Not sure if the recomendation has been made, but I like replacing for loops with list comprehension. Its faster, cleaner, and more pythonic.

http://www.pythonforbeginners.com/basics/list-comprehensions-in-python

answered Jul 11 '17 at 14:32

Jesse

67
3

score 2 · Answer 10 · answered May 24 '14 at 18:47

2

Doing scientific calculations with python often means using some calculation software written in C/C++ in the most crucial parts, with python as internal script language, as e.x. Sage (which contains also a lot of python code, too).

I think that this may be useful: http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/

As you can see, psyco/PyPy can bring a certain improvement, but still, it probably would be much slower than C++ or Java.

answered May 24 '14 at 18:47

Krystian

191
1
1

1

What Sage actually does is to code a lot in Cython. That's a Python-like language (looks like it and mixed into same source code) but is trans-compiled to a C/C++ Python extension. – Harald Schilly May 30 '14 at 15:15

Michele Folgheraiter · Answer 11 · 2021-04-22T15:47:47.453

If you use While Loops instead of For Loops the execution will be much much faster (tested in Python 3). It will run as faster as a compiled C program that do the same thing. Try the following examples (MIPS calculation is only indicative because does not consider the processor's architecture etc. etc.):

Python 3 Program


import time


N=100
i=0
j=0

StartTime=time.time()
while j<N:
    j=j+1
    while i<1000000:
        a=float(i)/float(j)
        i=i+1
EndTime=time.time()

DeltaTime=(EndTime-StartTime) # time in seconds


MIPS=(1/DeltaTime)*N



print("This program estimates the MIPS that your computational unit can perform")
print("------------------------------------------")
print("Execution Time in Seconds=",DeltaTime)
print("MIPS=",MIPS) 
print("------------------------------------------")

C Program

#include <stdio.h>
#include <time.h>


int main(){

int i,j;
int N=100;
float a, DeltaTime, MIPS;
clock_t StartTime, EndTime;

StartTime=clock();

// This calculates n-time one million divisions

for (j=1;j<N; j++)
 {
    for(i=1;i<1000000;i=i+1)
     {
      a=(float)(i)/(float)(j);
     }
 }


EndTime=clock(); // measures time in microseconds

DeltaTime=(float)(EndTime - StartTime)/1000000;

MIPS=(1/DeltaTime)*N;

printf("------------------------------------------\n");
printf("Execution Time in Seconds=%f \n", DeltaTime);
printf("MIPS=%f \n", MIPS);
printf("------------------------------------------\n");

return 0;

}

I ran 3 nested loops with a range of 300 and both the while and for loop performed equally in python 3.8. — Paul Kenjora, Jan 07 '22 at 02:49

Why Python is so slow for a simple for loop?

EDIT

11 Answers11

Python 3 Program

C Program

Linked