
Here's my Python code:

from time import time

X = [[0] * 1000] * 100
start = time()
for x in xrange(100):
    for i in xrange(len(X)):
        for j in xrange(len(X[i])):
            X[i][j] += 1
print time() - start

My Cython code is the same:

from time import time

X = [[0] * 1000] * 100
start = time()
for x in xrange(100):
    for i in xrange(len(X)):
        for j in xrange(len(X[i])):
            X[i][j] += 1
print time() - start

Output:

  • Python cost: 2.86 sec
  • Cython cost: 0.41 sec

Is there any other, faster way to do the same thing in Python or Cython?

Update: is there any way to create a 2D array X with indexing performance close to that of int X[][] in C/C++?

For now I am considering using the Python C API to do the job.

One more thing: a NumPy array doing the same thing is much slower (70 sec) than a list, in both pure Python and Cython.

Python:

import numpy as np
from time import time

X = np.zeros((100, 1000), dtype=np.int32)
start = time()
for x in xrange(100):
    for i in xrange(len(X)):
        for j in xrange(len(X[i])):
            X[i][j] += 1
print time() - start

If I am doing a lot of element accesses on a numeric array, which approach is best?

  • Can you show the code with `numpy`? – Maciej Gol Aug 19 '14 at 08:26
  • A previous question regarding why Cython is faster and the price to pay for the speed, for interest: http://stackoverflow.com/questions/2697275/cython-speed-boost-vs-usability – FinlayL Aug 19 '14 at 08:35
  • @kroolik added the numpy array code – lessisawesome Aug 19 '14 at 08:36
  • Also another interesting related question/answer http://stackoverflow.com/questions/7799977/numpy-vs-cython-speed?rq=1 – FinlayL Aug 19 '14 at 08:41
  • 1
  • numpy can be very fast if used as intended, i.e. by writing *vectorized* code. Your numpy example could be written simply as `X = np.zeros((100, 1000), np.int32); X += 1` (or even more trivially, as `np.ones((100, 1000), np.int32)`), which should be pretty damn fast (~85us for the in-place addition on my machine). If you post the actual function you're trying to optimize then there's a good chance we could help you to write a vectorized version. – ali_m Aug 19 '14 at 14:57
  • @ali_m, I had tried np.array, but it was not faster. I have posted the code here; the sample_topic function is the one I am trying to optimize. https://github.com/darlinglele/lda_topic_model/blob/master/topic_model.py – lessisawesome Aug 20 '14 at 10:20
  • 1
  • I'm not surprised - literally none of that code is vectorized! It's not the array container itself that's any faster (accessing individual elements is actually slower than with a list), but rather it lets you apply operations to multiple array elements at once, allowing you to avoid looping over the elements in Python. To get the most out of numpy you really have to stop treating arrays like nested lists and get your head around the concept of vectorized operations ([see here for a tutorial](http://www.sam.math.ethz.ch/~raoulb/teaching/PythonTutorial/intro_numpy.html#operation-on-arrays)). – ali_m Aug 20 '14 at 10:51
  • @ali_m, yes, there are no operations between arrays. I just need a raw array that supports quick indexing. – lessisawesome Aug 20 '14 at 11:17
  • I think almost all of that code could be vectorized. Try taking your example to [Code Review](http://codereview.stackexchange.com) if you need advice on how to do so. – ali_m Aug 20 '14 at 11:33

3 Answers


To answer the question in your title, your Cython code beats your Python code because, despite the lack of cdef to declare variables, C code is being generated for the for loops (in addition to lots of extra C code to describe the Python objects). To speed up your Cython code use cdef to declare the integers i, j and x so they're no longer Python integers: e.g. cdef int i. You can also declare C-type arrays in Cython which should improve performance further.
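For illustration, a minimal sketch of that fully typed version (hypothetical module and function names; assumes the usual cythonize build step):

# fast_inc.pyx - hypothetical module name
def increment_all():
    # C-typed loop counters and a C array: these loops compile to
    # plain C instead of manipulating Python objects
    cdef int i, j, x
    cdef int X[100][1000]
    for i in range(100):
        for j in range(1000):
            X[i][j] = 0          # C arrays start uninitialized
    for x in range(100):
        for i in range(100):
            for j in range(1000):
                X[i][j] += 1

With everything typed, each X[i][j] += 1 becomes a direct C array access, which is what gets you close to handwritten C speed.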

A quick way to get the same result with NumPy:

X = np.zeros((100, 1000), dtype=np.int32)
X += 10000

If you can help it, you should never use for loops with NumPy arrays. They are completely different to lists in terms of memory usage.
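As a rough illustration of the difference (timings are machine-dependent; this just contrasts the vectorized form with the loops above):

import numpy as np
from time import time

X = np.zeros((100, 1000), dtype=np.int32)
start = time()
X += 10000            # one vectorized add over all 100,000 elements
print time() - start  # typically a tiny fraction of a second

# same final state as the nested loops over the aliased list of lists
assert (X == 10000).all()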

Alex Riley

Any other faster way to do the same above in Python or Cython?

The equivalent, faster code would be:

X = [[100 * 100] * 1000] * 100

In your code, you are creating a 1000-long list of zeroes, then creating a 100-long list of references to that list. Now, iterating 100 times over that 100-long list results in each position being incremented 100 * 100 = 10000 times.

>>> len(set(map(id, X)))
1

If you would like to end up with a list of 100 distinct lists:

>>> base = [100] * 1000
>>> X = [list(base) for _ in xrange(100)]
>>> len(set(map(id, X)))
100

Note that list(base) makes a shallow copy: the references to objects inside the list are copied, not the objects themselves.
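A small demonstration of that shallow-copy behaviour (illustrative values only):

base = [[0], [1]]     # a list containing mutable objects
copy = list(base)     # new outer list, same inner objects
copy[0].append(99)
print base[0]         # [0, 99] - the change is visible through the original
print copy is base    # False - the outer lists themselves are distinct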

Maciej Gol
  • Thank you kroolik, your code is exactly right. It's my fault that I didn't describe my intent clearly. I want to find a fast way to use a 2D array with high indexing performance, like int a[][] in C code. – lessisawesome Aug 19 '14 at 10:42
  • @lessisawesome, for what purpose? The most generic approach is the one you've posted: a double loop in Cython with the code you need in the inner loop. Any further optimization can be done with a specific use case in mind. – Maciej Gol Aug 19 '14 at 12:05

ajcr's answer is probably the fastest and simplest one. You should first explicitly declare the data types of your variables in the Cython code. In addition, I would use prange instead of a plain range iterator for the outer loop. This activates OpenMP multithreading, which might speed up your code further, but I really doubt that this solution will beat the NumPy implementation.
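As a rough sketch of that idea (hypothetical module name; the extension must be compiled with OpenMP flags such as -fopenmp, and this version parallelizes the row loop rather than the outermost loop so that no two threads write to the same element):

# parallel_inc.pyx - hypothetical module name
from cython.parallel import prange

def increment_parallel():
    cdef int i, j, x
    cdef int X[100][1000]
    for i in range(100):
        for j in range(1000):
            X[i][j] = 0
    for x in range(100):
        # prange releases the GIL and splits the rows across OpenMP
        # threads; each thread owns whole rows, so no write conflicts
        for i in prange(100, nogil=True):
            for j in range(1000):
                X[i][j] += 1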

lemitech