Update: this feature is now in `scipy.stats.qmc.discrepancy`; it was ported to Cython and also parallelised.
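
For reference, the SciPy version can be called like this (a minimal sketch, assuming SciPy >= 1.7; `method='CD'` selects the centred L2-discrepancy computed below, and `workers=-1` uses all CPUs):

import numpy as np
from scipy.stats import qmc

sample = np.random.random_sample((100, 2))
# centred L2-discrepancy of the sample, parallelised across all CPUs
c2 = qmc.discrepancy(sample, method='CD', workers=-1)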


I have a function using some for loops and I wanted to improve its speed using NumPy. But this does not seem to do the trick, as the NumPy version appears to be two times slower. Here is the code:

import numpy as np
import itertools
import timeit

def func():
    sample = np.random.random_sample((100, 2))
    
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    # first term: a single pass over the sample
    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
    
        disc1 += prod

    # second term: double loop over all (i, j) pairs (the bottleneck)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


def func_numpy():
    sample = np.random.random_sample((100, 2))

    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))
    
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    
    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


print('Normal function time: ', timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))

The timing output is:

Normal function time:  [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time:  [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]

What am I missing here? I know that the bottleneck is the itertools part, because there I have a 100x100x2 loop instead of a 100x2 loop. Do you see another way to do that?
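
For reference, written out as a formula, the quantity `c2` that both functions compute (with $n$ = n_sample and $d$ = dim) is the squared centred L2-discrepancy:

$$c_2 = \left(\tfrac{13}{12}\right)^{d} - \frac{2}{n}\sum_{i=1}^{n}\prod_{k=1}^{d}\left(1 + \tfrac{1}{2}\left|x_{ik}-\tfrac{1}{2}\right| - \tfrac{1}{2}\left|x_{ik}-\tfrac{1}{2}\right|^{2}\right) + \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\prod_{k=1}^{d}\left(1 + \tfrac{1}{2}\left|x_{ik}-\tfrac{1}{2}\right| + \tfrac{1}{2}\left|x_{jk}-\tfrac{1}{2}\right| - \tfrac{1}{2}\left|x_{ik}-x_{jk}\right|\right)$$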

tupui
  • Have you checked that there are no unwanted broadcast operations? – cs95 Jun 07 '17 at 20:43
  • How can I check that exactly? – tupui Jun 07 '17 at 20:45
  • You could print `x.shape` to ensure that the operations are done properly and the output is of the expected shape/size. – cs95 Jun 07 '17 at 20:46
  • Try profiling your code and see where the bottleneck is. For example, in your numpy version you can compute `np.abs(sample - 0.5)` once and re-use it instead of computing it twice (see the sketch after these comments)... – sirfz Jun 07 '17 at 20:48
  • There's a lot that could be improved in each version. It really doesn't make sense to compare such non-optimal solutions. Try to optimize each individually before comparing them against each other. – MSeifert Jun 07 '17 at 20:51
  • That itertools.product doesn't seem like the most numpy-friendly thing to do. You're going in and out of numpy, paying overhead and who knows what else. – Ignacio Vergara Kausel Jun 07 '17 at 20:51
  • Your itertools loop generates 100*100 steps, while the `dim` loop is only 2 steps. So you aren't making any significant use of `numpy` array functionality. – hpaulj Jun 07 '17 at 20:51
  • Indeed, it is the itertools loop that is the bottleneck. @MSeifert could you elaborate? I thought these two were OK... – tupui Jun 07 '17 at 20:54
  • @sirfz I did it and it is obviously the itertools loop that is doing all the work. I have tried your suggestion (np.abs) with no luck. – tupui Jun 07 '17 at 20:57
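
As a concrete illustration of sirfz's suggestion, the `disc1` line can hoist the repeated sub-expression (a minimal sketch; the name `abs_half` is mine):

abs_half = np.abs(sample - 0.5)  # computed once instead of twice
disc1 = np.sum(np.prod(1 + 0.5 * abs_half - 0.5 * abs_half ** 2, axis=1))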

2 Answers

With NumPy, one must look to vectorize things, and we could certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input sample twice with this loop:

for i, j in itertools.product(range(n_sample), range(n_sample)):

We can convert these iterations into vectorized operations by letting broadcasting handle them.

Now, to have a fully vectorized solution, we would need a lot more memory space, specifically (N,N,M), where (N,M) is the shape of the input data.
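
For illustration, such a fully vectorized variant could look like this (a sketch of the memory-hungry version, not the answer's final approach; the intermediate `pairwise` array has shape (N, N, M)):

abs_c = 0.5 * np.abs(sample - 0.5)                    # shape (N, M)
pairwise = 0.5 * np.abs(sample[:, None, :] - sample)  # shape (N, N, M)
disc2 = np.prod(1 + abs_c[:, None, :] + abs_c[None, :, :] - pairwise, axis=2).sum()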

Another noticeable aspect here is that at each iteration we aren't doing a lot of work: we operate on one row at a time, and each row contains only 2 elements for the given sample. So the idea is to run a loop along M, computing and accumulating the product at each iteration. Thus, for the given sample, it's just two loop iterations.

Coming out of the loop, we would have the accumulated product, which just needs a summation to give disc2 as the final output.

Here's an implementation of the above-mentioned ideas:

prod_arr = 1
for i in range(sample.shape[1]):  # loop along M, the small axis
    si = sample[:,i]
    # si[:,None] - si broadcasts to an (N, N) matrix of pairwise differences
    prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                    0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()

Runtime test

The stripped-down loopy portion of the original approach and the modified version are listed below:

def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
            np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2


def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:,i]
        prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                        0.5 * np.abs(si[:,None] - si)
    disc2 = prod_arr.sum()
    return disc2

Timings and verification:

In [10]: sample = np.random.random_sample((100, 2))

In [11]: org_app(sample)
Out[11]: 11934.878683659041

In [12]: mod_app(sample)
Out[12]: 11934.878683659068

In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop

In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop

That's about a 900x speedup! Hopefully this is motivation enough to look to vectorize things whenever possible.

Divakar
  • Generally better to include `()` instead of using backslashes when using multiline expressions. – MSeifert Jun 07 '17 at 21:33
  • @MSeifert Line wrapping where? – Divakar Jun 07 '17 at 21:34
  • It's removing the backslashes from my comments. But I meant the line that starts with: `prod_arr *= 1 + 0.5 * ...` – MSeifert Jun 07 '17 at 21:34
  • @MSeifert Doesn't allow me to avoid backslash on my Spyder IDE. An IDE thing? – Divakar Jun 07 '17 at 21:42
  • @Divakar Strange, doesn't the version I included in my answer work? (I used the `()` without backslashes.) My IDE hates backslashes at the end of lines. :D – MSeifert Jun 07 '17 at 21:44
  • If anyone is reading this now: I pushed the feature; it is in `scipy.stats.qmc.discrepancy`. Thanks again for the help! – tupui Jan 21 '22 at 16:52

As I mentioned in the comments, your solutions are not really optimal, and it doesn't make much sense to compare non-optimal approaches.

For one thing, iterating over or indexing single elements of a NumPy array is really slow. I recently answered a question including a lot of details (if you're interested, you might have a look at it: "convert np array to a set takes too long"). So the Python approach could be made faster simply by converting the array to a list:

def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]
    sample = sample.tolist()  # converted to list

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * abs(sample[i][k] - 0.5)
            b = 0.5 * abs(sample[j][k] - 0.5)
            c = 0.5 * abs(sample[i][k] - sample[j][k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

I also replaced the np.abs calls with plain abs, which has a lower per-call overhead on scalars, and changed some other parts. In the end this is roughly 10-20 times faster than your original "normal" approach.
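
To see the per-call overhead for yourself, here is a quick micro-benchmark sketch (exact numbers will vary by machine):

import timeit
import numpy as np

x = 0.25
print(timeit.timeit('abs(x - 0.5)', globals={'x': x}))               # plain Python abs
print(timeit.timeit('np.abs(x - 0.5)', globals={'x': x, 'np': np}))  # np.abs on a scalar pays NumPy dispatch overhead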

I didn't have time to check the NumPy approach yet, and @Divakar already included a really good and optimized approach. Comparing the two approaches:

def func_numpy():
    sample = np.random.random_sample((100, 2))

    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 
                           0.5 * np.abs(sample - 0.5) - 
                           0.5 * np.abs(sample - 0.5) ** 2, 
                           axis=1))

    prod_arr = 1
    for i in range(sample.shape[1]):
        s0 = sample[:,i]
        prod_arr *= (1 + 
                     0.5 * np.abs(s0[:,None] - 0.5) + 
                     0.5 * np.abs(s0 - 0.5) - 
                     0.5 * np.abs(s0[:,None] - s0))
    disc2 = prod_arr.sum()

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


print('Normal function time: ' , 
      timeit.repeat('func()', number=20, repeat=3, setup="from __main__ import func"))
# Normal function time:  [1.4846746248249474, 1.5018398493266432, 1.5476674017127152]
print('numpy function time: ', 
      timeit.repeat('func_numpy()', number=20, repeat=3, setup="from __main__ import func_numpy"))
# numpy function time:  [0.020140038561976326, 0.016502230831292763, 0.016452520269695015]

So an optimized NumPy approach can definitely beat an "optimized" Python approach. It's almost 100 times faster. In case you want it even faster, you can use numba on a slightly modified version of the pure Python code:

import numba as nb

@nb.njit
def func_numba():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for item in sample[i]:
            sub = abs(item - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod

    disc2 = 0
    for i in range(n_sample):
        for j in range(n_sample):
            prod = 1
            for k in range(dim):
                a = 0.5 * abs(sample[i,k] - 0.5)
                b = 0.5 * abs(sample[j,k] - 0.5)
                c = 0.5 * abs(sample[i,k] - sample[j,k])
                prod *= 1 + a + b - c
            disc2 += prod

    return (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2

func_numba()  # call once up front so compilation time isn't included in the timings


print('numba function time: ' , 
      timeit.repeat('func_numba()', number=20, repeat=3, setup="from __main__ import func_numba"))
# numba function time:  [0.003022848984983284, 0.0030429566279508435, 0.004060626777572907]

That's roughly another factor of 5 faster than the NumPy approach.

MSeifert
  • @Y0da I also included an additional approach and timed it against Divakar's solution. – MSeifert Jun 07 '17 at 21:31
  • I cannot vote for two answers but yours is also definitely worth looking at! Thanks for the numba approach. – tupui Jun 07 '17 at 21:36
  • @Y0da You can vote on both answers - but you can only **accept** one. And Divakar's answer definitely deserves the **accept**. :) – MSeifert Jun 07 '17 at 21:37
  • Numba the last resort huh! ;) – Divakar Jun 07 '17 at 21:40
  • I'm tired of thinking about broadcasting stuff and worrying about `MemoryError`s - so Numba, Cython or (if possible) numexpr are in most cases much handier than pure NumPy. :) Unfortunately, Cython is not really appropriate on StackOverflow because the compilation is complicated (if one doesn't use IPython's `%%cython` magic). – MSeifert Jun 07 '17 at 21:45
  • @MSeifert With the broadcasting, why is numba faster? I thought that numpy was C behind the scenes, so there should be no speedup for me. – tupui Jun 07 '17 at 22:47
  • @Y0da Numba uses LLVM to compile the code, so the loops run basically at C speed, and it uses C types rather than Python types, which makes the arithmetic as fast as NumPy's. It's faster because it doesn't create temporary arrays (each operation on a numpy array like `array + 2` creates a new array) and it avoids the excessive memory use that broadcasting needs (broadcasting creates a huge intermediate array); see the sketch after these comments. – MSeifert Jun 07 '17 at 22:51
  • However, numba doesn't support all (in fact it only supports a few) Python and numpy operations, and it requires a lot of trial and error to actually make an operation as fast as possible. In this case it's quite easy, but generally it can be quite hard to find a numba implementation that is actually (much) faster than NumPy. – MSeifert Jun 07 '17 at 22:54
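
A tiny illustration of the temporaries MSeifert describes (a sketch; the names are mine):

import numpy as np

a = np.ones(10**6)
b = a + 2  # allocates a brand-new array for the result
a += 2     # in-place: reuses a's existing buffer, no new allocation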