numpy ufuncs speed vs for loop speed

Question

I've read a lot "avoid for loops with numpy". So, I tried. I was using this code (simplified version). Some auxiliary data:

 In[1]: import numpy as np
        resolution = 1000                             # this parameter varies
        tim = np.linspace(-np.pi, np.pi, resolution) 
        prec = np.arange(1, resolution + 1)
        prec = 2 * prec - 1
        values = np.zeros_like(tim)

My first implementation was with for loop:

 In[2]: for i, ti in enumerate(tim):
           values[i] = np.sum(np.sin(prec * ti))

Then, I got rid of the explicit for cycle, and achieved this:

 In[3]: values = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)

And this solution was faster for small arrays, but when I scaled up I got such time dependence:

What I'm missing or is it normal behavior? And if it is not, where to dig?

EDIT: According to the comments, here is some additional information. The times were measured with IPython's %timeit and %%timeit, every run was performed on the fresh kernel. My laptop is acer aspire v7-482pg (i7, 8GB). I'm using:

python 3.5.2
numpy 1.11.2 + mkl
Windows 10

Really, I am building square waves, but not to pollute with the coefficients in question, I've simplified example. — godaygo, Dec 26 '16 at 00:25
How much RAM do you have? If it is not big enough, `tim[:, np.newaxis] * prec` might require swap space, which would lead to a performance hit. — unutbu, Dec 26 '16 at 00:43
@unutbu I have 8GB. Here size = resolution*resolution. Yes, I thought so too that the more cycles the more difference will be. But ... That's why I'm asking:) — godaygo, Dec 26 '16 at 00:50
If you switch the order in which you are running the `timeit()` does the result stay the same? (eg. instead of the loop as the first one being run, make it the second test). I tend to get different results everytime. — l'L'l, Dec 26 '16 at 00:52
@l'L'l Yes I checked this issue, the results are the same. I finished trying with the fresh kernel. — godaygo, Dec 26 '16 at 00:53
Yes I've observed this in other SO questions (I'll try to find a recent one). Your loop case is operating on chunks of size `resolution`. The other on arrays of (roughly) `resolution**2`. For large sizes, the memory management for those large arrays roughly balances out the iteration costs. For large `resolution` the calculation time for one chunk is large relative to the iteration overhead. — hpaulj, Dec 26 '16 at 02:56
http://stackoverflow.com/questions/41148080/how-do-i-speed-up-profiled-numpy-code-vectorizing-numba - a similar case looking at iteration with `dot` v. `einsum` and `matmul` from a couple weeks back. — hpaulj, Dec 26 '16 at 03:05

MSeifert · Accepted Answer · 2016-12-31T01:01:50.073

This is normal and expected behaviour. It's just too simplified to apply the "avoid for loops with numpy" statement everywere. If you're dealing with inner loops it's (almost) always true. But in case of outer loops (like in your case) there are far more exceptions. Especially if the alternative is to use broadcasting because this speeds up your operation by using a lot more memory.

Just to add a bit of background to that "avoid for loops with numpy" statement:

NumPy arrays are stored as contiguous arrays with c types. The Python int is not the same as a C int! So whenever you iterate over each item in an array you need to plug the item from the array, convert it to a Python int and then do whatever you want to do with it and finally you may need to convert it to a c integer again (called boxing and unboxing the value). For example you want to sum the items in an array using Python:

import numpy as np
arr = np.arange(1000)
%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop

You better use numpy:

%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop

Even if you push the loop into Python C code you're far away from the numpy performance:

%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop

There might be exceptions from this rule but these will be really sparse. At least as long as there is some equivalent numpy functionality. So if you would iterate over single elements then you should use numpy.

Sometimes a plain python loop is enough. It's not widly advertised but numpy functions have a huge overhead compared to Python functions. For example consider a 3-element array:

arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)

Which one will be faster?

Solution: The Python function performs better than the numpy solution:

# 10000 loops, best of 3: 21.9 µs per loop  <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python

But what does this have to do with your example? Not all that much in fact because you always use numpy-functions on arrays (not single elements and not even few elements) so your inner loop already uses the optimized functions. That's why both perform roughly the same (+/- a factor 10 with very few elements to a factor 2 at roughly 500 elements). But that's not really the loop overhead, it's the function call overhead!

Your loop solution

Using line-profiler and a resolution = 100:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          752      7.4      5.7      for i, ti in enumerate(tim):
     3       100        12449    124.5     94.3          values[i] = np.sum(np.sin(prec * ti))

95% is spent inside the loop, I've even split the loop body into several parts to verify this:

def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2       101          609      6.0      3.5      for i, ti in enumerate(tim):
     3       100         4521     45.2     26.3          x = prec * ti
     4       100         4646     46.5     27.0          x = np.sin(x)
     5       100         6731     67.3     39.1          x = np.sum(x)
     6       100          714      7.1      4.1          values[i] = x

The time consumers are np.multiply, np.sin, np.sum here, as you can easily check by comparing their time per call with their overhead:

arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop

So as soon as the comulative function call overhead is small compared to the calculation runtime you'll have similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break-even. With 1000 items the call-overhead is still significant:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      1001         5864      5.9      2.4      for i, ti in enumerate(tim):
     3      1000        42817     42.8     17.2          x = prec * ti
     4      1000       119327    119.3     48.0          x = np.sin(x)
     5      1000        73313     73.3     29.5          x = np.sum(x)
     6      1000         7287      7.3      2.9          values[i] = x

But with resolution = 5000 the overhead is quite low compared to the runtime:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2      5001        29412      5.9      0.9      for i, ti in enumerate(tim):
     3      5000       388827     77.8     11.6          x = prec * ti
     4      5000      2442460    488.5     73.2          x = np.sin(x)
     5      5000       441337     88.3     13.2          x = np.sum(x)
     6      5000        36187      7.2      1.1          values[i] = x

When you spent 500us in each np.sin call you don't care about the 20us overhead anymore.

A word of caution is probably in order: line_profiler probably includes some additional overhead per line, maybe also per function call, so the point at which the function call overhead becomes negligable might be lower!!!

Your broadcast solution

I started by profiling the first solution, let's do the same with the second solution:

def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x

Again using the line_profiler with resolution=100:

%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           27     27.0      0.5      x = tim[:, np.newaxis]
     3         1          638    638.0     12.9      x = x * prec
     4         1         3963   3963.0     79.9      x = np.sin(x)
     5         1          326    326.0      6.6      x = np.sum(x, axis=1)
     6         1            4      4.0      0.1      return x

This already exceeds the overhead time significantly and thus we end up a factor 10 faster compared to the loop.

I also did the profiling for resolution=1000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           28     28.0      0.0      x = tim[:, np.newaxis]
     3         1        17716  17716.0     14.6      x = x * prec
     4         1        91174  91174.0     75.3      x = np.sin(x)
     5         1        12140  12140.0     10.0      x = np.sum(x, axis=1)
     6         1           10     10.0      0.0      return x

and with precision=5000:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def fun_func(tim, prec, values):
     2         1           34     34.0      0.0      x = tim[:, np.newaxis]
     3         1       333685 333685.0     11.1      x = x * prec
     4         1      2391812 2391812.0    79.6      x = np.sin(x)
     5         1       280832 280832.0      9.3      x = np.sum(x, axis=1)
     6         1           14     14.0      0.0      return x

The 1000 size is still faster, but as we've seen there the call overhead was still not-negligable in the loop-solution. But for resolution = 5000 the time spent in each step is almost identical (some are a bit slower, others faster but overall quite similar)

Another effect is that the actual broadcasting when you do the multiplication becomes significant. Even with the very smart numpy solutions this still includes some additional calculations. For resolution=10000 you see that the broadcasting multiplication starts to take up more "% time" in relation to the loop solution:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def broadcast_solution(tim, prec, values):
     2         1           37     37.0      0.0      x = tim[:, np.newaxis]
     3         1      1783345 1783345.0    13.9      x = x * prec
     4         1      9879333 9879333.0    77.1      x = np.sin(x)
     5         1      1153789 1153789.0     9.0      x = np.sum(x, axis=1)
     6         1           11     11.0      0.0      return x


Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def loop_solution(tim, prec, values):
     9     10001        62502      6.2      0.5      for i, ti in enumerate(tim):
    10     10000      1287698    128.8     10.5          x = prec * ti
    11     10000      9758633    975.9     79.7          x = np.sin(x)
    12     10000      1058995    105.9      8.6          x = np.sum(x)
    13     10000        75760      7.6      0.6          values[i] = x

But there is another thing besides actual time spent: the memory consumption. Your loop solution requires O(n) memory because you always process n elements. The broadcasting solution however requires O(n*n) memory. You probably have to wait some time if you use resolution=20000 with your loop but it would still only require 8bytes/element * 20000 element ~= 160kB but with the broadcasting you'll need ~3GB. And this neglects the constant factor (like temporary arrays aka intermediate arrays)! Suppose you go even further you'll run out of memory very fast!

Time to summarize the points again:

If you do a python loop over single items in a numpy array you're doing it wrong.
If you loop over subarrays of a numpy-array make sure the function call overhead in each loop is negligable compared to the time spent in the function.
If you broadcast numpy arrays make sure you don't run out of memory.

But the most important point about optimization is still:

Only optimize the code if it's too slow! If it's too slow then optimize only after profiling your code.
Don't blindly trust simplified statements and again never optimize without profiling.

One final thought:

Such functions that either require a loop or broadcasting can be easily implemented using cython, numba or numexpr if there is no already existing solution in numpy or scipy.

For example a numba function that combines the memory efficiency from the loop solution with the speed of the broadcast solution at low resolutions would look like this:

from numba import njit

import math

@njit
def numba_solution(tim, prec, values):
    size = tim.size
    for i in range(size):
        ti = tim[i]
        x = 0
        for j in range(size):
            x += math.sin(prec[j] * ti)
        values[i] = x

As pointed out in the comments numexpr can also evaluate the broadcasted calculation very fast and without requiring O(n*n) memory:

>>> import numexpr
>>> tim_2d = tim[:, np.newaxis]
>>> numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)')

The analogy doesn't seem useful. It's not readily apparent how the mechanics of the situation in the analogy correspond to the mechanics of the actual problem, and it's too easy to get the wrong impression and think that the problem is trying to do a lot of work in one call, instead of trying to use a huge working set. — user2357112, Dec 28 '16 at 04:19
@user2357112 Thanks for the feedback! Would it be better to remove it again? It was meant as the metaphor for broadcasting vs. for-loop where broadcasting might lead to memory-errors or long runtimes if one underestimates the sizes. The post got much longer than I intended so removing some part that's suboptimal might be good. :-) — MSeifert, Dec 28 '16 at 04:25
I'd say remove it and instead lead with a quick summary of how chunking the problem with the Python outer loop reduces the memory consumption, while introducing proportionally less overhead as `resolution` increases (because it's an outer loop). — user2357112, Dec 28 '16 at 04:40
@MSeifert Thank you for such detailed response, some places was not obvious for me and Numba is awesome. I think my misunderstanding associated with my background - Mathematica. p.s.: What profiler do you use? — godaygo, Dec 28 '16 at 20:22
@godaygo I'm glad if I the answer was useful, if there is any part that still needs some clarification I will try my best and edit it. I used [line_profiler](https://github.com/rkern/line_profiler) (imported in [ipython](https://github.com/ipython/ipython) with `%load_ext line_profiler` which activates the `%lprun` command). With Ipython > 5.x you need version line_profiler > 2.x (avaiable via pip or conda in the `conda-forge` channel) — MSeifert, Dec 29 '16 at 03:29
@MSeifert I found that `numexpr.evaluate('sum(sin(prec*ti), axis=1)')` is the simplest way to speed up. It will be nice if you can add this to your answer:) — godaygo, Dec 30 '16 at 08:26
@MSeifert I use `tim = tim[:, np.newaxis]; numexpr.evaluate('sum(sin(tim * prec), axis=1)')`. It gives the same result for me. — godaygo, Dec 30 '16 at 15:28

numpy ufuncs speed vs for loop speed

1 Answers1

Your loop solution

Your broadcast solution

Linked