Performance in different vectorization method in numpy

Question

I wanted to test the performance of vectorizing code in python:

import timeit
import numpy as np

def func1():
  x = np.arange(1000)
  sum = np.sum(x*2)
  return sum

def func2():
  sum = 0
  for i in xrange(1000):
    sum += i*2
  return sum

def func3():
  sum = 0
  for i in xrange(0,1000,4):
    x = np.arange(i,i+4,1)
    sum += np.sum(x*2)
  return sum

print timeit.timeit(func1, number = 1000)
print timeit.timeit(func2, number = 1000)
print timeit.timeit(func3, number = 1000)

The code gives the following output:

0.0105729103088
0.069864988327
0.983253955841

The performance difference in the first and second functions are not surprising. But I was surprised that the 3rd function is significantly slower than the other functions.

I am much more familiar in vectorising code in C than in Python and the 3rd function is more C-like - running a for loop and processing 4 numbers in one instruction in each loop. To my understanding numpy calls a C function and then vectorize the code in C. So if this is the case my code is also passing 4 numbers to numpy each at a time. The code shouldn't perform better when I pass more numbers at once. So why is it much more slower? Is it because of the overhead in calling a numpy function?

Besides, the reason that I even came up with the 3rd function in the first place is because I'm worried about the performance of the large amount of memory allocation to x in func1.

Is my worry valid? Why and how can I improve it or why not?

Thanks in advance.

Edit:

For curiosity sake, although it defeats my original purpose for creating the 3rd version, I have looked into roganjosh's suggestion and tried the following edit.

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    sum += np.sum(x[i:i+4]*2)
  return sum

The output:

0.0104308128357
0.0630609989166
0.748773813248

There is an improvement, but still a large gap compared with the other functions.

Is it because x[i:i+4] still creates a new array?

Edit 2:

I've modified the code again according to Daniel's suggestion.

def func1():
  x = np.arange(1000)
  x *= 2
  return x.sum()

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    x[i:i+4] *= 2
    sum += x[i:i+4].sum()
  return sum

The output:

0.00824999809265
0.0660569667816
0.598328828812

There is another speedup. So the declaration of numpy arrays are definitely a problem.Now in func3 there should be one array declaration only, but yet the time is still way slower. Is it because of the overhead of calling numpy arrays?

Well, `func3` keeps creating new arrays of varying length inside a `for` loop... — roganjosh, May 06 '17 at 17:37
Varying length? The size of the array should be constant of 4. Or are you referring to the dynamic nature of numpy array? Im not too familiar with the implementation of numpy arrays. — Robin Yuen, May 06 '17 at 17:43
Ah yeah, I missed the `i` start, so fixed length. Still, you're creating hundreds of them. A `for` loop still executes in Python time though. — roganjosh, May 06 '17 at 17:44
Well, my testing so far was looking at defining `np.arange(i,i+4,1)` which was about 0.34 secs if I called it 250 times in a loop... but that's an order of magnitude smaller than what `func3` was giving me, which is surprising (since it already includes the `for` loop, `i` indexing and creating an array object each time). For my own curiosity I'm looking into that. — roganjosh, May 06 '17 at 17:59
If you must iterate in Python, choose the smallest dimension. 4 iterations with arrays of size (250,) will be faster than 250 iterations with (4,) arrays. — hpaulj, May 06 '17 at 19:00

MSeifert · Accepted Answer · 2017-05-06T22:28:20.453

It seems you're mostly interested in the difference between your function 3 compared to the pure NumPy (function 1) and Python (function 2) approaches. The answer is quite simple (especially if you look at function 4):

NumPy functions have a "huge" constant factor.

You typically need several thousand elements to get in the regime where the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plot is at the end of the answer) you can easily check the runtime dependency:

import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()    # calling the method
    time2 = %timeit -o np.sum(arr)  # calling the function
    n.append(num)
    timing_sum1.append(time1)
    timing_sum2.append(time2)

The results for np.sum (shortened) are quite interesting:

4
22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16
25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64
25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4096
27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16384
40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65536
91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262144
394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1048576
1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4194304
4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216
18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems the constant factor is roughly 20µs on my computer) and it takes an array with 16384 thousand elements to double that time. So the timing for function 3 and 4 are mostly timing multiplicatives of the constant factor.

In function 3 you include the constant factor 2 times, once with np.sum and once with np.arange. In this case arange is quite cheap because each array is the same size, so NumPy & Python & your OS probably reuse the memory of the array of the last iteration. However even that takes time (roughly 2µs for very small arrays on my computer).

More generally: To identify bottlenecks you should always profile the functions!

I show the results for the functions with line-profiler. Therefore I altered the functions a bit so they only do one operation per line:

import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):  # I'm using python3, so "range" is like "xrange"!
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_

Results:

%load_ext line_profiler

%lprun -f func1 func1()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def func1():
     5         1           62     62.0     23.8      x = np.arange(1000)
     6         1           65     65.0     24.9      x = x*2
     7         1          134    134.0     51.3      return np.sum(x)

%lprun -f func2 func2()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def func2():
    10         1            7      7.0      0.1      sum_ = 0
    11      1001         2523      2.5     30.9      for i in range(1000):
    12      1000         2819      2.8     34.5          tmp = i*2
    13      1000         2819      2.8     34.5          sum_ += tmp
    14         1            3      3.0      0.0      return sum_

%lprun -f func3 func3()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def func3():
    17         1            7      7.0      0.0      sum_ = 0
    18       251          909      3.6      2.9      for i in range(0, 1000, 4):
    19       250         6527     26.1     21.2          x = np.arange(i, i + 4, 1)
    20       250         5615     22.5     18.2          x = x * 2
    21       250        16053     64.2     52.1          tmp = np.sum(x)
    22       250         1720      6.9      5.6          sum_ += tmp
    23         1            3      3.0      0.0      return sum_

%lprun -f func4 func4()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def func4():
    26         1            7      7.0      0.0      sum_ = 0
    27         1           49     49.0      0.2      x = np.arange(1000)
    28       251          892      3.6      3.4      for i in range(0, 1000, 4):
    29       250         2177      8.7      8.3          y = x[i:i + 4]
    30       250         5431     21.7     20.7          y = y * 2
    31       250        15990     64.0     60.9          tmp = np.sum(y)
    32       250         1686      6.7      6.4          sum_ += tmp
    33         1            3      3.0      0.0      return sum_

I won't go into the details of the results, but as you can see np.sum is definetly the bottleneck in func3 and func4. I already guessed that np.sum is the bottleneck before I wrote the answer but these line-profilings actually verify that it is the bottleneck.

Which leads to a very important fact when using NumPy:

Know when to use it! Small arrays aren't worth it (mostly).
Know the NumPy functions and just use them. They already use (if avaiable) compiler optimization flags to unroll loops.

If you really believe some part is too slow then you can use:

NumPy's C API and process the array with C (can be really easy with Cython but you can also do it manually)
Numba (based on LLVM).

But generally you probably can't beat NumPy for moderatly sized (several thousand entries and more) arrays.

Visualization of the timings:

%matplotlib notebook

import matplotlib.pyplot as plt

# Average time per sum-call
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time it takes to sum them [seconds]')
ax.grid(which='both')
ax.legend()

# Average time per element
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time per element [seconds / element]')
ax.grid(which='both')
ax.legend()

The plots are log-log, I think it was the best way to visualize the data given that it extends several orders of magnitude (I just hope it's still understandable).

The first plot shows how much time it takes to do the sum:

The second plot shows the average time it takes to do the sum divided by the number of elements in the array. This is just another way to interpret the data:

Thanks a lot for the detailed answer, these suggestions will certainly be useful for me. — Robin Yuen, May 06 '17 at 19:12
Line-profiler would seem to be very useful to me. I will certainly look into that in future questions. Thank you. — Robin Yuen, May 07 '17 at 03:45

Divakar · Answer 2 · 2017-05-06T19:10:28.757

Based on the tests (shown next) it seems, you are beaten by the functional call overhead. Alongwith the vectorized capability of NumPy functions/tools, we need to give it enough data for crunching. With func3, we are giving it just 4 elements per call to np.sum.

Let's investigate the per call overhead for np.sum. Here's np.sum starting with summing of none, one element and onwards -

In [90]: a = np.array([])

In [91]: %timeit np.sum(a)
1000000 loops, best of 3: 1.6 µs per loop

In [61]: a = np.array([0])

In [62]: %timeit np.sum(a)
1000000 loops, best of 3: 1.66 µs per loop

In [63]: a = np.random.randint(0,9,(100))

In [64]: %timeit np.sum(a)
100000 loops, best of 3: 1.79 µs per loop

In [65]: a = np.random.randint(0,9,(1000))

In [66]: %timeit np.sum(a)
100000 loops, best of 3: 2.25 µs per loop

In [67]: a = np.random.randint(0,9,(10000))

In [68]: %timeit np.sum(a)
100000 loops, best of 3: 7.27 µs per loop

and so on.

Thus, we would incur a minimum of around 1.6 u-sec per call to np.sum on the system setup for these tests.

Let's see how the scalar addition with the addition operator performs -

In [98]: def add_nums(a,b):
    ...:     return a+b
    ...: 

In [99]: %timeit add_nums(2,3)
10000000 loops, best of 3: 71.5 ns per loop

This is about 25x faster than the per call overhead for np.sum.

The obvious idea next up is to test out to see how func3 performs with more data crunching given to np.sum.

Modified func3 (the version that uses slicing) to have a variable datasize for per iteration summing :

def func3(scale_factor = 4):
    sum1 = 0
    x = np.arange(0,1000)
    for i in xrange(0,1000,scale_factor):
        sum1 += np.sum(x[i:i+scale_factor]*2)
    return sum1

Starting off with a scale_factor = 4 as used orginally -

In [83]: %timeit func1()
100000 loops, best of 3: 5.39 µs per loop

In [84]: %timeit func2()
10000 loops, best of 3: 39.8 µs per loop

In [85]: %timeit func3(scale_factor = 4)
1000 loops, best of 3: 741 µs per loop

Yes, func3 is slow.

Now, let's give more data per call to np.sum i.e. increase scale_factor -

In [86]: %timeit func3(scale_factor = 8)
1000 loops, best of 3: 376 µs per loop

In [87]: %timeit func3(scale_factor = 20)
10000 loops, best of 3: 152 µs per loop

In [88]: %timeit func3(scale_factor = 100)
10000 loops, best of 3: 33.5 µs per loop

and so on until we feed in the entire data to np.sum for the maximum limit to performance with np.sum and minimum call overhead.

Oh wow, that is a very detailed answer. So I guess as long as the summation takes significantly more time than the function call I can safety partition the tasks and sum it up to reduce memory usage. Thanks a lot! — Robin Yuen, May 06 '17 at 18:53
The `%timeit 1+2` is timing just the time Python needs to load a constant - Python won't do any addition there (see also ["Why is ** with variables so much slower than with values"](http://stackoverflow.com/q/36636196/5393381)). — MSeifert, May 06 '17 at 19:00
@MSeifert Thanks for pointing it out! Edited with a more correct version of addtion. — Divakar, May 06 '17 at 19:04
I guess that the takeaway is to NOT do micro-optimisations. Thanks for the proof ! — Ciprian Tomoiagă, May 06 '17 at 23:11

score 3 · Answer 3 · answered May 06 '17 at 18:11

First of all, nobody would write the third variant in C, because the compiler should do the necessary optimizations.

So take the first one, you have two creations of numpy arrays (arange and *2) and one summation. Creating complex objects like numpy arrays takes some time, but each vector operation is written in C code and very fast.

The second one only uses primitive python operations (about 3000, iteration, multipication and summation), which are written in C and very fast.

Third variant you create about 2 * 250 numpy arrays (a comparably slow operation), which leads to 100 times slower execution speed compared to only creating 2 numpy arrays.

If you have concerns in respect to memory usage, you should use inline operations, which only creates one array:

x = np.arange(1000)
x *= 2
return x.sum()

If you still have to use too much memory, divide your operations in chunks as large as possible.

Thanks for your answer. I wasn't aware of inline operations in numpy. However the code is still a lot slower than func1. I've made a response in my question, please have a look. Thanks a lot! — Robin Yuen, May 06 '17 at 18:33

Performance in different vectorization method in numpy

3 Answers3

Visualization of the timings:

Linked