
I am a bit baffled by the following profiling results, and would like to hear some explanation in order to make sense of them. I thought I would take the inner-product as a simple function to compare different possible implementations:

import numpy as np

def iterprod(a,b):
    for x,y in zip(a,b):
        yield x*y

def dot1(a,b):
    return sum([ x*y for x,y in zip(a,b) ])

def dot2(a,b):
    return sum(iterprod(a,b))

def dot3(a,b):
    return np.dot(a,b)

The first implementation dot1 is a "naive" one, where we first create a new list of pairwise products, and then sum its elements. I thought the second implementation dot2 would be a bit smarter, because it eliminates the need to create a new list. And the third implementation dot3 uses Numpy's dot function.

In order to profile these functions, I am using the following:

import timeit

def showtime( fun, a, b, rep=500 ):
    def wrapped():
        return fun(a,b)
    t = timeit.Timer( wrapped )
    print( t.timeit(rep) )

Scenario 1: Python lists

import random
n = 100000

a = [ random.random() for k in range(n) ]
b = [ random.random() for k in range(n) ]

showtime( dot1, a, b )
showtime( dot2, a, b )
showtime( dot3, a, b )

output:

3.883254656990175
3.9970695309893927
2.5059548830031417

So the "smarter" implementation dot2 actually performs worse than the naive one dot1, and Numpy is much faster than both. But then..

Scenario 2: Python arrays

I thought that using a numeric container such as array.array might enable some optimisations under the hood.

import array
a = array.array( 'd', a )
b = array.array( 'd', b )

showtime( dot1, a, b )
showtime( dot2, a, b )
showtime( dot3, a, b )

output:

4.048957359002088
5.460344396007713
0.005460165994009003

Nope. If anything, it made things worse for the pure-Python implementations, accentuating the difference between the "naive" and "smart" versions, and now Numpy is 3 orders of magnitude faster!


Questions

Q1. The only way I can make sense of these results is if Numpy actually copies the data before processing it in Scenario 1, whereas it only "points" to it in Scenario 2. Does that sound reasonable?
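
One way I could imagine testing that hunch (just a sketch, I have not measured this carefully) would be to time the list-to-ndarray conversion separately from the dot product itself:

import timeit
import numpy as np

setup = "import numpy as np; import random; a = [random.random() for _ in range(100000)]"

# cost of copying the list into a fresh float64 array
print(timeit.timeit("np.asarray(a)", setup=setup, number=500))

# cost of the dot product once the data is already an ndarray
print(timeit.timeit("np.dot(x, x)", setup=setup + "; x = np.asarray(a)", number=500))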

Q2. Why is my "smart" implementation systematically slower than the "naive" one? If my hunch for Q1 is correct, it's entirely possible that creating a new list first is actually faster, provided sum does something smart under the hood. Is that it?

Q3. 3 orders of magnitude! How is that possible? Are my implementations really dumb, or are there some magical processor instructions to compute the dot product?

Jonathan H
  • you can likely make version 1 faster by removing the `[ ]` around the list comprehension to turn it into a generator instead – Azsgy Apr 05 '18 at 18:12
  • @Azsgy It basically becomes `dot2` then; I just tested what you said, and it gives similar runtimes. It actually performs _worse_ than with `[]`. :) – Jonathan H Apr 05 '18 at 18:14
  • Interesting, that doesn't make any sense with my understanding. It has to create the extra list after all – Azsgy Apr 05 '18 at 18:15
  • Take a look at https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists as well – user2699 Apr 05 '18 at 18:58

2 Answers


The generator / yield machinery does cost some CPU cycles. What it saves you is memory when you don't need the whole sequence at once, and it helps when you want to interleave several dependent computations to lower your latency, i.e. the time to the first item in the sequence.
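
A quick sketch of the memory point (the names here are just for illustration): the list comprehension materialises every result at once, while the generator only keeps its own small frame alive:

import sys

n = 100000
lst = [x * x for x in range(n)]   # all n results live in memory at once
gen = (x * x for x in range(n))   # only the generator frame exists until you iterate

print(sys.getsizeof(lst))   # grows with n
print(sys.getsizeof(gen))   # small and constant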

Using a numpy function on an array just lets it run regular C code over a contiguous chunk of memory, without dereferencing float objects from pointers in a list. So it becomes insanely fast (which is the whole point of numpy).
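
As a hedged illustration (exact timings depend on your machine and BLAS build), the usual pattern is therefore to convert to ndarrays once, up front, so np.dot never has to copy per call:

import random
import numpy as np

n = 100000
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]

# pay the conversion cost once; np.dot then runs straight over the
# contiguous float64 buffers
a_np = np.asarray(a, dtype=np.float64)
b_np = np.asarray(b, dtype=np.float64)

print(np.dot(a_np, b_np))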

9000
  • It is also worth mentioning that using Numpy's native arrays is the best choice for large datasets. – Mazdak Apr 05 '18 at 18:24

Q1

Python lists contain pointers to Python objects, while arrays contain the numbers directly. The underlying numpy code, however, expects a contiguous array of raw values. So when passed a list, it has to read the value out of each float object and copy it into a new array, one element at a time.
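
A small sketch of the difference (this relies on array.array exposing the buffer protocol, which is my understanding of how numpy picks it up): an array.array can be wrapped without a copy, while a list always has to be copied element by element:

import array
import numpy as np

arr = array.array('d', [1.0, 2.0, 3.0])
lst = [1.0, 2.0, 3.0]

view = np.frombuffer(arr, dtype=np.float64)   # wraps the existing buffer, no copy
arr[0] = 42.0
print(view[0])   # 42.0 -- the ndarray sees the change, so it shares memory

copied = np.asarray(lst)   # has to allocate a fresh array and unbox every float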

As noted in the comments, using numpy's built-in arrays is even better.

Q2

Getting a value from a generator is (from memory) only slightly less expensive than a Python function call. That is a lot more expensive than the list comprehension, where everything but the x*y is handled inside the interpreter loop. A list has to be produced, which costs something, but that cost appears to be quickly amortized.
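
A rough illustration of that per-item cost (machine-dependent, just a sketch): advancing a generator resumes its frame each time, which costs about as much as an ordinary function call:

import timeit

setup_gen = "import itertools; g = (x for x in itertools.count())"
setup_fun = "def f(): return 0"

print(timeit.timeit("next(g)", setup=setup_gen, number=1000000))   # resume generator frame
print(timeit.timeit("f()", setup=setup_fun, number=1000000))       # plain function call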

Q3

Numpy is three orders of magnitude faster because it is built on very highly optimized low-level libraries. Depending on the backend used, it might even use multiple threads. Python has to deal with a lot of overhead to make things easier for you, the programmer, at every step of the way, so it really isn't a fair race.
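
If you want to see which low-level library your numpy build delegates to (the output format varies between numpy versions; this is just a sanity check, not a benchmark):

import numpy as np

# prints the BLAS/LAPACK configuration numpy was compiled against
# (e.g. OpenBLAS or MKL), which is where the threading comes from
np.show_config()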

Bonus

I made my generator suggestion because constructing the list is usually a significant overhead. However, in this case it is moot: sum() over a list turns out to be faster than sum() over a generator.
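
For reference, the comparison I had in mind (a sketch; numbers will differ per machine):

import timeit

setup = "import random; a = [random.random() for _ in range(100000)]; b = list(a)"

# list comprehension: build the whole list, then sum it
print(timeit.timeit("sum([x*y for x, y in zip(a, b)])", setup=setup, number=100))

# generator expression: no intermediate list, but one frame resume per item
print(timeit.timeit("sum(x*y for x, y in zip(a, b))", setup=setup, number=100))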

Azsgy