I am a bit baffled by the following profiling results and would like some explanation to help make sense of them. I thought I would take the inner product as a simple function to compare different possible implementations:
import numpy as np

def iterprod(a,b):
    for x,y in zip(a,b):
        yield x*y

def dot1(a,b):
    return sum([ x*y for x,y in zip(a,b) ])

def dot2(a,b):
    return sum(iterprod(a,b))

def dot3(a,b):
    return np.dot(a,b)
The first implementation dot1 is a "naive" one, where we first create a new list of pairwise products and then sum its elements. I thought the second implementation dot2 would be a bit smarter, because it eliminates the need to create a new list. And the third implementation dot3 uses Numpy's dot function.
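As an aside, the same idea can also be written as a plain generator expression, which avoids both the intermediate list and the extra generator function; this dot2b variant is just a sketch and is not among the implementations timed below.

def dot2b(a,b):
    # generator expression: no intermediate list, no separate generator function
    return sum(x*y for x,y in zip(a,b))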
In order to profile these functions, I am using the following:
import timeit

def showtime( fun, a, b, rep=500 ):
    def wrapped():
        return fun(a,b)
    t = timeit.Timer( wrapped )
    print( t.timeit(rep) )
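(A side note on the helper: timeit.timeit also accepts a callable directly, so an equivalent version that reports the average time per call could look like the sketch below; showtime_percall is not used in the measurements that follow.)

def showtime_percall( fun, a, b, rep=500 ):
    # timeit.timeit takes a callable; number=rep repeats the call rep times
    total = timeit.timeit( lambda: fun(a,b), number=rep )
    print( total / rep )  # average seconds per call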
Scenario 1: Python lists
import random
n = 100000
a = [ random.random() for k in range(n) ]
b = [ random.random() for k in range(n) ]
showtime( dot1, a, b )
showtime( dot2, a, b )
showtime( dot3, a, b )
output:
3.883254656990175
3.9970695309893927
2.5059548830031417
So the "smarter" implementation dot2
actually performs worse than the naive one dot1
, and Numpy is much faster than both. But then..
Scenario 2: Python arrays
I thought that using a numeric container, like array.array, might enable certain optimisations under the hood.
import array
a = array.array( 'd', a )
b = array.array( 'd', b )
showtime( dot1, a, b )
showtime( dot2, a, b )
showtime( dot3, a, b )
output:
4.048957359002088
5.460344396007713
0.005460165994009003
Nope. If anything, it made things worse for the pure-Python implementations, accentuating the difference between the "naive" and "smart" versions, and now Numpy is 3 orders of magnitude faster!
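One way to check whether Numpy copies the data or merely wraps it: an array.array exposes its buffer, so np.frombuffer can create a view onto it without copying, whereas np.array forces a copy. A small check along these lines (my own sketch, not part of the timings above):

import array
import numpy as np

arr = array.array( 'd', [1.0, 2.0, 3.0] )

view = np.frombuffer(arr)   # wraps arr's existing memory, no copy
copy = np.array(arr)        # forces an independent copy

arr[0] = 42.0
print( view[0] )   # 42.0 -> the view shares memory with the array.array
print( copy[0] )   # 1.0  -> the copy does not see the change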
Questions
Q1. The only way I can make sense of these results is if Numpy actually copies the data before processing it in Scenario 1, whereas it only "points" to it in Scenario 2. Does that sound reasonable? (A sketch of how this could be tested follows after Q3.)
Q2. Why is my "smart" implementation performing systematically slower than the "naive" one? If my hunch for Q1 is correct, it's entirely possible that it is faster to create a new list if sum does something smart under the hood. Is that it?
Q3. Three orders of magnitude! How is that possible? Are my implementations really dumb, or are there some magical processor instructions to compute the dot product?
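Regarding Q1, here is a sketch of how the copy hypothesis could be tested (these are not timings I have run): measure the list-to-ndarray conversion on its own, and measure np.dot on arrays that have already been converted. If the hypothesis is right, the conversion should account for most of the 2.5 s measured for dot3 in Scenario 1.

import random
import timeit
import numpy as np

n = 100000
a = [ random.random() for k in range(n) ]
b = [ random.random() for k in range(n) ]

# cost of converting the Python lists to ndarrays, by itself
print( timeit.timeit( lambda: (np.asarray(a), np.asarray(b)), number=500 ) )

# cost of the dot product once the data is already in ndarrays
aa, bb = np.asarray(a), np.asarray(b)
print( timeit.timeit( lambda: np.dot(aa, bb), number=500 ) )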