7

Consider two ndarrays of length n, arr1 and arr2. I'm computing the following sum of products, and doing it num_runs times to benchmark:

import numpy as np
import time

num_runs = 1000
n = 100

arr1 = np.random.rand(n)
arr2 = np.random.rand(n)

start_comp = time.clock()
for r in xrange(num_runs):
    sum_prods = np.sum( [arr1[i]*arr2[j] for i in xrange(n) 
                         for j in xrange(i+1, n)] )

print "total time for comprehension = ", time.clock() - start_comp

start_loop = time.clock()
for r in xrange(num_runs):
    sum_prod = 0.0
    for i in xrange(n):
        for j in xrange(i+1, n):
            sum_prod += arr1[i]*arr2[j]

print "total time for loop = ", time.clock() - start_loop

The output is

total time for comprehension = 3.23097066953
total time for comprehension = 3.9045544426

so using list comprehension appears faster.

Is there a much more efficient implementation, using Numpy routines perhaps, to calculate such a sum of products?

bcf
  • 2,104
  • 1
  • 24
  • 43
  • Could this be useful? https://stackoverflow.com/questions/9068478/how-to-parallelize-a-sum-calculation-in-python-numpy – Frank Bryce May 24 '16 at 14:51
  • Seems very relevant : [`Matrix multiplication with iterator dependency - NumPy`](http://stackoverflow.com/questions/36045510/matrix-multiplication-with-iterator-dependency-numpy). – Divakar May 24 '16 at 15:24

3 Answers3

12

Rearrange the operation into an O(n) runtime algorithm instead of O(n^2), and take advantage of NumPy for the products and sums:

# arr1_weights[i] is the sum of all terms arr1[i] gets multiplied by in the
# original version
arr1_weights = arr2[::-1].cumsum()[::-1] - arr2

sum_prods = arr1.dot(arr1_weights)

Timing shows this to be about 200 times faster than the list comprehension for n == 100.

In [21]: %%timeit
   ....: np.sum([arr1[i] * arr2[j] for i in range(n) for j in range(i+1, n)])
   ....: 
100 loops, best of 3: 5.13 ms per loop

In [22]: %%timeit
   ....: arr1_weights = arr2[::-1].cumsum()[::-1] - arr2
   ....: sum_prods = arr1.dot(arr1_weights)
   ....: 
10000 loops, best of 3: 22.8 µs per loop
user2357112
  • 260,549
  • 28
  • 431
  • 505
8

A vectorized way : np.sum(np.triu(np.multiply.outer(arr1,arr2),1)).

for a 30x improvement:

In [9]: %timeit np.sum(np.triu(np.multiply.outer(arr1,arr2),1))
1000 loops, best of 3: 272 µs per loop

In [10]: %timeit np.sum( [arr1[i]*arr2[j] for i in range(n) 
                         for j in range(i+1, n)]
100 loops, best of 3: 7.9 ms per loop

In [11]: allclose(np.sum(np.triu(np.multiply.outer(arr1,arr2),1)),
np.sum(np.triu(np.multiply.outer(arr1,arr2),1)))
Out[11]: True

Another fast approch is to use numba :

from numba import jit
@jit
def t(arr1,arr2):
    s=0
    for i in range(n):
        for j in range(i+1,n):
            s+= arr1[i]*arr2[j]
    return s

for a 10x new factor :

In [12]: %timeit t(arr1,arr2)
10000 loops, best of 3: 21.1 µs per loop

And using @user2357112 minimal answer,

@jit
def t2357112(arr1,arr2):
    s=0
    c=0
    for i in range(n-2,-1,-1):
        c += arr2[i+1]
        s += arr1[i]*c
    return s

for

In [13]: %timeit t2357112(arr1,arr2)
100000 loops, best of 3: 2.33 µs per loop

, just doing the necessary operations.

B. M.
  • 18,243
  • 2
  • 35
  • 54
  • The numba solution is nice since because it doesn't involve any intermediate array creation – JoshAdel May 24 '16 at 15:17
  • I think you might have gotten the bounds wrong when translating my code to numba. You seem to be trying to access `arr2[n]`, past the last element of `arr2`. – user2357112 May 24 '16 at 16:25
  • you are right. Since numba don't check bounds, it gave silently the right result on my try ......edited. – B. M. May 24 '16 at 21:10
3

You can use the following broadcasting trick:

a = np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
b = np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
print a == b  # True

Basically, I'm paying the price of calculating the product of all elements pairwise in arr1 and arr2 to take advantage of the speed of numpy broadcasting/vectorization being done much faster in low-level code.

And timings:

%timeit np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
10000 loops, best of 3: 55.9 µs per loop

%timeit np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
1000 loops, best of 3: 1.45 ms per loop
JoshAdel
  • 66,734
  • 27
  • 141
  • 140
  • Or to save on memory, though a bit slower could be : `R,C = np.triu_indices(n,1); output = np.dot(arr1[R],arr2[C])`. – Divakar May 24 '16 at 15:25