
I have two one-dimensional numpy arrays, va and vb, which are being used to populate a matrix by passing all pair combinations to a function.

import numpy as np

na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
    for j in range(nb):
        D[i, j] = foo(va[i], vb[j])

As it stands, this piece of code takes a very long time to run because va and vb are relatively large (4626 and 737 elements). However, I am hoping this can be improved, given that a similar procedure is performed with very good performance using the cdist method from scipy.

D = cdist(va, vb, metric)

I am obviously aware that scipy has the benefit of running this piece of code in C rather than in Python - but I'm hoping there is some numpy function I'm unaware of that can execute this quickly.
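For reference, if foo happened to be expressible in terms of elementwise numpy operations (a big "if" for a custom function), broadcasting would build the whole matrix without any Python-level loops. A sketch assuming a hypothetical squared-difference foo:

```python
import numpy as np

va = np.random.rand(4626)
vb = np.random.rand(737)

# va[:, None] has shape (4626, 1), vb[None, :] has shape (1, 737);
# broadcasting expands both to (4626, 737) and computes every pair at once.
D = (va[:, None] - vb[None, :]) ** 2
```

This only works when the body of foo can be rewritten with array operations; it does not apply to an opaque scalar function.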

Michael Aquilina
  • Use vectorized functions which will process all elements of va and vb at once using Numpy's `outer` functions... Or passing a meshgrid of va and vb ... – Saullo G. P. Castro Oct 31 '14 at 11:03
  • Problem is that this function is custom and changing to be vectorized is non trivial. I tried using `np.meshgrid` and then using `np.vectorize` but the performance improvements were minimal. – Michael Aquilina Oct 31 '14 at 11:17
  • 1
    `vectorize` does not do anything to the internals of your `foo`. It's just a wrapper than ends up calling `foo(a,b)` with each pair of scalars. It's a convenience, not a speedup tool. – hpaulj Oct 31 '14 at 20:36

3 Answers


One of the lesser-known numpy functions, found in what the docs call the functional programming routines, is np.frompyfunc. It creates a numpy ufunc from a Python function: not some other object that closely simulates a ufunc, but a proper ufunc with all its bells and whistles. While its behavior is in many respects very similar to np.vectorize, it has some distinct advantages, which the following code hopefully highlights:

In [2]: def f(a, b):
   ...:     return a + b
   ...:

In [3]: f_vec = np.vectorize(f)

In [4]: f_ufunc = np.frompyfunc(f, 2, 1)  # 2 inputs, 1 output

In [5]: a = np.random.rand(1000)

In [6]: b = np.random.rand(2000)

In [7]: %timeit np.add.outer(a, b)  # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop

In [8]: %timeit f_vec(a[:, None], b)  # 50x slower than np.add
1 loops, best of 3: 488 ms per loop

In [9]: %timeit f_ufunc(a[:, None], b)  # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop

In [10]: %timeit f_ufunc.outer(a, b)  # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop

So while it is still clearly inferior to a properly vectorized implementation, it is a little faster: the looping happens in C, but you still pay the Python function call overhead for every element.
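One caveat worth noting: a ufunc created by np.frompyfunc always returns arrays of dtype object, so you typically want to cast the result back to a numeric dtype before further numeric work. A small sketch:

```python
import numpy as np

f_ufunc = np.frompyfunc(lambda a, b: a + b, 2, 1)

out = f_ufunc.outer(np.arange(3.0), np.arange(4.0))
print(out.dtype)            # object, not float64

D = out.astype(np.float64)  # cast back to a numeric dtype
```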

Jaime

cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.

Since you want to apply the operation generically to any given foo function, you have no choice but to call that function na * nb times. That part is not likely to be further optimizable.

What's left to optimize are the loops and the indexing. Some suggestions to try out:

  1. Use xrange instead of range (on Python 2.x; on Python 3, range is already lazy)
  2. Use enumerate instead of range plus explicit indexing
  3. Use a Python-acceleration tool, such as Cython or Numba, to speed up the looping.

If you can make further assumptions about foo, it might be possible to speed it up further.

shx2

Like @shx2 said, it all depends on what foo is. If you can express it in terms of numpy ufuncs, then use the outer method:

In [11]: N = 400

In [12]: B = np.empty((N, N))

In [13]: x = np.random.random(N)

In [14]: y = np.random.random(N)

In [15]: %%timeit
for i in range(N):
   for j in range(N):
     B[i, j] = x[i] - y[j]
   ....: 
10 loops, best of 3: 87.2 ms per loop

In [16]: %timeit A = np.subtract.outer(x, y)   # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop

Otherwise you can push the looping down to the cython level. Continuing the trivial example above:

In [45]: %%cython
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def foo(double[::1] x, double[::1] y, double[:, ::1] out):
    cdef int i, j
    for i in xrange(x.shape[0]):
        for j in xrange(y.shape[0]):
            out[i, j] = x[i] - y[j]
   ....: 

In [46]: foo(x, y, B)

In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True

In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop

The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.

ev-br
  • Well I have two cases of this code being run. One is calculating the jaccard distance between two vectors. I am aware scipy has a function for this but the behaviour is slightly different in our case. The other is performing a lookup in the dictionary `D[i, j] = a.get((va[i], vb[j]), 1.0)` – Michael Aquilina Oct 31 '14 at 14:20
  • 1
    Well, it depends and needs to be measured for your particular setup -- for which `%timeit` is your friend. If you get stuck somewhere en route, best ask a separate question with details. Good luck! – ev-br Oct 31 '14 at 14:42