
It looks like calculating the cross product of an array of vectors explicitly is a lot faster than using np.cross. I've tried both vector-first and vector-last layouts, and it doesn't seem to make a difference, though that was proposed in an answer to a similar question. Am I using it wrong, or is it just slower?

The explicit calculation takes about 60 ns per cross product on a laptop (0.056 s for 10 runs over 100,000 vectors works out to ~56 ns each). Is that roughly as fast as it's going to get? In that case, there doesn't seem to be any reason to go to Cython or PyPy, or to write a special ufunc, yet.

I also see references to the use of einsum, but I don't really understand how to use it, and I suspect it is not faster anyway.

import numpy as np
import timeit

a = np.random.random(size=300000).reshape(100000,3) # vector last
b = np.random.random(size=300000).reshape(100000,3)
c, d = a.swapaxes(0, 1),  b.swapaxes(0, 1)          # vector first

def npcross_vlast():        return np.cross(a, b)
def npcross_vfirst():       return np.cross(c, d, axisa=0, axisb=0)
def npcross_vfirst_axisc(): return np.cross(c, d, axisa=0, axisb=0, axisc=0)
def explicitcross_vlast():
    e = np.zeros_like(a)
    e[:,0] = a[:,1]*b[:,2] - a[:,2]*b[:,1]
    e[:,1] = a[:,2]*b[:,0] - a[:,0]*b[:,2]
    e[:,2] = a[:,0]*b[:,1] - a[:,1]*b[:,0]
    return e
def explicitcross_vfirst():
    e = np.zeros_like(c)
    e[0,:] = c[1,:]*d[2,:] - c[2,:]*d[1,:]
    e[1,:] = c[2,:]*d[0,:] - c[0,:]*d[2,:]
    e[2,:] = c[0,:]*d[1,:] - c[1,:]*d[0,:]
    return e
print "explicit"
print timeit.timeit(explicitcross_vlast,  number=10)
print timeit.timeit(explicitcross_vfirst, number=10)
print "np.cross"
print timeit.timeit(npcross_vlast,        number=10)
print timeit.timeit(npcross_vfirst,       number=10)
print timeit.timeit(npcross_vfirst_axisc, number=10)
print all([npcross_vlast()[7,i] == npcross_vfirst()[7,i] ==
           npcross_vfirst_axisc()[i,7] == explicitcross_vlast()[7,i] ==
           explicitcross_vfirst()[i,7] for i in range(3)]) # check one

explicit
0.0582590103149
0.0560920238495
np.cross
0.399816989899
0.412983894348
0.411231040955
True
uhoh
  • Look at the code for `np.cross`. It is doing what you are doing, plus some extra code to handle the size-2 case, and some axis swapping so it can use expressions like `a[1]*b[2] - a[2]*b[1]`. As long as the large dimension is vectorized, a few explicit steps over the small dimension (size 3) don't hurt your speed. – hpaulj Apr 18 '15 at 06:18
  • (One of) my question(s) is: why is `np.cross` almost 10 times slower, independent of size or order? – uhoh Apr 18 '15 at 06:24
  • As @Jaime hinted, updating `numpy` will probably solve that issue. I see very similar timeits here on `1.9.2`. – cel Apr 18 '15 at 10:08
  • Earlier cross product speed questions: http://stackoverflow.com/questions/20908754/how-to-speed-up-a-vector-cross-product-calculation?lq=1; http://stackoverflow.com/questions/28132147/can-numpy-einsum-perform-a-cross-product-between-segments-of-a-trajectory; – hpaulj Apr 18 '15 at 14:52
  • `swapaxes` doesn't do anything for speed, since the memory layout is still the same. `vfirst` is slightly faster if the arrays are generated that way from the start. – hpaulj Apr 18 '15 at 15:15
  • As noted in the answer below, I have only 1.7.1. @hpaulj thanks for pointing out that `swapaxes` is only returning a view (see the sketch just below these comments). – uhoh Apr 18 '15 at 15:33
  • **Note: all three answers here** so far have been very helpful and are important reading for anyone looking for the fastest cross product! – uhoh Sep 23 '16 at 14:16
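
To illustrate hpaulj's comment above: `swapaxes` returns a view onto the same memory, so the layout (and hence the timing) is unchanged. A minimal check (my addition, not from the original thread):

import numpy as np

a = np.random.random((100000, 3))  # vector last, C-contiguous
c = a.swapaxes(0, 1)               # vector first, but only a view

print(c.base is a)                 # True: c is a view of a's buffer
print(a.flags['C_CONTIGUOUS'])     # True
print(c.flags['C_CONTIGUOUS'])     # False: strides changed, memory did not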

3 Answers


The performance of np.cross improved significantly in the 1.9.x release of numpy.

%timeit explicitcross_vlast()
%timeit explicitcross_vfirst()
%timeit npcross_vlast()
%timeit npcross_vfirst()
%timeit npcross_vfirst_axisc() 

These are the timings I get for 1.8.0:

100 loops, best of 3: 4.47 ms per loop
100 loops, best of 3: 4.41 ms per loop
10 loops, best of 3: 29.1 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 30.6 ms per loop

And these are the timings for 1.9.0:

100 loops, best of 3: 4.62 ms per loop
100 loops, best of 3: 4.19 ms per loop
100 loops, best of 3: 4.05 ms per loop
100 loops, best of 3: 4.09 ms per loop
100 loops, best of 3: 4.24 ms per loop

I suspect that the speedup was introduced by pull request #4338.

cel
  • Thank you @cel. Time flies - I am way back at 1.7.1, learned my lesson! – uhoh Apr 18 '15 at 15:37
  • This used to be the accepted answer - it is in fact the answer I needed (I was using an old version of numpy, before `np.cross()` had been improved). But I've switched to the answer of @NicoSchlömer - I think that is the most useful information going forward for people who find this question while searching for the fastest way to implement a cross product. Thanks again for your help! – uhoh Sep 23 '16 at 14:14

First off, if you're looking to speed up your code, you should probably try to get rid of cross products altogether. That's possible in many cases, e.g., when the cross product only appears inside a dot product, via the Binet-Cauchy identity <a x b, c x d> = <a, c><b, d> - <a, d><b, c>.
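
As a quick numerical check of that identity (my addition, not part of the original answer):

import numpy as np

a, b, c, d = np.random.rand(4, 3)  # four random 3-vectors

lhs = np.dot(np.cross(a, b), np.cross(c, d))
rhs = np.dot(a, c) * np.dot(b, d) - np.dot(a, d) * np.dot(b, c)
print(np.isclose(lhs, rhs))        # True: the right side needs no cross products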

Anyways, in case you really need explicit cross products, check out

# The Levi-Civita symbol: +1 for even permutations of (0,1,2), -1 for odd
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1

np.einsum('ijk,aj,ak->ai', eijk, a, b)
np.einsum('iak,ak->ai', np.einsum('ijk,aj->iak', eijk, a), b)

These two are equivalent to np.cross; the second uses two einsums with two arguments each, a technique suggested in a similar question.
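
The equivalence is easy to verify (a check I've added, using the arrays a and b from the question):

print(np.allclose(np.cross(a, b),
                  np.einsum('ijk,aj,ak->ai', eijk, a, b)))  # True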

The results are disappointing, though: Both of these variants are slower than np.cross (except for tiny n):

[Plot: runtime vs. len(a) for np.cross, einsum, and double einsum; np.cross is fastest except for very small n]

The plot was created with

import numpy as np
import perfplot

eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1


b = perfplot.bench(
    setup=lambda n: np.random.rand(2, n, 3),
    n_range=[2 ** k for k in range(23)],
    kernels=[
        lambda X: np.cross(X[0], X[1]),
        lambda X: np.einsum("ijk,aj,ak->ai", eijk, X[0], X[1]),
        lambda X: np.einsum("iak,ak->ai", np.einsum("ijk,aj->iak", eijk, X[0]), X[1]),
    ],
    labels=["np.cross", "einsum", "double einsum"],
    xlabel="len(a)",
)

b.save("out.png")
Nico Schlömer
  • Very informative plot! When I do Monte Carlo runs with a lot of cross products of individual 3-vectors, the `np.einsum()` seems to have a useful advantage at least. Since the slowness noted in my original question was partly because of an old version of numpy, before `np.cross()` had been improved in speed, could you note (for the record) which version you've tested? – uhoh Sep 23 '16 at 14:11

Simply changing your vlast to

def stacked_vlast(a, b):
    x = a[:,1]*b[:,2] - a[:,2]*b[:,1]
    y = a[:,2]*b[:,0] - a[:,0]*b[:,2]
    z = a[:,0]*b[:,1] - a[:,1]*b[:,0]
    return np.array([x,y,z]).T

i.e. replacing the column assignment with stacking, as the (old) `cross` does, slows it down by about 5x.

When I use a local copy of the development `cross` function, I get a minor speed improvement over your `explicitcross_vlast`. That `cross` uses the `out` parameter in an attempt to cut down on temporary arrays, but my crude tests suggest that it doesn't make much difference in speed.

https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
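
As a rough sketch of that out-parameter technique (my illustration, not the actual development code): each product can be written directly into a preallocated result, reusing one scratch buffer instead of allocating six temporaries.

def explicitcross_vlast_out(a, b, out=None):
    # assumes out does not alias a or b
    if out is None:
        out = np.empty_like(a)
    tmp = np.empty(a.shape[0])                  # one reusable scratch buffer
    np.multiply(a[:,1], b[:,2], out=out[:,0])   # x = a1*b2 - a2*b1
    np.multiply(a[:,2], b[:,1], out=tmp)
    out[:,0] -= tmp
    np.multiply(a[:,2], b[:,0], out=out[:,1])   # y = a2*b0 - a0*b2
    np.multiply(a[:,0], b[:,2], out=tmp)
    out[:,1] -= tmp
    np.multiply(a[:,0], b[:,1], out=out[:,2])   # z = a0*b1 - a1*b0
    np.multiply(a[:,1], b[:,0], out=tmp)
    out[:,2] -= tmp
    return out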

If your explicit version works, I wouldn't upgrade numpy just to get this new `cross`.

hpaulj