14

I'm comparing Python accelerators (Numba, Cython, f2py) to simple For loops and Numpy's einsum for a particular problem (see below). So far Numpy is the fastest for this problem (factor 6x faster), but I wanted some feedback if there are additional optimizations I should try, or if I'm doing something wrong. This simple code is based on a larger code that has a number of these einsum calls, but no explicit for loops. I'm checking if any of these accelerators can do better.

Timings done with Python 2.7.9 on Mac OS X Yosemite, with gcc-5.3.0 installed (--with-fortran --without-multilib) from Homebrew. Also did %timeit calls; these single call timings are fairly accurate.

In [1]: %run -i test_numba.py
test_numpy: 0.0805640220642
Matches Numpy output: True

test_dumb: 1.43043899536
Matches Numpy output: True

test_numba: 0.464295864105
Matches Numpy output: True

test_cython: 0.627640008926
Matches Numpy output: True

test_f2py: 5.01890516281
Matches Numpy output: True

test_f2py_order: 2.31424307823
Matches Numpy output: True

test_f2py_reorder: 0.507861852646
Matches Numpy output: True

The main code:

import numpy as np
import numba
import time
import test_f2py as tf2py
import pyximport
pyximport.install(setup_args={'include_dirs':np.get_include()})
import test_cython as tcyth

def test_dumb(f,b):
    fnew = np.empty((f.shape[1],f.shape[2]))
    for i in range(f.shape[0]):
        for l in range(f.shape[3]):
            fnew += f[i,:,:,l] * b[i,l]
    return fnew


def test_dumber(f,b):
    fnew = np.empty((f.shape[1],f.shape[2]))
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            for k in range(f.shape[2]):
                for l in range(f.shape[3]):
                    fnew[j,k] += f[i,j,k,l] * b[i,l]
    return fnew

@numba.jit(nopython=True)
def test_numba(f,b):
    fnew = np.zeros((f.shape[1],f.shape[2])) #NOTE: can't be empty, gives errors
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            for k in range(f.shape[2]):
                for l in range(f.shape[3]):
                    fnew[j,k] += f[i,j,k,l] * b[i,l]
    return fnew

def test_numpy(f,b):
    return np.einsum('i...k,ik->...',f,b)

def test_f2py(f,b):
    return tf2py.test_f2py(f,b)

def test_f2py_order(f,b):
    return tf2py.test_f2py(f,b)

def test_f2py_reorder(f,b):
    return tf2py.test_f2py_reorder(f,b)

def test_cython(f,b):
    return tcyth.test_cython(f,b)

if __name__ == '__main__':

    #goal is to create: fnew = sum f*b over dim 0 and 3.
    f = np.random.rand(32,33,2000,64)
    b = np.random.rand(32,64)

    f1 = np.asfortranarray(f)
    b1 = np.asfortranarray(b)

    f2 = np.asfortranarray(np.transpose(f,[1,2,0,3]))

    funcs = [test_dumb,test_numba, test_cython, \
            test_f2py,test_f2py_order,test_f2py_reorder]

    tstart = time.time()    
    fnew_numpy= test_numpy(f,b)
    tstop = time.time()
    print test_numpy.__name__+': '+str(tstop-tstart)
    print 'Matches Numpy output: '+str(np.allclose(fnew_numpy,fnew_numpy))
    print ''

    for func in funcs:
        tstart = time.time()
        if func.__name__ == 'test_f2py_order':
            fnew = func(f1,b1)
        elif func.__name__ == 'test_f2py_reorder':
            fnew = func(f2,b1)
        else:
            fnew = func(f,b)
        tstop = time.time()
        print func.__name__+': '+str(tstop-tstart)
        print 'Matches Numpy output: '+str(np.allclose(fnew,fnew_numpy))
        print ''

The f2py file (compiled with f2py -c -m test_f2py test_f2py.F90):

!file: test_f2py
subroutine test_f2py(f,b,fnew,n1,n2,n3,n4)

integer  :: n1,n2,n3,n4
real(8), dimension(n1,n2,n3,n4) :: f
real(8), dimension(n1,n4) :: b
real(8), dimension(n2,n3) :: fnew
!f2py intent(in) f
!f2py intent(in) b
!f2py intent(out) fnew
!f2py intent(in) n1
!f2py intent(in) n2
!f2py intent(in) n3
!f2py intent(in) n4

integer :: i1,i2,i3,i4

do i1=1,n1
    do i2=1,n2
        do i3=1,n3
            do i4=1,n4
                fnew(i2,i3) = fnew(i2,i3) + f(i1,i2,i3,i4)*b(i1,i4)
            enddo
        enddo
    enddo
enddo

end subroutine test_f2py

subroutine test_f2py_reorder(f,b,fnew,n1,n2,n3,n4)

integer  :: n1,n2,n3,n4
real(8), dimension(n1,n2,n3,n4) :: f
real(8), dimension(n3,n4) :: b
real(8), dimension(n1,n2) :: fnew
!f2py intent(in) f
!f2py intent(in) b
!f2py intent(out) fnew
!f2py intent(in) n1
!f2py intent(in) n2
!f2py intent(in) n3
!f2py intent(in) n4

integer :: i1,i2,i3,i4

do i3=1,n3
    do i4=1,n4
        do i1=1,n1
            do i2=1,n2
                fnew(i1,i2) = fnew(i1,i2) + f(i1,i2,i3,i4)*b(i3,i4)
            enddo
        enddo
    enddo
enddo

end subroutine test_f2py_reorder

And the Cython .pyx file (compiled with pyximport in the main routine):

#/usr/bin python
import numpy as np
cimport numpy as np

def test_cython(np.ndarray[np.float64_t,ndim=4] f, np.ndarray[np.float64_t,ndim=2] b):
    # cdef np.ndarray[np.float64_t,ndim=4] f
    # cdef np.ndarray[np.float64_t,ndim=2] b
    cdef np.ndarray[np.float64_t,ndim=2] fnew = np.empty((f.shape[1],f.shape[2]),dtype=np.float64)
    cdef int i,j,k,l
    cdef int Ni = f.shape[0]
    cdef int Nj = f.shape[1]
    cdef int Nk = f.shape[2]
    cdef int Nl = f.shape[3]

    for i in range(Ni):
        for j in range(Nj):
            for k in range(Nk):
                for l in range(Nl):
                    fnew[j,k] += f[i,j,k,l] * b[i,l]
    return fnew
Michael
  • 486
  • 6
  • 19
  • Since you already have working code, your question might be better suited to [CodeReview.SE](http://codereview.stackexchange.com/) – ali_m Jan 29 '16 at 18:40
  • 2
    On my laptop (OSX 10.9.5) running Numba 0.23.1 `test_numpy()` takes 75.5 ms per loop using `%timeit` and `test_numba()` takes 123 ms per loop, so the difference doesn't seem as extreme as in your test. You want to be especially careful when benchmarking numba code that you call it once to actually jit the code outside of the benchmark, otherwise you'll include that cost in your numbers, whereas every subsequent call will be much faster. – JoshAdel Jan 29 '16 at 19:31

2 Answers2

11

Normally these accelerators are used to speed up code with Python loops or a lot of intermediary results, whereas einsum is already pretty well optimized (see source). You shouldn't expect that they easily beat einsum, but you might get close to it in performance.

For Numba it is important to exclude the compilation time from the benchmark. This can be accomplished simply by running the jitted function twice (with the same type of inputs). E.g. with IPython I get:

f = np.random.rand(32,33,500,64)
b = np.random.rand(32,64)

%time _ = test_numba(f,b)  # First invocation
# Wall time: 466 ms
%time _ = test_numba(f,b)
# Wall time: 73 ms
%timeit test_numba(f, b)
# 10 loops, best of 3: 72.7 ms per loop
%timeit test_numpy(f, b)
# 10 loops, best of 3: 62.8 ms per loop

For your Cython code a number of improvements can be made:

  1. Disable checks for array bounds and wraparound, see compiler directives.
  2. Specify that the arrays are contiguous.
  3. Use typed memoryviews.

Something like:

cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_cython(double[:,:,:,::1] f, double[:,::1] b):
    cdef int i, j, k, l, Ni, Nj, Nk, Nl
    Ni = f.shape[0]
    Nj = f.shape[1]
    Nk = f.shape[2]
    Nl = f.shape[3]

    fnew = np.empty((Nj, Nk))
    cdef double[:,::1] fnew_v = fnew

    for i in range(Ni):
        for j in range(Nj):
            for k in range(Nk):
                for l in range(Nl):
                    fnew_v[j,k] += f[i,j,k,l] * b[i,l]
    return fnew

On an up-to-date Ubuntu 15.10 (x86) this gives me the same speed as einsum. However, on Windows (x86) on the same PC with the Anaconda distribution this Cython code is about half the speed of einsum. I think this may have to do with gcc versions (5.2.1 vs 4.7.0) and the ability to insert SSE instructions (einsum is coded with SSE2 intrinsics). Maybe supplying different compiler options would help, but I'm not sure.

I hardly know any Fortran so I can't comment on that.

Since your goal is to beat einsum I think the obvious next step is to look at increasing parallelism. It should be fairly easy to spawn some threads with cython.parallel. If that doesn't saturate your systems memory bandwidth yet, then you could try to explicitly include the newest CPU instructions like AVX2 and Fused Multiply-Add.

Another thing you could try is to reorder and reshape f and do your operation with np.dot. If your Numpy comes with a good BLAS library, this should enable pretty much every optimization you can think of, though at the cost of a loss of generality and maybe a very expensive copy of the f array.

1

Once it's done parsing the the string parameter, einsum uses a compiled version of nditer to perform a sum-of-products calculation across all axes. The source code is easily found on the numpy github.

A while back I worked out an einsum work-alike as part writing a patch. As part of that I wrote a cython script that does the sum-of-product. You can see this code at:

https://github.com/hpaulj/numpy-einsum

I didn't try to get my code to run at einsum speed. I was just trying to understand how it worked.

hpaulj
  • 221,503
  • 14
  • 230
  • 353