5

Does anyone know some of the key differences between the parakeet and Numba JITs? I am curious because I was comparing Numexpr to Numba and parakeet for a particular expression, one that I expected to perform very well with Numexpr, since it is the example mentioned in its documentation.

So the results are

[Plot: benchmark results for the different implementations]

Here are the functions I tested (via timeit, taking the minimum of 3 repetitions with 10 loops per function):

import numpy as np
import numexpr as ne
from numba import jit as numba_jit
from parakeet import jit as para_jit


def numpy_complex_expr(A, B):
    return A*B - 4.1*A > 2.5*B

def numexpr_complex_expr(A, B):
    return ne.evaluate('A*B-4.1*A > 2.5*B')

@numba_jit
def numba_complex_expr(A, B):
    return A*B-4.1*A > 2.5*B

@para_jit
def parakeet_complex_expr(A, B):
    return A*B-4.1*A > 2.5*B

You can also grab the IPython notebook if you'd like to double-check the results on your machine.
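For reference, a minimal sketch of the timing setup described above (plain `timeit`, shown here only for the NumPy version; the same harness applies to the other functions):

```python
import timeit

import numpy as np

def numpy_complex_expr(A, B):
    return A*B - 4.1*A > 2.5*B

n = 1000
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# 3 repetitions of 10 loops each; report the best per-loop time,
# mirroring the "minimum of 3 repetitions and 10 loops" setup above
times = timeit.repeat(lambda: numpy_complex_expr(A, B), repeat=3, number=10)
print("best: %.2f ms per loop" % (min(times) / 10 * 1e3))
```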

In case anyone is wondering whether Numba is installed correctly: I think so, since it performed as expected in my previous benchmark:

[Plot: results from the previous benchmark]

  • 2
    I think for Numba to work you must avoid array operations and write out everything (at least the bottleneck in the code) using for-loops –  May 21 '14 at 09:53

1 Answer

5

As of the current release of Numba (which you are using in your tests), support for ufuncs with the @jit decorator is incomplete. You can, however, use @vectorize, and it is faster:

import numpy as np
from numba import jit, vectorize
import numexpr as ne

def numpy_complex_expr(A, B):
    return A*B + 4.1*A > 2.5*B

def numexpr_complex_expr(A, B):
    return ne.evaluate('A*B+4.1*A > 2.5*B')

@jit
def numba_complex_expr(A, B):
    return A*B+4.1*A > 2.5*B

@vectorize(['u1(float64, float64)'])
def numba_vec(A,B):
    return A*B+4.1*A > 2.5*B

n = 1000
A = np.random.rand(n,n)
B = np.random.rand(n,n)

Timing results:

%timeit numba_complex_expr(A,B)
1 loops, best of 3: 49.8 ms per loop

%timeit numpy_complex_expr(A,B)
10 loops, best of 3: 43.5 ms per loop

%timeit numexpr_complex_expr(A,B)
100 loops, best of 3: 3.08 ms per loop

%timeit numba_vec(A,B)
100 loops, best of 3: 9.8 ms per loop

If you want to leverage numba to its fullest, then you want to unroll any vectorized operations:

@jit
def numba_unroll2(A, B):
    C = np.empty(A.shape, dtype=np.uint8)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            C[i,j] = A[i,j]*B[i,j] + 4.1*A[i,j] > 2.5*B[i,j]

    return C

%timeit numba_unroll2(A,B)
100 loops, best of 3: 5.96 ms per loop

Also note that if you set the number of threads that numexpr uses to 1, then you'll see that its main speed advantage is that it's parallelized:

ne.set_num_threads(1)
%timeit numexpr_complex_expr(A,B)
100 loops, best of 3: 8.87 ms per loop

By default numexpr uses ne.detect_number_of_cores() as the number of threads. For my original timing on my machine, it was using 8.
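A quick sketch of checking and pinning the thread count (both functions are part of numexpr's public API):

```python
import numexpr as ne

# how many threads numexpr would use by default
print(ne.detect_number_of_cores())

# pin to one thread for an apples-to-apples single-core comparison;
# set_num_threads returns the previous setting, so you can restore it
previous = ne.set_num_threads(1)
ne.set_num_threads(previous)
```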

JoshAdel
  • Thanks a lot, I will give it a try and post the results later. One question: do you know how it differs from `from numbapro import float64` / `from numbapro import guvectorize`? http://docs.continuum.io/numbapro/generalizedufuncs.html –  May 21 '14 at 14:45
  • Are you sure about the 'unrolled' version? When I ran it for n=100 and compared it to the vectorized version, it is pretty slow: unrolled: 1 loops, best of 3: 286 ms per loop; vectorized: 10000 loops, best of 3: 115 µs per loop. However, I also implemented it in Cython and it is indeed pretty fast using memoryviews on the NumPy arrays –  May 21 '14 at 15:53
  • On my system, when I use n=100, numpy, numexpr and unrolled numba are all about the same and vectorized numba is about 2x slower. – JoshAdel May 21 '14 at 16:08
  • Okay thanks! I will give it a try on a different system ... my current results if you are interested (without unrolled numba because of the slow performance): http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb?create=1#Results –  May 21 '14 at 16:47