21

Is there any reason why the following code runs in 2s,

import numpy as np

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

while the following numba code runs in 12s?

from numba import jit

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

My x1 is a matrix of dimension (1, 512) and x2 is a matrix of dimension (3000000, 512). It is quite weird that numba can be so much slower. Am I using it wrong?

I really need to speed this up because I need to run this function 3 million times and 2s is still way too slow.

I need to run this on the CPU because, as you can see, x2 is so huge that it cannot be loaded onto a GPU (or at least my GPU); there is not enough memory.

  • It could be a matter of system configuration (e.g. your `numpy` taking advantage of your GPGPU by using OpenCL) – Basile Starynkevitch Jun 02 '18 at 16:23
  • @BasileStarynkevitch it is not possible to run on the GPU because of the memory issue. Shouldn't numba speed things up even on the CPU? –  Jun 02 '18 at 16:25
  • Numba documentation states that it's pure Python while numpy uses a lot of C; I'm guessing that's the biggest efficiency difference – Ofer Sadan Jun 02 '18 at 16:26
  • @OferSadan So Numba only speeds up non-numpy code? The documentation seems to suggest that it should speed up numpy code too. Do you have any suggestion on how I can speed this up? –  Jun 02 '18 at 16:27
  • It's just a guess but I suspect numba is cancelling some of the speed advantage of numpy in this way. I don't know of any implementation of vectors and matrices that is faster than numpy – Ofer Sadan Jun 02 '18 at 16:42
  • `np.dot` is already using fast C libraries. `numba` can't improve on that step. It's the large size of `x2` that's taking up time. I'd experiment with `np.dot(x2, x1.T)`. Does `x1` have to be (1,512)? Why not (512,)? Regardless, I don't think `sum(x1)` needs `expand_dims`; that sum is just one value. – hpaulj Jun 02 '18 at 16:44
  • @hpaulj Optimizing x1 wouldn't help much since it is already small. For my implementation, (512,) will give an error. Is it reasonable that this function takes 2 sec for the input dimensions of x1 and x2 mentioned above? –  Jun 02 '18 at 17:06
  • Please avoid editing follow-up questions into the original question. In most cases you're probably better off asking a new question, because it doesn't complicate the original question (by invalidating existing answers or becoming too broad to answer accurately), and a new question with a *different* scope will attract new (and specific) answers. Especially because you're now asking about a generalized approach which has completely different problems, like memory overhead due to broadcasting and how to arrange the loops to maximize cache hits. – MSeifert Jun 04 '18 at 07:34
  • @MSeifert Ok. I reposted here: https://stackoverflow.com/questions/50675705/is-it-expected-for-numbas-efficient-square-euclidean-distance-code-to-be-slower. –  Jun 04 '18 at 07:49
  • @user10102398 Which type of answer would you expect? (The code shown above isn't a good idea in either Numba or Numpy.) – max9111 Aug 06 '18 at 09:18

3 Answers

30

It is quite weird that numba can be so much slower.

It's not too weird. When you call NumPy functions inside a numba function, you call the numba version of these functions. These can be faster, slower, or just as fast as the NumPy versions. You might be lucky or you might be unlucky (you were unlucky!). But even in the numba function you still create lots of temporaries, because you use the NumPy functions (one temporary array for the dot result, one for each square and sum, one for the dot plus the first sum), so you don't take advantage of the possibilities numba offers.
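
To make those temporaries concrete, here is roughly what the one-liner allocates behind the scenes (a sketch; the intermediate names are mine):

d = np.dot(x1, x2.T)               # temporary of shape (1, 3000000)
sq1 = np.square(x1)                # small temporary, shape (1, 512)
s1 = np.sum(sq1, axis=1)           # shape (1,)
s1 = np.expand_dims(s1, axis=1)    # shape (1, 1), cheap
sq2 = np.square(x2)                # full-size temporary, shape (3000000, 512)!
s2 = np.sum(sq2, axis=1)           # shape (3000000,)
result = -2 * d + s1 + s2          # a few more temporaries for the arithmetic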

Am I using it wrong?

Essentially: Yes.

I really need to speed this up

Okay, I'll give it a try.

Let's start by unrolling the `np.sum(np.square(...), axis=1)` calls:

import numpy as np
import numba as nb

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res


@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return -2 * np.dot(x1, x2.T) + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1) + sum_squares_2d_array_along_axis1(x2)

On my computer that's already 2 times faster than the NumPy code and almost 10 times faster than your original Numba code.

Speaking from experience, getting it 2x faster than NumPy is generally the limit (at least if the NumPy version isn't needlessly complicated or inefficient). However, you can squeeze out a bit more by unrolling everything:

import numpy as np
import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]

    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res

But that only gives a ~10-20% improvement over the latest approach.

At that point you might realize that you can simplify the code (even though it probably won't speed it up):

import numpy as np
import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Yeah, that looks pretty straightforward and it's not really slower.

However, in all the excitement, I forgot to mention the obvious solution: scipy.spatial.distance.cdist, which has a sqeuclidean (squared euclidean distance) option:

from scipy.spatial import distance
distance.cdist(x1, x2, metric='sqeuclidean')

It's not really faster than numba but it's available without having to write your own function...

Tests

Test for correctness and do the warmups:

x1 = np.array([[1.,2,3]])
x2 = np.array([[1.,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7]])

res1 = euclidean_distance_square(x1, x2)
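# euclidean_distance_square_numba_original is the OP's @jit-decorated
# one-liner from the question (nopython=True)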
res2 = euclidean_distance_square_numba_original(x1, x2)
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)
np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))

Timings:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: If you have arrays of integers you might want to change the hard-coded 0.0 in the numba functions to 0.

MSeifert
  • Hmmm... it is weird that my scipy distance function is actually 2x slower in my test, at about 4s. May I ask whether you compiled scipy with special options? –  Jun 03 '18 at 04:28
  • @user2675516 What dtype do your arrays have? It's possible that for certain dtypes the scipy functions are a bit slower - but that's just a guess. It could also be that you're using an old version of scipy. – MSeifert Jun 04 '18 at 06:10
  • I don't think you can (or should) recompile scipy. That's a bit tricky... but if you really want here are [the official instructions](https://scipy.github.io/devdocs/building/index.html). – MSeifert Jun 04 '18 at 06:10
  • I found the culprit: I am using float32, and scipy.distance.cdist is slow on that. It is fast only on float64. –  Jun 04 '18 at 06:42
  • @user2675516 Yeah, I suspected something like that. I think it could be worthwhile to open an issue on the scipy bug tracker. – MSeifert Jun 04 '18 at 06:54
13

Despite the fact that @MSeifert's answer makes this one quite obsolete, I'm still posting it, because it explains in more detail why the numba version was slower than the numpy version.

As we will see, the main culprit is the different memory access patterns of numpy and numba.

We can reproduce the behavior with a much simpler function:

import numpy as np
import numba as nb

def just_sum(x2):
    return np.sum(x2, axis=1)

@nb.jit('double[:](double[:, :])', nopython=True)
def nb_just_sum(x2):
    return np.sum(x2, axis=1)

x = np.random.random((2048, 2048))

And now the timings:

>>> %timeit just_sum(x)
2.33 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x)
33.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That means numpy is about 15 times faster!

When compiling the numba code with annotations (e.g. numba --annotate-html sum.html numba_sum.py) we can see how the sum is performed by numba (see the whole listing of the summation in the appendix):

  1. initialize the result-column
  2. add the whole first column to the result-column
  3. add the whole second column to the result-column
  4. and so on

What is the problem with this approach? The memory layout! The array is stored in row-major order, so reading it column-wise leads to many more cache misses than reading it row-wise (which is what numpy does). There is a great article which explains the possible cache effects.
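
To see the cache effect in isolation, here is a minimal sketch (the function names are mine) that sums the same C-ordered array with both loop orders; on typical hardware the row-wise variant is usually several times faster:

import numpy as np
import numba as nb

@nb.njit
def sum_rowwise(arr):
    # the inner loop walks contiguous memory: cache-friendly for C order
    total = 0.
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

@nb.njit
def sum_colwise(arr):
    # the inner loop jumps a full row length per step: many cache misses
    total = 0.
    for j in range(arr.shape[1]):
        for i in range(arr.shape[0]):
            total += arr[i, j]
    return total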

As we can see, numba's sum implementation is not yet very mature. However, from the above consideration, the numba implementation could be competitive for column-major order (i.e. a transposed matrix):

>>> %timeit just_sum(x.T)
3.09 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x.T)
3.58 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and it really is.

As the code of @MSeifert has shown, the main advantage of numba is that with its help we can reduce the number of temporary numpy arrays. However, some things that look easy are not easy at all, and a naive solution can be pretty bad. Building a sum is such an operation - one should not think that a simple loop is good enough - see for example this question.
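
For illustration, here is a minimal sketch of Kahan (compensated) summation, the kind of technique the linked question is about; a plain loop accumulates rounding error that the compensation largely cancels (note that a compiler with fastmath enabled may optimize the compensation away):

def kahan_sum(arr):
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for x in arr:
        y = x - comp
        t = total + y
        comp = (t - total) - y  # recover what was rounded away
        total = t
    return total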


Appendix: listing of the numba summation:

 Function name: array_sum_impl_axis
in file: /home/ed/anaconda3/lib/python3.6/site-packages/numba/targets/arraymath.py
with signature: (array(float64, 2d, A), int64) -> array(float64, 1d, C)
194:    def array_sum_impl_axis(arr, axis):
195:        ndim = arr.ndim
196:    
197:        if not is_axis_const:
198:            # Catch where axis is negative or greater than 3.
199:            if axis < 0 or axis > 3:
200:                raise ValueError("Numba does not support sum with axis"
201:                                 "parameter outside the range 0 to 3.")
202:    
203:        # Catch the case where the user misspecifies the axis to be
204:        # more than the number of the array's dimensions.
205:        if axis >= ndim:
206:            raise ValueError("axis is out of bounds for array")
207:    
208:        # Convert the shape of the input array to a list.
209:        ashape = list(arr.shape)
210:        # Get the length of the axis dimension.
211:        axis_len = ashape[axis]
212:        # Remove the axis dimension from the list of dimensional lengths.
213:        ashape.pop(axis)
214:        # Convert this shape list back to a tuple using above intrinsic.
215:        ashape_without_axis = _create_tuple_result_shape(ashape, arr.shape)
216:        # Tuple needed here to create output array with correct size.
217:        result = np.full(ashape_without_axis, zero, type(zero))
218:    
219:        # Iterate through the axis dimension.
220:        for axis_index in range(axis_len):
221:            if is_axis_const:
222:                # constant specialized version works for any valid axis value
223:                index_tuple_generic = _gen_index_tuple(arr.shape, axis_index,
224:                                                       const_axis_val)
225:                result += arr[index_tuple_generic]
226:            else:
227:                # Generate a tuple used to index the input array.
228:                # The tuple is ":" in all dimensions except the axis
229:                # dimension where it is "axis_index".
230:                if axis == 0:
231:                    index_tuple1 = _gen_index_tuple(arr.shape, axis_index, 0)
232:                    result += arr[index_tuple1]
233:                elif axis == 1:
234:                    index_tuple2 = _gen_index_tuple(arr.shape, axis_index, 1)
235:                    result += arr[index_tuple2]
236:                elif axis == 2:
237:                    index_tuple3 = _gen_index_tuple(arr.shape, axis_index, 2)
238:                    result += arr[index_tuple3]
239:                elif axis == 3:
240:                    index_tuple4 = _gen_index_tuple(arr.shape, axis_index, 3)
241:                    result += arr[index_tuple4]
242:    
243:        return result 
ead
  • I like that you mentioned that the naive implementation may not be as "correct" as the library function. It's often unnecessary - but in the very few cases where it does matter that could lead to subtle (and hard to track down) problems with the result. It's important to know that NumPy also uses an inexact summation, it's just less "incorrect" because it uses pairwise summation (or at least an unrolled partial summation). If really high accuracy is needed one probably should use [Kahan or Neumaier summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) – MSeifert Jun 03 '18 at 00:56
  • It may not be that relevant here, but using `@nb.jit('double[:](double[:, :])', nopython=True)` (declaring potentially non-contiguous arrays) often breaks SIMD vectorization. You can use automatic type detection or declare a C-contiguous (`double[:, ::1]`) or Fortran-contiguous (`double[::1, :]`) array. – max9111 Jun 04 '18 at 04:47
  • @max9111 In this particular case there was no difference, but it is good to know! – ead Jun 05 '18 at 03:45
8

This is a comment on @MSeifert's answer. There are some more things to gain performance. As in every numerical code, it is advisable to think about which datatype is precise enough for your problem. Often float32 is enough; sometimes even float64 isn't enough.

I also want to mention the fastmath keyword, which can give another 1.7x speed-up here.

[Edit]

For a simple summation I looked into the LLVM code and found that the summation was split up into partial sums during vectorization (4 partial sums for double and 8 for float using AVX2). This has to be investigated further; a rough, hand-written sketch of the idea follows.
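
The following is only an illustrative sketch (my own code, not the actual LLVM output) of what that reassociation amounts to: several independent accumulators that can live in separate SIMD lanes:

import numba as nb

@nb.njit(fastmath=True)
def sum_partial(arr):
    # four independent accumulators, mimicking the partial sums the
    # vectorizer creates once fastmath allows reassociation
    s0 = s1 = s2 = s3 = 0.0
    n = arr.shape[0]
    for i in range(0, n - n % 4, 4):
        s0 += arr[i]
        s1 += arr[i + 1]
        s2 += arr[i + 2]
        s3 += arr[i + 3]
    for i in range(n - n % 4, n):  # remainder loop
        s0 += arr[i]
    return s0 + s1 + s2 + s3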

Code

import numpy as np
import numba as nb

# enable LLVM loop-vectorizer debug output
# (may require an LLVM build with assertions to print anything)
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True)
def euclidean_distance_square_numba_v4(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True,parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in nb.prange(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Timings

float64
x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

0.42 s  v3 @MSeifert
0.25 s  v4
0.18 s  v5 (parallel version)
0.48 s  distance.cdist

float32
x1 = np.random.random((1, 512)).astype(np.float32)
x2 = np.random.random((1000000, 512)).astype(np.float32)

0.09 s  v5

How to explicitly declare types

In general I wouldn't recommend this. Your input arrays can be C-contiguous (like the test data), Fortran-contiguous, or strided. If you know that your data is always C-contiguous, you can write

@nb.njit('double[:](double[:, ::1],double[:, ::1])',fastmath=True)
def euclidean_distance_square_numba_v6(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This offers the same performance as the v4 version, but it will fail if the input arrays are not C-contiguous or not of dtype=np.float64.
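
As a sketch of that failure mode (the test arrays here are mine): with the explicit signature, Numba compiles eagerly and refuses inputs that don't match, so a strided view raises a TypeError instead of triggering a new compilation:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000, 512))

euclidean_distance_square_numba_v6(x1, x2)           # works: C-contiguous float64
try:
    euclidean_distance_square_numba_v6(x1, x2[::2])  # strided view
except TypeError as e:
    print(e)  # no matching definition for these argument types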

You can also use

@nb.njit('double[:](double[:, :],double[:, :])',fastmath=True)
def euclidean_distance_square_numba_v7(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This will also work on strided arrays, but it will be much slower than the version above on C-contiguous arrays (0.66 s vs. 0.25 s). Please note also that your problem is quite limited by memory bandwidth; the difference can be higher in CPU-bound calculations.

If you let Numba do the job for you, it will automatically detect whether the array is contiguous or not (providing contiguous input data on the first try and then non-contiguous data will lead to a recompilation).
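
A small sketch of that lazy-dispatch behavior, assuming a fresh session and using the untyped v4 version from above; the dispatcher compiles one specialization per distinct layout, which you can inspect via its signatures attribute:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000, 512))

euclidean_distance_square_numba_v4(x1, x2)       # compiles a C-contiguous specialization
euclidean_distance_square_numba_v4(x1, x2[::2])  # strided view -> a second compilation
print(euclidean_distance_square_numba_v4.signatures)  # two compiled signatures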

max9111
  • Is there a typo in your answer? Your timing for float32 is slower than float64? Numpy's default is float64, so when you don't give it a dtype, it is float64 and not float32. –  Jun 04 '18 at 06:43
  • Sorry, I made a mistake copying the code... The float32 version is twice as fast as the float64 version. – max9111 Jun 04 '18 at 06:47
  • Good point about `fastmath` - however I would be hesitant to state that it increases precision. That depends heavily on the specific operations, and in general it reduces precision (at least compared to IEEE 754 math). I also tested parallel and it was actually a bit slower (because it's memory-bandwidth-limited), so I find it really interesting that it's faster in your tests. Maybe that's because of fastmath or maybe different cache speeds? – MSeifert Jun 04 '18 at 07:10
  • Out of curiosity: How did you do the benchmarks? Also with `%timeit`? – MSeifert Jun 04 '18 at 07:10
  • @max9111 I updated the post. I modified the code a bit so that it can handle (m, n)-dimension x1. Not sure if I am doing it correctly. Can you help to verify? It is still kinda slow. –  Jun 04 '18 at 07:22
  • No, I tested it with a loop over 20 iterations and divided the result by 20 (with a warmup beforehand). I had some problems with timeit when measuring performance on memory-bound problems where the test data fits in cache. (In such cases I use at least two sets of equally shaped test data to get rid of cache effects, e.g. getting 100 GB/s bandwidth with 30 GB/s memory bandwidth.) Regarding precision, I should be more clear and update my answer. I also did this with Numba 0.39 dev; on earlier versions the SIMD vectorization often doesn't work when using parallelization. – max9111 Jun 04 '18 at 07:22
  • @user2675516 Nice approach! I googled a bit and found a quite similar approach: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html If you can use BLAS calls, you should. With a lot of work, e.g. https://en.wikipedia.org/wiki/Loop_nest_optimization#Tiling_size, you can come quite close, but trying to beat a sophisticated BLAS algorithm doesn't make sense and usually ends in frustration ;). – max9111 Jun 04 '18 at 07:55