
I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?

I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.

Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.

Yes, this is the bottleneck of my program and is worth optimizing.
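For reference, a minimal sketch of the setup being described (sizes from the edit above; note that for 2-D arrays the total element count is `a.size`, not `len(a)`, which only counts rows):

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(25, 40) < 0.5
b = rng.rand(25, 40) < 0.5

# current approach: the xor allocates a temporary array
n_equal = a.size - (a ^ b).sum()

# same result, also via a temporary
assert n_equal == np.count_nonzero(a == b)
```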

Ponkadoodle
  • I find it unusual that this is your bottleneck, rather than whatever is generating the inputs to this part of your code. Where are `a` and `b` coming from? – user2357112 Jul 31 '13 at 03:56
  • This accounts for roughly 30% of my overall program's execution time. I'm searching for letters within an image. I have 62 small images, and I'm checking to see how many pixels they have in common with the larger image in several places. – Ponkadoodle Jul 31 '13 at 04:03

4 Answers


On my machine this is faster:

(a == b).sum()

If you don't want to use any extra storage, then I would suggest using numba. I'm not too familiar with it, but this seems to work well. I ran into some trouble getting Cython to take a boolean NumPy array.

from numba import autojit
from numpy.random import rand
def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
# make numba version
nbsumeq = autojit(pysumeq)
A = (rand(10,10)<.5)
B = (rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
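A quick pure-Python sanity check that the loop above matches the vectorized version (no numba needed; `range` is used here so it also runs on Python 3):

```python
import numpy as np

def pysumeq(a, b):
    # same double loop as above
    tot = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot

A = np.random.rand(10, 10) < 0.5
B = np.random.rand(10, 10) < 0.5
assert pysumeq(A, B) == (A == B).sum()
```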

If you don't have numba, I would suggest using the answer by @user2357112.

Edit: Just got a Cython version working; here's the .pyx file. I'd go with this.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
IanH
  • Thanks for the first suggestion - I had assumed the equality operator would return a single boolean, rather than it being applied to each element. Numba looks like a pain to setup, having to install a custom RTTI-enabled LLVM and everything, which is something I'd prefer not to deal with. – Ponkadoodle Jul 31 '13 at 04:52
  • Yes, it is tricky to set up unless you're already using the Anaconda Python distribution. I would use Cython, but I couldn't get it to work nicely with NumPy boolean arrays. – IanH Jul 31 '13 at 04:56
  • Alright, I just figured out a way to do this with Cython and updated the answer. – IanH Jul 31 '13 at 05:09
  • Wow. The cython solution is over 4 times faster than my original. I'll give a day or two for somebody to try and beat this (which I doubt will happen), and then accept this answer. Thanks for your help! – Ponkadoodle Jul 31 '13 at 06:18

To start with, you can skip the A*B step:

>>> a
array([ True, False,  True, False,  True], dtype=bool)
>>> b
array([False,  True,  True, False,  True], dtype=bool)
>>> np.sum(~(a^b))
3

If you do not mind destroying array a or b, I am not sure you will get faster than this:

>>> a^=b   #In place xor operator
>>> np.sum(~a)
3
Daniel
    `~a` creates an intermediate array, `np.bitwise_not(a, out=a)` shouldn't. But if the arrays are any long, multiplying two scalars is probably going to be faster than negating an array. – Jaime Jul 31 '13 at 04:01
  • It's worth noting also, that the multiply is actually precomputed anyways. That part was pseudocode. – Ponkadoodle Jul 31 '13 at 04:06
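Jaime's point in the comments can be sketched like this; it avoids the temporary that `~a` creates, but still destroys `a`, which the question's immutability constraint rules out:

```python
import numpy as np

a = np.array([True, False, True, False, True])
b = np.array([False, True, True, False, True])

np.bitwise_xor(a, b, out=a)   # in-place xor, same as a ^= b
np.bitwise_not(a, out=a)      # in-place negation, no intermediate array
print(a.sum())                # 3 elements were equal
```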

If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:

out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()

This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
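A sketch of that pattern, assuming fixed-size inputs; `out` is allocated once and reused on every call:

```python
import numpy as np

a = np.random.rand(25, 40) < 0.5
b = np.random.rand(25, 40) < 0.5

out = np.empty_like(a)            # allocated once, outside the loop
for _ in range(3):                # stand-in for the real search loop
    num_eq = np.equal(a, b, out=out).sum()

assert num_eq == (a == b).sum()
```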

user2357112
  • Do you mean `np.logical_xor`? – Daniel Jul 31 '13 at 03:54
  • Nope. He wants to know how many elements are equal. It's more straightforward to use the function that checks whether things are equal. I did mix up the namespace in the original version, though. – user2357112 Jul 31 '13 at 03:55
  • I may be going completely crazy, but I cannot find a `logic` module for numpy. – Daniel Jul 31 '13 at 03:58
  • That was my mistake; I mixed up how the documentation was organized, so I thought the stuff in [this section](http://docs.scipy.org/doc/numpy/reference/routines.logic.html) was in a module called `numpy.logic`. – user2357112 Jul 31 '13 at 03:59
  • I don't think there's any performance improvement over plain `np.sum(a == b)`, in both cases you are having to allocate and fill an array. – Jaime Jul 31 '13 at 04:03
  • You only have to allocate the array once if you allocate it outside of a loop and use the same array on every iteration. – user2357112 Jul 31 '13 at 04:04

Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot

This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. From here, one could potentially link to a C function optimized with SSE and the like to perform this operation, passing `adata` and `bdata` as `uint8_t*`s.

Ponkadoodle