
I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?

I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.

Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.

Yes, this is the bottleneck of my program and is worth optimizing.
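For reference, a minimal sketch of the setup being described (sizes from the edit above; note that for 2-D arrays the total element count is `a.size`, not `len(a)`, which only counts rows):

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(25, 40) < 0.5
b = rng.rand(25, 40) < 0.5

# current approach: the xor allocates a temporary array
n_equal = a.size - (a ^ b).sum()

# same result, also via a temporary
assert n_equal == np.count_nonzero(a == b)
```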

Ponkadoodle
  • I find it unusual that this is your bottleneck, rather than whatever is generating the inputs to this part of your code. Where are `a` and `b` coming from? – user2357112 Jul 31 '13 at 03:56
  • This accounts for roughly 30% of my overall program's execution time. I'm searching for letters within an image. I have 62 small images, and I'm checking to see how many pixels they have in common with the larger image in several places. – Ponkadoodle Jul 31 '13 at 04:03

4 Answers


On my machine this is faster:

(a == b).sum()

If you don't want to use any extra storage, then I would suggest using numba. I'm not too familiar with it, but this seems to work well. I ran into some trouble getting Cython to take a boolean NumPy array.

from numba import autojit
from numpy.random import rand
def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
# make numba version
nbsumeq = autojit(pysumeq)
A = (rand(10,10)<.5)
B = (rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
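A quick pure-Python sanity check that the loop above matches the vectorized version (no numba needed; `range` is used here so it also runs on Python 3):

```python
import numpy as np

def pysumeq(a, b):
    # same double loop as above
    tot = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot

A = np.random.rand(10, 10) < 0.5
B = np.random.rand(10, 10) < 0.5
assert pysumeq(A, B) == (A == B).sum()
```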

If you don't have numba, I would suggest using the answer by @user2357112.

Edit: Just got a Cython version working; here's the .pyx file. I'd go with this.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
IanH
  • Thanks for the first suggestion - I had assumed the equality operator would return a single boolean, rather than it being applied to each element. Numba looks like a pain to setup, having to install a custom RTTI-enabled LLVM and everything, which is something I'd prefer not to deal with. – Ponkadoodle Jul 31 '13 at 04:52
  • Yes, it is tricky to set up unless you're already using the Anaconda Python distribution. I would use Cython, but I couldn't get it to work nicely with NumPy boolean arrays. – IanH Jul 31 '13 at 04:56
  • Alright, I just figured out a way to do this with Cython and updated the answer. – IanH Jul 31 '13 at 05:09
  • Wow. The cython solution is over 4 times faster than my original. I'll give a day or two for somebody to try and beat this (which I doubt will happen), and then accept this answer. Thanks for your help! – Ponkadoodle Jul 31 '13 at 06:18

To start with, you can skip the A*B step:

>>> a
array([ True, False,  True, False,  True], dtype=bool)
>>> b
array([False,  True,  True, False,  True], dtype=bool)
>>> np.sum(~(a^b))
3

If you do not mind destroying array a or b, I am not sure you will get faster than this:

>>> a^=b   #In place xor operator
>>> np.sum(~a)
3
Daniel
    `~a` creates an intermediate array, `np.bitwise_not(a, out=a)` shouldn't. But if the arrays are any long, multiplying two scalars is probably going to be faster than negating an array. – Jaime Jul 31 '13 at 04:01
  • It's worth noting also, that the multiply is actually precomputed anyways. That part was pseudocode. – Ponkadoodle Jul 31 '13 at 04:06
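Jaime's point in the comments can be sketched like this; it avoids the temporary that `~a` creates, but still destroys `a`, which the question's immutability constraint rules out:

```python
import numpy as np

a = np.array([True, False, True, False, True])
b = np.array([False, True, True, False, True])

np.bitwise_xor(a, b, out=a)   # in-place xor, same as a ^= b
np.bitwise_not(a, out=a)      # in-place negation, no intermediate array
print(a.sum())                # 3 elements were equal
```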

If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:

out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()

This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
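A sketch of that pattern, assuming fixed-size inputs; `out` is allocated once and reused on every call:

```python
import numpy as np

a = np.random.rand(25, 40) < 0.5
b = np.random.rand(25, 40) < 0.5

out = np.empty_like(a)            # allocated once, outside the loop
for _ in range(3):                # stand-in for the real search loop
    num_eq = np.equal(a, b, out=out).sum()

assert num_eq == (a == b).sum()
```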

user2357112
  • Do you mean `np.logical_xor`? – Daniel Jul 31 '13 at 03:54
  • Nope. He wants to know how many elements are equal. It's more straightforward to use the function that checks whether things are equal. I did mix up the namespace in the original version, though. – user2357112 Jul 31 '13 at 03:55
  • I may be going completely crazy, but I cannot find a `logic` module for numpy. – Daniel Jul 31 '13 at 03:58
  • That was my mistake; I mixed up how the documentation was organized, so I thought the stuff in [this section](http://docs.scipy.org/doc/numpy/reference/routines.logic.html) was in a module called `numpy.logic`. – user2357112 Jul 31 '13 at 03:59
  • I don't think there's any performance improvement over plain `np.sum(a == b)`, in both cases you are having to allocate and fill an array. – Jaime Jul 31 '13 at 04:03
  • You only have to allocate the array once if you allocate it outside of a loop and use the same array on every iteration. – user2357112 Jul 31 '13 at 04:04

Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.

from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot

This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. From here, one could potentially link to a C function optimized with SSE and the like to perform this operation, passing `adata` and `bdata` as `uint8_t*`s.

Ponkadoodle