22

Is there an idiomatic way to compare two NumPy arrays that treats NaNs as equal to each other (but not equal to anything other than a NaN)?

For example, I want the following two arrays to compare equal:

np.array([1.0, np.nan, 2.0])
np.array([1.0, np.nan, 2.0])

and the following two arrays to compare unequal:

np.array([1.0, np.nan, 2.0])
np.array([1.0, 0.0, 2.0])

I am looking for a method that would produce a scalar Boolean outcome.

The following would do it:

np.all((a == b) | (np.isnan(a) & np.isnan(b)))

but it's clunky and creates all those intermediate arrays.
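
For reference, here is that expression wrapped in a small helper and run on the two example pairs. The name `nan_equal` is just illustrative; this is a sketch of the baseline, not a proposed solution.

```python
import numpy as np

def nan_equal(a, b):
    """Hypothetical helper: True if a and b match elementwise, counting NaN as equal to NaN."""
    # Each sub-expression below allocates a temporary boolean array.
    return bool(np.all((a == b) | (np.isnan(a) & np.isnan(b))))

x = np.array([1.0, np.nan, 2.0])
y = np.array([1.0, np.nan, 2.0])
z = np.array([1.0, 0.0, 2.0])

print(nan_equal(x, y))  # True
print(nan_equal(x, z))  # False
```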

Is there a way that's easier on the eye and makes better use of memory?

P.S. If it helps, the arrays are known to have the same shape and dtype.

sega_sai
NPE
  • @DanielRoseman: I understand that. I've got two methods of producing a NumPy array, and I need to know whether they've produced identical arrays. – NPE May 30 '12 at 15:51
  • You've ruled out one answer from [this question](http://stackoverflow.com/q/10710328/577088); are you ruling out the other two as well? – senderle May 30 '12 at 16:01
  • @senderle: Thanks for the pointer. That question didn't show up in my search. However, all of those suggestions are either verbose or make very poor use of memory (or both). :-( – NPE May 30 '12 at 16:05
  • @aix, I agree :) Just wanted to draw your attention to it. The `testing.assert_equal` approach is almost good, except that it presumably fails if `__debug__` is False! – senderle May 30 '12 at 16:08
  • If you're using the current git tip for numpy, there's a [`numpy.isclose` function](https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py#L2039) that takes an `equal_nan` keyword argument (which defaults to `False` for compatibility). It's not terribly memory-friendly, though. – Joe Kington May 30 '12 at 16:10
  • If it weren't for numbers which compare equal but have different binary representations (e.g. 0.0 and -0.0), then `memoryview(a0) == memoryview(a1)` would do it. – DSM May 30 '12 at 16:30
  • @DSM: Thank you for this. It might actually fit the bill for my use case. Would you mind writing it up as an answer? – NPE May 30 '12 at 16:38
  • Have you looked at http://stackoverflow.com/questions/10710328/comparing-numpy-arrays-containing-nan/10710390 – JoshAdel May 30 '12 at 17:20
  • @JoshAdel: Yes. Please see my earlier comment addressed to senderle. – NPE May 30 '12 at 17:22
  • Does this answer your question? [comparing numpy arrays containing NaN](https://stackoverflow.com/questions/10710328/comparing-numpy-arrays-containing-nan) – iacob Mar 24 '21 at 22:53

4 Answers

18

If you really care about memory use (e.g. you have very large arrays), use numexpr; the following expression will work for you:

np.all(numexpr.evaluate('(a==b)|((a!=a)&(b!=b))'))

I've tested it on very large arrays of length 3e8, and on my machine the code performs the same as

np.all(a==b)

and uses the same amount of memory.
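
If numexpr isn't available, a similar memory bound can be approximated in plain NumPy by evaluating the same test one block at a time. This is a different technique from the numexpr call, sketched under the assumption that both inputs are NumPy arrays; `nan_equal_chunked` and its `chunk` parameter are made-up names.

```python
import numpy as np

def nan_equal_chunked(a, b, chunk=1_000_000):
    """Hypothetical helper: NaN-aware equality, evaluated one block at a time."""
    if a.shape != b.shape:
        return False
    a = np.ravel(a)  # a view (no copy) when the input is contiguous
    b = np.ravel(b)
    for start in range(0, a.size, chunk):
        x = a[start:start + chunk]
        y = b[start:start + chunk]
        # Same test as the numexpr string: x != x is True only where x is NaN.
        if not np.all((x == y) | ((x != x) & (y != y))):
            return False  # stop at the first block that contains a mismatch
    return True

a = np.arange(10.0)
a[3] = np.nan
b = a.copy()
print(nan_equal_chunked(a, b, chunk=4))  # True
```

The temporaries are limited to `chunk` elements each, and the loop returns early on a mismatch, at the cost of a Python-level loop over the blocks.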

sega_sai
9

NumPy 1.10 added the equal_nan keyword to np.allclose (https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html).

So you can now do:

In [24]: np.allclose(np.array([1.0, np.nan, 2.0]),
                     np.array([1.0, np.nan, 2.0]), equal_nan=True)
Out[24]: True
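
One caveat worth illustrating: np.allclose compares within a tolerance, so it is not an exact equality test. As far as I know, np.array_equal also accepts equal_nan (in NumPy 1.19 and later) and compares exactly. A minimal sketch, assuming a recent enough NumPy:

```python
import numpy as np

a = np.array([1.0, np.nan, 2.0])
b = np.array([1.0, np.nan, 2.0])
c = np.array([1.0, np.nan, 2.0 + 1e-9])  # differs from a only by a tiny amount

# allclose is tolerance-based (defaults rtol=1e-05, atol=1e-08), so c still matches a.
close_ac = np.allclose(a, c, equal_nan=True)

# array_equal tests exact equality; its equal_nan keyword needs NumPy >= 1.19.
exact_ab = np.array_equal(a, b, equal_nan=True)
exact_ac = np.array_equal(a, c, equal_nan=True)

print(close_ac, exact_ab, exact_ac)  # True True False
```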
joris
  • This does not work with strings, by the way. Comparing arrays with strings will throw: `TypeError("ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''")` – Ian Dec 19 '18 at 16:49
8

Disclaimer: I don't recommend this for regular use, and I wouldn't use it myself, but I could imagine rare circumstances under which it might be useful.

If the arrays have the same shape and dtype, you could consider using the low-level memoryview:

>>> import numpy as np
>>> 
>>> a0 = np.array([1.0, np.nan, 2.0])
>>> ac = a0 * (1+0j)
>>> b0 = np.array([1.0, np.nan, 2.0])
>>> b1 = np.array([1.0, np.nan, 2.0, np.nan])
>>> c0 = np.array([1.0, 0.0, 2.0])
>>> 
>>> memoryview(a0)
<memory at 0x85ba1bc>
>>> memoryview(a0) == memoryview(a0)
True
>>> memoryview(a0) == memoryview(ac) # equal but different dtype
False
>>> memoryview(a0) == memoryview(b0) # hooray!
True
>>> memoryview(a0) == memoryview(b1)
False
>>> memoryview(a0) == memoryview(c0)
False

But beware of subtle problems like this:

>>> zp = np.array([0.0])
>>> zm = -1*zp
>>> zp
array([ 0.])
>>> zm
array([-0.])
>>> zp == zm
array([ True], dtype=bool)
>>> memoryview(zp) == memoryview(zm)
False

which happens because the binary representations differ even though the values compare equal (they have to, of course: that's how it knows to print the negative sign):

>>> memoryview(zp)[0]
'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> memoryview(zm)[0]
'\x00\x00\x00\x00\x00\x00\x00\x80'

On the bright side, it short-circuits the way you might hope it would:

In [47]: a0 = np.arange(10**7)*1.0
In [48]: a0[-1] = np.nan
In [49]: b0 = np.arange(10**7)*1.0
In [50]: b0[-1] = np.nan
In [51]: timeit memoryview(a0) == memoryview(b0)
10 loops, best of 3: 31.7 ms per loop
In [52]: c0 = np.arange(10**7)*1.0
In [53]: c0[0] = np.nan
In [54]: d0 = np.arange(10**7)*1.0
In [55]: d0[0] = 0.0
In [56]: timeit memoryview(c0) == memoryview(d0)
100000 loops, best of 3: 2.51 us per loop

and for comparison:

In [57]: timeit np.all((a0 == b0) | (np.isnan(a0) & np.isnan(b0)))
1 loops, best of 3: 296 ms per loop
In [58]: timeit np.all((c0 == d0) | (np.isnan(c0) & np.isnan(d0)))
1 loops, best of 3: 284 ms per loop
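
A note for Python 3: the sessions above look like Python 2, where memoryview compared raw memory. Since Python 3.3, memoryview equality unpacks items according to their format, so NaN items compare unequal there. A rough way to get the raw-byte behaviour back is to compare byte views instead. The helper name `same_bytes` is made up, and the sketch assumes non-object dtypes:

```python
import numpy as np

def same_bytes(a, b):
    """Hypothetical helper: True if a and b have the same dtype, shape and raw bytes."""
    if a.dtype != b.dtype or a.shape != b.shape:
        return False
    # Reinterpret each buffer as unsigned bytes; for already-contiguous input
    # this is a view, not a copy.
    va = memoryview(np.ascontiguousarray(a).view(np.uint8))
    vb = memoryview(np.ascontiguousarray(b).view(np.uint8))
    return va == vb

a0 = np.array([1.0, np.nan, 2.0])
b0 = np.array([1.0, np.nan, 2.0])
c0 = np.array([1.0, 0.0, 2.0])
zp = np.array([0.0])
zm = -zp

print(same_bytes(a0, b0))  # True
print(same_bytes(a0, c0))  # False
print(same_bytes(zp, zm))  # False: 0.0 and -0.0 differ in the sign bit
```

The same caveats apply as before: 0.0 and -0.0 compare unequal, and so do NaNs that carry different payload bits.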
DSM
  • (+1) This is great, thanks for taking the time to write it up. – NPE May 30 '12 at 17:19
  • @aix: I've actually needed something similar in the past (equal-considering-nans-equal), though performance and memory weren't issues so I did it manually. Might be worth making a feature request. – DSM May 30 '12 at 17:27
0

Not sure this is any better, but a thought...

import numpy

# numpy.float64 is used as the base class; the numpy.float_ alias was removed in NumPy 2.0
class FloatOrNaN(numpy.float64):
    def __eq__(self, other):
        # NaN equals NaN; otherwise fall back to the ordinary float comparison
        return (numpy.isnan(self) and numpy.isnan(other)) or super(FloatOrNaN, self).__eq__(other)

a = [1., numpy.nan, 2.]
one = numpy.array([FloatOrNaN(val) for val in a], dtype=object)
two = numpy.array([FloatOrNaN(val) for val in a], dtype=object)
print(one == two)   # elementwise comparison; all three entries are True

This pushes the ugliness into the dtype, at the expense of making the inner workings Python instead of C (Cython etc. would fix this). It does, however, greatly reduce memory costs.

Still kinda ugly though :(

Ethan Coon