160

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.

I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?

(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
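For example, with a small array (just to illustrate the problem):

import numpy as np

X = np.array([1.0, 2.0, np.nan])

np.nan in X           # False: membership uses ==, and np.nan != np.nan
np.isnan(X).any()     # True, but np.isnan(X) allocates a bool array of shape X.shape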

alan.elkin
Fred Foo
  • does validating the user input not work in this scenario? As in check for NaN before the insert – Woot4Moo Jul 18 '11 at 17:14
  • @Woot4Moo: no, the library takes NumPy arrays or `scipy.sparse` matrices as input. – Fred Foo Jul 18 '11 at 20:28
  • If you're doing this a lot, I've heard good things about Bottleneck (http://pypi.python.org/pypi/Bottleneck) – matt Jul 19 '11 at 17:21

8 Answers

203

Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.

Edit: The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:

In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
Uli Köhler
NPE
  • `np.min` is faster when the array contains no NaNs, which is my expected input. But I've decided to accept this one anyway, because it catches `inf` and `neginf` as well. – Fred Foo Jul 18 '11 at 20:27
  • This only catches `inf` or `-inf` if the input contains both, and it has problems if the input contains large but finite values that overflow when added together. – user2357112 Aug 19 '13 at 19:28
  • min and max do not need to branch for floating-point data on SSE-capable x86 chips. So as of numpy 1.8, min will not be slower than sum; on my AMD Phenom it's even 20% faster. – jtaylor Jun 14 '14 at 14:54
  • On my Intel Core i5, with numpy 1.9.2 on OSX, `np.sum` is still about 30% faster than `np.min`. – Matthew Brett Jul 30 '15 at 09:43
  • `np.isnan(x).any(0)` is slightly faster than `np.sum` and `np.min` on my machine, although there might be some unwanted caching. – jsignell Apr 29 '16 at 18:44
  • Or you can use sum() instead of np.sum, which can be much faster (in my case, about 1/10th the time). – dreab Nov 10 '16 at 14:18
  • Additional information for people wondering why `sum()` is faster: https://stackoverflow.com/q/22392185/974555 – gerrit Jul 25 '17 at 23:35
39

I think np.isnan(np.min(X)) should do what you want.
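This works because NaN propagates through the reduction, so a single scalar check is enough; a quick illustration:

import numpy as np

x = np.array([1.0, 2.0, np.nan])

np.isnan(np.min(x))       # True: the min of an array containing NaN is NaN
np.isnan(np.min(x[:2]))   # False: no NaN present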

Ray
27

There are two general approaches here:

  • Check each array item for nan and take any.
  • Apply some cumulative operation that preserves nans (like sum) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the differences in speed between machines.

[perfplot benchmark: runtime of the min, sum, dot, any, and einsum checks vs. len(a)]

import numpy as np
import perfplot


def min(a):
    return np.isnan(np.min(a))


def sum(a):
    return np.isnan(np.sum(a))


def dot(a):
    return np.isnan(np.dot(a, a))


def any(a):
    return np.any(np.isnan(a))


def einsum(a):
    return np.isnan(np.einsum("i->", a))


b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Nico Schlömer
18

Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):

In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop

Thus, the truly efficient approach may depend heavily on the operating system. In any case, the dot-based check seems to be the most stable one.
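As a rough sketch of the variant suggested in the comments below (take the dot product with a vector of ones and check the scalar with isfinite, so that inf is flagged too):

import numpy as np

x = np.random.rand(10)
x[3] = np.nan

# dot with ones is effectively a BLAS-backed sum; the scalar result is
# non-finite if x contains NaN or +/- inf (very large finite values can
# still overflow to inf, as noted in the comments)
np.isfinite(np.dot(x, np.ones_like(x)))   # False here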

eat
  • I suspect it depends not so much on the OS, as on the underlying BLAS implementation and C compiler. Thanks, but a dot product is just a tad more likely to overflow when `x` contains large values, and I also want to check for inf. – Fred Foo Jul 19 '11 at 06:08
  • Well, you can always do the dot product with ones and use `isfinite(.)`. I just wanted to point out the huge performance gap. Thanks – eat Jul 19 '11 at 07:46
  • The same on my machine. – kawing-chiu Sep 01 '16 at 01:44
  • **Clever, no?** As [Fred Foo](https://stackoverflow.com/questions/6736590/fast-check-for-nan-in-numpy#comment7991035_6739580) suggests, any efficiency gains of the dot product-based approach are almost certainly thanks to a local NumPy installation linked against an optimized BLAS implementation like ATLAS, MKL, or OpenBLAS. This is the case for Anaconda, for example. Given that, this dot product will be parallelized across *all* available cores. The same *cannot* be said for the `min`- or `sum`-based approaches, which run confined to a single core. Ergo, that performance gap. – Cecil Curry Feb 03 '18 at 06:47
7

If you're comfortable with Numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found):

import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False

If there is no NaN, the function might actually be slower than np.min; I think that's because np.min uses multiprocessing for large arrays:

import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:

array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

Similar results may be achieved with Cython or a C extension; these are a bit more complicated (or easily available as bottleneck.anynan), but ultimately they do the same as my anynan function.
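If Bottleneck is installed, the check looks roughly like this:

import numpy as np
import bottleneck as bn

array = np.random.random(2000000)
array[100] = np.nan

# like the Numba version above, this stops at the first NaN it finds
bn.anynan(array)   # True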

MSeifert
5
  1. use .any()

    if numpy.isnan(myarray).any()

  2. numpy.isfinite may be better than isnan for checking (see the sketch after this list)

    if not np.isfinite(prop).all()
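A sketch combining the two ideas in the spirit of the accepted answer: apply isfinite to a reduction instead of to the whole array, so no boolean temporary of shape X.shape is built (with the overflow caveat mentioned in earlier comments):

import numpy as np

X = np.random.rand(1000)

# a single scalar check: a non-finite sum means X contains NaN or +/- inf
# (very large finite values could also overflow the sum)
if not np.isfinite(np.sum(X)):
    raise ValueError("array must not contain NaN or infinity")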

woso
1

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way I know of to handle that:

index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
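For example (a quick illustration on a small array):

import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])

# n != n is only True for NaN, so this yields the index of the first NaN (or None)
index = next((i for (i, n) in enumerate(x) if n != n), None)
print(index)   # 2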
vitiral
0

Adding to @nico-schlömer's and @mseifert's answers, I benchmarked a Numba-compiled has_nan with early stopping against some of the functions that scan the full array.

On my machine, for an array without NaNs, the break-even point is at ~10^4 elements.

[perfplot: has_nan (early stop) vs. min and dot on NaN-free arrays, runtime vs. len(a)]


import perfplot
import numpy as np
import numba
import math

def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    for i in range(a.size):  # check every element; return early on the first NaN
        if math.isnan(a[i]):
            return True
    return False


def array_with_missing_values(n, p):
    """ Return array of size n,  p : nans ( % of array length )
    Ex : n=1e6, p=1 : 1e4 nan assigned at random positions """
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p*len(a)/100))
    a[p] = np.nan
    return a


#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

What happens if the array has NaNs? I investigated the impact of the NaN coverage of the array.

For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % NaNs (so ~10 NaNs) in the array.

[perfplot: runtime of min, dot, and has_nan vs. % of NaNs in an array of length 1,000,000]


#%%
N = 1000000  # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01, 
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)

If in your application most arrays have a NaN and you're looking for the ones without, then has_nan is the best approach. Otherwise, dot seems to be the best option.

erwanp