
I need to calculate the number of non-NaN elements in a numpy ndarray matrix. How would one efficiently do this in Python? Here is my simple code for achieving this:

import numpy as np

def numberOfNonNans(data):
    count = 0
    for i in data:
        if not np.isnan(i):
            count += 1
    return count 

Is there a built-in function for this in numpy? Efficiency is important because I'm doing Big Data analysis.

Thanks for any help!

Travis_Dudeson
jjepsuomi

5 Answers

np.count_nonzero(~np.isnan(data))

~ inverts the boolean matrix returned from np.isnan.

np.count_nonzero counts values that are not 0/False. .sum() should give the same result, but it may be clearer to use count_nonzero.
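To make the one-liner concrete, here is a small sketch with a made-up 1-D array (the array values are illustrative only):

```python
import numpy as np

# Made-up array with two NaNs among five elements.
data = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

mask = ~np.isnan(data)           # True where the value is not NaN
print(np.count_nonzero(mask))    # counts the True entries -> 3
print(mask.sum())                # equivalent result -> 3
```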

Testing speed:

In [23]: data = np.random.random((10000,10000))

In [24]: data[np.ix_(np.random.randint(0, 10000, 100), np.random.randint(0, 100, 100))] = np.nan

In [25]: %timeit data.size - np.count_nonzero(np.isnan(data))
1 loops, best of 3: 309 ms per loop

In [26]: %timeit np.count_nonzero(~np.isnan(data))
1 loops, best of 3: 345 ms per loop

In [27]: %timeit data.size - np.isnan(data).sum()
1 loops, best of 3: 339 ms per loop

data.size - np.count_nonzero(np.isnan(data)) seems to be marginally the fastest here. Other data might give different relative speeds.

M4rtini

Quick-to-write alternative

Even though it is not the fastest choice, if performance is not an issue you can use:

sum(~np.isnan(data))

Performance:

In [7]: %timeit data.size - np.count_nonzero(np.isnan(data))
10 loops, best of 3: 67.5 ms per loop

In [8]: %timeit sum(~np.isnan(data))
10 loops, best of 3: 154 ms per loop

In [9]: %timeit np.sum(~np.isnan(data))
10 loops, best of 3: 140 ms per loop
ClimateUnboxed
G M
  • This answer provides the sum which is not the same as counting the number of elements ... You should use `len` instead. – BenT Mar 28 '20 at 15:37
  • 4
    @BenT the sum of a bool array elements that meet a certain condition is the same providing the len of a subset array with the elements that meet a certain condition. Can you please clarify where this is wrong? – G M Mar 30 '20 at 09:26
  • 3
    My mistake I forgot a Boolean got return. – BenT Mar 30 '20 at 13:55
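One caveat the comments circle around: on a 2-D array, Python's built-in sum iterates over the rows, so sum(~np.isnan(data)) yields an array of per-column counts rather than a single number, while np.sum reduces over every element. A small sketch with a made-up 2x3 array:

```python
import numpy as np

# Made-up 2x3 array, purely to illustrate the difference.
data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, 5.0, 6.0]])

print(sum(~np.isnan(data)))     # built-in sum adds the rows -> per-column counts [1 1 2]
print(np.sum(~np.isnan(data)))  # reduces over all elements -> 4
```

So for 1-D data the two agree, but for matrices np.sum (or count_nonzero) is the one that gives a total count.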

To determine whether the array is sparse, it may help to compute the proportion of NaN values:

np.isnan(ndarr).sum() / ndarr.size

If that proportion exceeds a threshold, then use a sparse array, e.g. https://sparse.pydata.org/en/latest/
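A minimal sketch of that check, with a made-up array and an assumed threshold of 0.5 (tune the cutoff for your own workload):

```python
import numpy as np

# Made-up array; 3 of its 4 entries are NaN.
ndarr = np.array([[1.0, np.nan],
                  [np.nan, np.nan]])

nan_fraction = np.isnan(ndarr).sum() / ndarr.size
print(nan_fraction)  # -> 0.75

THRESHOLD = 0.5  # assumed cutoff, not a universal rule
if nan_fraction > THRESHOLD:
    print("consider a sparse representation")
```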

Darren Weber
  • This is very helpful and interesting but was this answer intended for a different question..? :) – jtlz2 Mar 08 '22 at 19:09

An alternative, though a bit slower one, is to do it via boolean indexing.

np.isnan(data)[np.isnan(data) == False].size

In [30]: %timeit np.isnan(data)[np.isnan(data) == False].size
1 loops, best of 3: 498 ms per loop 

The double use of np.isnan(data) and the == operator is a bit of overkill, so I posted this answer only for completeness.

Manuel
len([i for i in data if np.isnan(i) == True])
PirateNinjas
  • What is the question ? – mmeisson Jun 27 '22 at 13:48
  • Downvoted: the loop will be slower than any of the vectorized variants in the other answers and creating a list will take more memory (and I would argue harder to read). Also `if x == True` is very silly! – simlmx Mar 30 '23 at 21:10