NumPy: calculate averages with NaNs removed

Question

How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE).

Here is my [non-]working example:

import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # [  2.  nan  nan  nan]

With NaNs removed, my expected output would be:

array([ 2.,  4.5,  6.,  nan])

Since numpy 1.8, there are nanmean and nanstd available. – Roman Shapovalov Oct 02 '14 at 12:42

score 35 · Accepted Answer · edited May 22 '22 at 19:26

I think what you want is a masked array:

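import numpy as np
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
mm = np.mean(mdat, axis=1)                     # row means over the unmasked values only
print(mm.filled(np.nan))  # the desired answer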

Edit: Combining all of the timing data

from timeit import Timer

setupstr = """
import numpy as np
# scipy.stats.nanmean was removed from later SciPy releases; np.nanmean replaces it there
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""

method1 = """
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""

N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

# one line per method, in the order t1..t5, with the ratio relative to method1
for t in (t1, t2, t3, t4, t5):
    print('Time: %f\tRatio: %f' % (t, t / t1))

Returns:

Time: 0.045454  Ratio: 1.000000
Time: 8.179479  Ratio: 179.950595
Time: 0.060988  Ratio: 1.341755
Time: 0.070955  Ratio: 1.561029
Time: 0.065152  Ratio: 1.433364
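Before comparing speed, it can help to confirm that the vectorised candidates actually return the same row means. The following is a minimal sketch under a few assumptions: a smaller array than in the benchmark above, a fixed seed, np.nanmean (NumPy 1.8+) as the reference, and helper names (masked_array_mean, masked_invalid_mean, per_row_mean) that are made up for illustration. Timings will differ by machine and library version, so none are quoted here.

```python
from timeit import timeit

import numpy as np

np.random.seed(0)  # fixed seed for repeatability (not in the original setup)
dat = np.random.normal(size=(200, 200))
ii = np.ix_(np.random.randint(0, 99, size=50), np.random.randint(0, 99, size=50))
dat[ii] = np.nan   # scatter NaNs over a sub-grid, as in the benchmark setup


def masked_array_mean(a):
    # explicit mask built from np.isnan
    return np.mean(np.ma.masked_array(a, np.isnan(a)), axis=1).filled(np.nan)


def masked_invalid_mean(a):
    # mask created on the fly (also masks +/-inf)
    return np.ma.masked_invalid(a).mean(axis=1).filled(np.nan)


def per_row_mean(a):
    # Python-level loop over rows, keeping only finite entries
    return np.array([r[np.isfinite(r)].mean() for r in a])


candidates = {
    "masked_array": masked_array_mean,
    "masked_invalid": masked_invalid_mean,
    "per_row": per_row_mean,
}

reference = np.nanmean(dat, axis=1)
for name, fn in candidates.items():
    agrees = np.allclose(fn(dat), reference, equal_nan=True)
    seconds = timeit(lambda: fn(dat), number=5)
    print("%-15s agrees=%s  time=%.4fs" % (name, agrees, seconds))
```

If any line reports agrees=False, the timing for that method is not worth comparing until the discrepancy is understood.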

edited May 22 '22 at 19:26

Wilson Sauthoff

answered Mar 30 '11 at 01:04

JoshAdel

I think scipy.nanmean should be the first thing you try. I wonder if it is still slow? – mathtick Nov 13 '12 at 16:12
@mathtick There are a variety of ways of accomplishing what the OP asked. I offered one such method that is a bit more verbose, but is faster than all of the other suggested ones that are benchmarked above, at least on my machine (this still holds true now with updated versions of scipy and numpy). – JoshAdel Nov 13 '12 at 22:19
@mathtick Furthermore, there is no `scipy.nanmean` method in scipy 0.10 or 0.11 as far as I can tell. There is `scipy.stats.stats.nanmean` and `scipy.stats.nanmean`, which are equivalent and I tested above. – JoshAdel Nov 13 '12 at 22:25
Sorry, that should be scipy.stats.nanmean ... and I'm running scipy.__version__ '0.10.1'. – mathtick Nov 14 '12 at 16:49
scipy.stats.nanmean and .nanstd do axis= too (with default axis=0 not None) – denis Nov 17 '12 at 17:22
I tested this in one dimension and `np.nansum(dat) / np.sum(~np.isnan(dat))` is slightly faster than `np.mean(np.ma.masked_array(dat, np.isnan(dat)))`. However, as pointed out earlier, bottleneck is 10x faster. – Dr. Jan-Philip Gehrcke Mar 01 '13 at 15:02
It seems `np.nansum(dat)` is the best. `Python 2.7.11 |Anaconda 2.4.1 (64-bit) IPython 4.0.1 In[190]: %timeit method1() 100 loops, best of 3: 7.09 ms per loop In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat] 1 loops, best of 3: 1.04 s per loop In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat]) 10 loops, best of 3: 19.6 ms per loop In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1) 100 loops, best of 3: 11.8 ms per loop In[194]: %timeit nanmean(dat,axis=1) 100 loops, best of 3: 6.36 ms per loop` – Sklavit Feb 11 '16 at 16:17

score 19 · Answer 2 · answered Mar 30 '11 at 01:10

If performance matters, you should use bottleneck.nanmean() instead:

http://pypi.python.org/pypi/Bottleneck
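A minimal sketch of the call on the array from the question, assuming the Bottleneck package is installed (it is a separate install). The fallback to np.nanmean is an addition for illustration only, so that the snippet still runs where Bottleneck is missing; it is not part of the suggestion above.

```python
import warnings

import numpy as np

try:
    import bottleneck as bn
    nanmean = bn.nanmean
except ImportError:
    # Assumption for this sketch: fall back to NumPy (1.8+) without Bottleneck.
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The all-NaN last row may trigger a RuntimeWarning; the result is still nan.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)  # row means 2, 4.5, 6 and nan
```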

answered Mar 30 '11 at 01:10

deprecated

score 12 · Answer 3 · edited Jan 18 '17 at 19:57

From numpy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:

>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
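Applied to the 2-D array from the question, a minimal sketch passes axis=1 to get one mean per row (the name row_means is only illustrative). The all-NaN row still yields nan, and NumPy emits a RuntimeWarning ("Mean of empty slice") for it, which is silenced here:

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # The last row has no valid entries, so nanmean warns and returns nan for it.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)  # row means 2, 4.5, 6 and nan
```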

edited Jan 18 '17 at 19:57

Pont

answered Mar 06 '16 at 20:52

Alexander

score 12 · Answer 4 · answered Mar 30 '11 at 01:02

Assuming you've also got SciPy installed:

http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean

answered Mar 30 '11 at 01:02

Shaun Dubuque

Just for completeness since I've timed all of the other code - `stats.stats.nanmean` is ~1.5x slower than the `np.ma` solution. – JoshAdel Mar 30 '11 at 13:37

score 8 · Answer 5 · answered Mar 30 '11 at 08:47

A masked array with the nans filtered out can also be created on the fly:

print(np.ma.masked_invalid(dat).mean(1))

answered Mar 30 '11 at 08:47

Sven Marnach

I hadn't thought to use this. It's a nice one-liner, but it's still ~1.5-2x slower than my solution in my tests. Still +1 for exposing me to a `np.ma` method that I hadn't looked at before. – JoshAdel Mar 30 '11 at 13:29

score 8 · Answer 6 · edited Nov 08 '11 at 03:35

You can always find a workaround in something like:

numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)

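A skipna option for numpy.mean was planned for a future NumPy release when this answer was written, but released versions of numpy.mean do not have it; numpy.nanmean (NumPy 1.8 and later) covers this case.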
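To make the arithmetic of the nansum workaround concrete, here is a small worked sketch on the array from the question (totals, counts and row_means are illustrative names). For the all-NaN row both the sum and the count are zero, so the division is 0/0; that gives nan, and np.errstate is used here only to silence the accompanying warning. This assumes a NumPy version where nansum of an all-NaN slice returns 0.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

totals = np.nansum(dat, axis=1)            # NaNs contribute zero: 6, 9, 6, 0
counts = np.sum(np.isfinite(dat), axis=1)  # valid entries per row: 3, 2, 1, 0

with np.errstate(invalid="ignore"):
    # 0 / 0 for the all-NaN row evaluates to nan
    row_means = totals / counts

print(row_means)  # row means 2, 4.5, 6 and nan
```

Note that np.isfinite also excludes infinities from the count while np.nansum still adds them; if the data can contain inf, counting with ~np.isnan(dat) keeps the numerator and denominator consistent.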

edited Nov 08 '11 at 03:35

answered Nov 08 '11 at 03:29

Benjamin

score 3 · Answer 7 · answered Jan 29 '14 at 18:44

How about using Pandas to do this:

import numpy as np
import pandas as pd
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
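print(dat)
print(dat.mean(1))

df = pd.DataFrame(dat)
print(df.mean(axis=1))

DataFrame.mean skips NaN by default (skipna=True), so the last line gives the row means 2.0, 4.5, 6.0 and NaN.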

score 3 · Answer 8 · answered Jan 12 '12 at 21:25

This is built upon the solution suggested by JoshAdel.

Define the following function:

import numpy

def nanmean(data, **args):
    # mask the NaNs, average what is left, and turn fully masked results back into NaN
    return numpy.ma.filled(numpy.ma.masked_array(data, numpy.isnan(data)).mean(**args), fill_value=numpy.nan)

Example use:

data = [[0, 1, numpy.nan], [8, 5, 1]]
data = numpy.array(data)
print(data)
print(nanmean(data))
print(nanmean(data, axis=0))
print(nanmean(data, axis=1))

Will print out:

[[  0.   1.  nan]
 [  8.   5.   1.]]

3.0

[ 4.  3.  1.]

[ 0.5         4.66666667]

Mahé · Answer 9 · 2013-12-04T22:01:30.423

Or you can use laxarray, freshly uploaded, which is, among other things, a wrapper for masked arrays.

import laxarray as la
la.array(dat).mean(axis=1)

Following JoshAdel's protocol, I get:

Time: 0.048791  Ratio: 1.000000   
Time: 0.062242  Ratio: 1.275689   # laxarray's one-liner

So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.

Check out: https://github.com/perrette/laxarray

EDIT: I have checked with another module, "la", larry, which beats all tests:

import la
la.larry(dat).mean(axis=1)

By hand, Time: 0.049013 Ratio: 1.000000
Larry,   Time: 0.005467 Ratio: 0.111540
laxarray Time: 0.061751 Ratio: 1.259889

Impressive!

score 1 · Answer 10 · answered Feb 11 '16 at 16:28

One more speed check for all proposed approaches:

Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]
IPython 4.0.1 -- An enhanced Interactive Python.

import numpy as np
from scipy.stats.stats import nanmean    
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
In[185]: def method1():
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan) 

In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
In[195]: import bottleneck as bn
In[196]: %timeit bn.nanmean(dat,axis=1)
1000 loops, best of 3: 1.05 ms per loop
In[197]: from scipy import stats
In[198]: %timeit stats.nanmean(dat)
100 loops, best of 3: 6.19 ms per loop

So the fastest is bottleneck.nanmean(dat, axis=1). scipy.stats.nanmean(dat) is not faster than numpy.nanmean(dat, axis=1).

score 0 · Answer 11 · answered Sep 27 '17 at 07:08

# I suggest doing it this way:
import numpy as np
dat  = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
dat2 = np.ma.masked_invalid(dat)  # mask NaN (and inf) entries
print(np.mean(dat2, axis=1))

answered Sep 27 '17 at 07:08

GiO

score -1 · Answer 12 · edited Jul 24 '16 at 11:52

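from numpy import shape, mean, nonzero, isnan

# dataMat is assumed to be a numpy.matrix defined elsewhere (.A gives its ndarray view)
numFeat = shape(dataMat)[1]
for i in range(numFeat):
    # mean of column i over the rows where that column is not NaN
    meanVal = mean(dataMat[nonzero(~isnan(dataMat[:, i].A))[0], i])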

edited Jul 24 '16 at 11:52

thor

answered Jul 24 '16 at 11:45

subh
