41

How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE.)

Here is my [non-]working example:

import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # [  2.  nan  nan  nan]

With NaNs removed, my expected output would be:

array([ 2.,  4.5,  6.,  nan])
Mike T

12 Answers

35

I think what you want is a masked array:

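import numpy as np
# use np.nan rather than the string 'nan' so the array stays floating point
dat = np.array([[1,2,3], [4,5,np.nan], [np.nan,6,np.nan], [np.nan,np.nan,np.nan]])
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
print(mm.filled(np.nan)) # the desired answer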

Edit: Combining all of the timing data

from timeit import Timer

# scipy.stats.nanmean was removed in later SciPy releases, so this setup needs an old SciPy
setupstr = """
import numpy as np
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""

method1 = """
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""

N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

# each ratio is relative to method1 (the masked_array approach)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))

Returns:

Time: 0.045454  Ratio: 1.000000
Time: 8.179479  Ratio: 179.950595
Time: 0.060988  Ratio: 1.341755
Time: 0.070955  Ratio: 1.561029
Time: 0.065152  Ratio: 1.433364
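
For anyone wanting to repeat this comparison on a current Python 3 stack, here is a rough sketch of the same protocol. It is not the code that produced the numbers above: it swaps scipy's nanmean for np.nanmean, uses np.random.default_rng (an assumption that NumPy 1.17 or newer is installed), and leaves out the slow pure-Python list comprehension. The dictionary labels are purely illustrative.

```python
from timeit import timeit

import numpy as np

# Same shape of test data as above: NaNs scattered in the top-left 99x99 corner.
rng = np.random.default_rng(0)
dat = rng.normal(size=(1000, 1000))
rows = rng.integers(0, 99, size=50)
cols = rng.integers(0, 99, size=50)
dat[np.ix_(rows, cols)] = np.nan

# Labels are illustrative; each callable returns the NaN-aware row means.
candidates = {
    "masked_array": lambda: np.ma.masked_array(dat, np.isnan(dat)).mean(axis=1).filled(np.nan),
    "masked_invalid": lambda: np.ma.masked_invalid(dat).mean(axis=1).filled(np.nan),
    "row-wise isfinite": lambda: np.array([r[np.isfinite(r)].mean() for r in dat]),
    "np.nanmean": lambda: np.nanmean(dat, axis=1),
}

# Two repetitions per candidate, matching N = 2 above.
timings = {name: timeit(fn, number=2) for name, fn in candidates.items()}
baseline = timings["masked_array"]
for name, t in timings.items():
    print(f"{name:<18} Time: {t:f}\tRatio: {t / baseline:f}")
```

Absolute timings and even the ranking may differ from the figures above, depending on the NumPy version and the machine.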
JoshAdel
  • 1
    I think scipy.nanmean should be the first thing you try. I wonder if it is still slow? – mathtick Nov 13 '12 at 16:12
  • @mathtick There are a variety of ways of accomplishing what the OP asked. I offered one such method that is a bit more verbose, but is faster than all of the other suggested ones that are benchmarked above, at least on my machine (this still holds true now with updated versions of scipy and numpy). – JoshAdel Nov 13 '12 at 22:19
  • 4
    @mathtick Furthermore, there is no `scipy.nanmean` method in scipy 0.10 or 0.11 as far as I can tell. There is `scipy.stats.stats.nanmean` and `scipy.stats.nanmean`, which are equivalent and I tested above. – JoshAdel Nov 13 '12 at 22:25
  • Sorry, that should be scipy.stats.nanmean ... and I'm running scipy.__version__ '0.10.1'. – mathtick Nov 14 '12 at 16:49
  • scipy.stats.nanmean and .nanstd do axis= too (with default axis=0 not None) – denis Nov 17 '12 at 17:22
  • I tested this in one dimension and `np.nansum(dat) / np.sum(~np.isnan(dat))` is slightly faster than `np.mean(np.ma.masked_array(dat, np.isnan(dat)))`. However, as pointed out earlier, bottleneck is 10x faster. – Dr. Jan-Philip Gehrcke Mar 01 '13 at 15:02
  • It seems `np.nansum(dat)` is the best. `Python 2.7.11 |Anaconda 2.4.1 (64-bit) IPython 4.0.1 In[190]: %timeit method1() 100 loops, best of 3: 7.09 ms per loop In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat] 1 loops, best of 3: 1.04 s per loop In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat]) 10 loops, best of 3: 19.6 ms per loop In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1) 100 loops, best of 3: 11.8 ms per loop In[194]: %timeit nanmean(dat,axis=1) 100 loops, best of 3: 6.36 ms per loop` – Sklavit Feb 11 '16 at 16:17
19

If performance matters, you should use bottleneck.nanmean() instead:

http://pypi.python.org/pypi/Bottleneck
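
A minimal sketch of what that looks like on the data from the question, assuming the Bottleneck package is installed and imported under the conventional alias bn. The fallback to np.nanmean is only there so the snippet still runs without Bottleneck; it is not part of the suggestion itself.

```python
import warnings

import numpy as np

try:
    import bottleneck as bn
    nanmean = bn.nanmean
except ImportError:
    # Fallback so the sketch runs without Bottleneck (slower, same result).
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# np.nanmean warns about the all-NaN row; the filter only matters for the fallback.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)  # 2.0, 4.5, 6.0 and nan for the all-NaN row
```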

deprecated
12

From numpy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:

>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
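
Applied to the 2-D array from the question, a short sketch would pass axis=1 to get one mean per row. The warnings filter is an optional extra that only hides the "Mean of empty slice" RuntimeWarning raised for the all-NaN row.

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# The last row has no valid values, so its mean is nan and NumPy warns about it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)  # 2.0, 4.5, 6.0 and nan for the all-NaN row
```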
Pont
Alexander
12

Assuming you've also got SciPy installed, there is scipy.stats.nanmean:

http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean

Shaun Dubuque
  • 5
    Just for completeness since I've timed all of the other code - `stats.stats.nanmean` is ~1.5x slower than the `np.ma` solution. – JoshAdel Mar 30 '11 at 13:37
8

A masked array with the nans filtered out can also be created on the fly:

print(np.ma.masked_invalid(dat).mean(1))
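
The one-liner returns a masked array, so the all-NaN row comes back as a masked entry rather than nan. A small sketch, using the data from the question, of turning it back into a plain array with .filled(np.nan):

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

masked = np.ma.masked_invalid(dat)   # masks NaN (and +/-inf) entries
row_means = masked.mean(axis=1)      # MaskedArray; the all-NaN row stays masked
result = row_means.filled(np.nan)    # plain ndarray with masked entries set to nan

print(result)  # 2.0, 4.5, 6.0 and nan for the all-NaN row
```

Note that masked_invalid also masks infinities, which differs from a pure NaN check if the data can contain inf.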
Sven Marnach
  • I hadn't thought to use this. It's a nice one-liner, but it's still ~1.5-2x slower than my solution in my tests. Still +1 for exposing me to a `np.ma` method that I hadn't looked at before. – JoshAdel Mar 30 '11 at 13:29
8

You can always find a workaround in something like:

numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)

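A skipna option for numpy.mean was proposed at one point, but it never shipped (NumPy 2.0 does not have it); numpy.nanmean, available from NumPy 1.8, covers this case instead.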
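
The sum-over-count workaround can be sketched step by step on the data from the question. This variant counts with ~np.isnan instead of np.isfinite so that the count matches exactly what nansum skips; that substitution, and the np.errstate block that silences the 0/0 warning for the all-NaN row, are choices made for this sketch.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

totals = np.nansum(dat, axis=1)          # NaNs are treated as zero in the sum
counts = (~np.isnan(dat)).sum(axis=1)    # number of non-NaN values per row

# The all-NaN row gives 0.0 / 0, which evaluates to nan; silence that warning.
with np.errstate(invalid="ignore", divide="ignore"):
    row_means = totals / counts

print(row_means)  # 2.0, 4.5, 6.0 and nan for the all-NaN row
```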

Benjamin
3

How about using Pandas to do this:

import numpy as np
import pandas as pd
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))

# DataFrame.mean skips NaN values by default (skipna=True)
df = pd.DataFrame(dat)
print(df.mean(axis=1))

Gives:

0    2.0
1    4.5
2    6.0
3    NaN
zbinsd
3

This is built upon the solution suggested by JoshAdel.

Define the following function:

import numpy

def nanmean(data, **args):
    # mask the NaNs, take the mean, then turn masked results back into nan
    return numpy.ma.filled(numpy.ma.masked_array(data,numpy.isnan(data)).mean(**args), fill_value=numpy.nan)

Example use:

data = [[0, 1, numpy.nan], [8, 5, 1]]
data = numpy.array(data)
print(data)
print(nanmean(data))
print(nanmean(data, axis=0))
print(nanmean(data, axis=1))

Will print out:

[[  0.   1.  nan]
 [  8.   5.   1.]]

3.0

[ 4.  3.  1.]

[ 0.5         4.66666667]
1

Alternatively, you can use laxarray, freshly uploaded, which is, among other things, a wrapper for masked arrays.

import laxarray as la
la.array(dat).mean(axis=1)

Following JoshAdel's protocol, I get:

Time: 0.048791  Ratio: 1.000000   
Time: 0.062242  Ratio: 1.275689   # laxarray's one-liner

So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.

Check out: https://github.com/perrette/laxarray

EDIT: I have also checked another module, "la" (larry), which beats everything else in these tests:

import la
la.larry(dat).mean(axis=1)

By hand, Time: 0.049013 Ratio: 1.000000
Larry,   Time: 0.005467 Ratio: 0.111540
laxarray Time: 0.061751 Ratio: 1.259889

Impressive!

Mahé
1

One more speed check for all proposed approaches:

Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]
IPython 4.0.1 -- An enhanced Interactive Python.

import numpy as np
from scipy.stats.stats import nanmean    
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
In[185]: def method1():
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan) 

In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
In[195]: import bottleneck as bn
In[196]: %timeit bn.nanmean(dat,axis=1)
1000 loops, best of 3: 1.05 ms per loop
In[197]: from scipy import stats
In[198]: %timeit stats.nanmean(dat)
100 loops, best of 3: 6.19 ms per loop

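So the fastest is 'bottleneck.nanmean(dat, axis=1)'. 'scipy.stats.nanmean(dat)' is not noticeably faster than 'nanmean(dat, axis=1)' (6.19 ms vs. 6.36 ms); note that the nanmean timed here was imported from scipy.stats.stats, not from numpy.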

Sklavit
0
# I suggest doing it this way:
import numpy as np
dat  = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
dat2 = np.ma.masked_invalid(dat)
print(np.mean(dat2, axis=1))
GiO
-1
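import numpy as np
# dataMat is assumed to be a 2-D np.matrix defined elsewhere
numFeat = np.shape(dataMat)[1]
for i in range(numFeat):
    # mean of column i, taken over the rows where that column is not NaN
    meanVal = np.mean(dataMat[np.nonzero(~np.isnan(dataMat[:, i].A))[0], i])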
thor
subh