59

I've got a NumPy array filled mostly with real numbers, but there are a few NaN values in it as well.

How can I replace each NaN with the mean of the column it is in?

piokuc

8 Answers

95

No loops required:

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

#Obtain mean of columns as you need, nanmean is convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

#Find indices that you need to replace
inds = np.where(np.isnan(a))

#Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
nschmeller
Daniel
  • 1
    Nice answer. I didn't know nanmean existed! (+1) – Hammer Sep 08 '13 at 22:54
  • 5
    any reason you use take instead of just indexing? – Hammer Sep 08 '13 at 22:58
  • 1
    @Hammer They are adding nanmean to numpy in 1.8. Should be interesting. I use take instead of fancy indexing due to [this](http://stackoverflow.com/questions/14491480/using-numpy-take-for-faster-fancy-indexing) question. There is a lot of evidence that indexing is ~5x slower than take. Plus this works in older versions also. – Daniel Sep 08 '13 at 23:00
  • @Jaime Can you elaborate on that some? – Daniel Sep 08 '13 at 23:10
  • 8
    You can now use numpy.nanmean() instead of importing scipy: http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanmean.html – crypdick May 25 '16 at 21:06
  • for a more up to date solution see @Donald Hobson's answer – LetsPlayYahtzee Oct 23 '16 at 01:13
  • @LetsPlayYahtzee Note to use that answer you would have to clone the `col_mean` array for every row in your original array. For this particular question the two are not comparable in terms of efficiency. – Daniel Oct 23 '16 at 19:18
  • I don't think that this is true, I believe it works even if col_mean is a row – LetsPlayYahtzee Oct 23 '16 at 20:21
  • Hmm, looks like they have added additional broadcasting capabilities to `where.` – Daniel Oct 23 '16 at 21:33
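The broadcasting point raised in the last few comments can be checked with a short sketch. It assumes NumPy 1.8 or later (for np.nanmean) and uses a small made-up array. The 1-D col_mean is broadcast across the rows by np.where, so no per-row copy is needed, and the result matches the np.take version from the answer.

```python
import numpy as np

# Hypothetical sample data: NaNs mark the missing entries.
a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 6.0]])

col_mean = np.nanmean(a, axis=0)  # shape (3,), NaNs are ignored

# Variant 1: index-based replacement on a copy, as in the answer above.
filled_take = a.copy()
inds = np.where(np.isnan(filled_take))
filled_take[inds] = np.take(col_mean, inds[1])

# Variant 2: np.where broadcasts the 1-D col_mean against the 2-D array.
filled_where = np.where(np.isnan(a), col_mean, a)

print(np.array_equal(filled_take, filled_where))
```

Variant 2 returns a new array and leaves a untouched, while variant 1 writes into whichever array it is given.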
17

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.

Also note how the all-nan column is handled. Its mean is masked, since there are no elements to average, and np.where fills those positions with 0 in the output above. The method using nanmean leaves them as nan instead:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])

Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.
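If you would rather stay with np.nanmean and still get a defined value for all-nan columns, one possible sketch is below. The function name fill_nan_with_col_mean and its fallback parameter are made up for illustration; the RuntimeWarning for empty slices is silenced locally, and all-nan columns get the fallback value.

```python
import warnings

import numpy as np


def fill_nan_with_col_mean(a, fallback=0.0):
    """Return a copy of `a` with NaNs replaced by column means.

    `fallback` (hypothetical parameter) is used for columns that
    contain no valid values at all.
    """
    a = np.asarray(a, dtype=float)
    with warnings.catch_warnings():
        # np.nanmean warns "Mean of empty slice" for all-NaN columns.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        col_mean = np.nanmean(a, axis=0)
    # All-NaN columns have a NaN mean; swap in the fallback there.
    col_mean = np.where(np.isnan(col_mean), fallback, col_mean)
    return np.where(np.isnan(a), col_mean, a)


a = np.array([[0.0, np.nan, 10.0, np.nan],
              [1.0, 6.0, np.nan, np.nan],
              [2.0, 7.0, 12.0, np.nan],
              [3.0, 8.0, np.nan, np.nan],
              [np.nan, 9.0, 14.0, np.nan]])
filled = fill_nan_with_col_mean(a)
print(filled)
```

Passing fallback=np.nan keeps the all-nan column as nan, which is the behaviour of the plain nanmean approach.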


Row-wise mean

Replacing nan values with the row-wise mean instead of the column-wise mean requires a tiny change for broadcasting to work:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
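The same row-wise replacement can also be sketched with np.nanmean, where keepdims=True keeps the reduced axis so the result already has shape (rows, 1) and broadcasts without np.newaxis. This is only an illustration on the rows that contain at least one valid value; the all-nan row is left out here because nanmean would return nan for it and emit a warning.

```python
import numpy as np

a = np.array([[0.0, 1.0, 2.0, 3.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, np.nan, 12.0, np.nan, 14.0]])

# keepdims=True gives shape (3, 1) instead of (3,), so each row mean
# lines up with its own row when broadcast against `a`.
row_mean = np.nanmean(a, axis=1, keepdims=True)

filled = np.where(np.isnan(a), row_mean, a)
print(filled)
```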
Praveen
  • IMO there's nothing wrong with having `np.nan` values as means for all-NaN column case. But it is indeed a neat case of use for masked arrays. – Vlas Sokolov Oct 24 '16 at 00:39
  • @VlasSokolov Well, having a mask is even better I think. i.e., making `a` into a masked array and keeping it masked even _after_ applying the mean. Then you don't need to worry about performing operations on it, which might cause the `nan`s to "spread" to the non-`nan` values. – Praveen Oct 24 '16 at 00:44
6

If partial is your original data and replace is an array of the same shape containing the averaged values, this code takes each value from partial where one exists and from replace where partial is NaN.

Complete = np.where(np.isnan(partial), replace, partial)
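One way to build such a replace array, sketched here with made-up data, is to compute the column means with np.nanmean and expand them to the shape of partial with np.broadcast_to. The expansion is a read-only view, so nothing is copied per row.

```python
import numpy as np

# Hypothetical input with two missing entries.
partial = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

# Column means ignoring NaNs, viewed with the same shape as `partial`.
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```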
Donald Hobson
4

Alternative: Replacing NaNs with interpolation of columns.

import numpy as np

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X

Example use:

X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan]])

X_complete = interpolate_nans(X_incomplete)

print(X_complete)
[[10,     20,     30    ],
 [20,     30,     40    ],
 [30,     40,     50    ],
 [40,     50,     50    ]]

I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
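Note that interpolate_nans writes into the array it is given. A variant that leaves the input alone might look like the sketch below; the name interpolate_nans_copy is made up, and skipping all-NaN columns is an added assumption, since np.interp has no sample points to work with there.

```python
import numpy as np


def interpolate_nans_copy(X):
    """Return a float copy of X with NaNs filled by column-wise interpolation."""
    X = np.array(X, dtype=float)  # np.array copies by default
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:, j])
        # Nothing to fill, or nothing to interpolate from: skip the column.
        if not mask_j.any() or mask_j.all():
            continue
        X[mask_j, j] = np.interp(np.flatnonzero(mask_j),
                                 np.flatnonzero(~mask_j),
                                 X[~mask_j, j])
    return X


X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan]])

X_complete = interpolate_nans_copy(X_incomplete)
print(X_complete)
```

NaNs before the first or after the last valid sample in a column take the nearest valid value, because np.interp clamps outside the sample range; that is why the last entry of the third column becomes 50.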

Ulf Aslak
2

This isn't very clean, but I can't think of a way to do it other than iterating:

#example
a = np.arange(16, dtype = float).reshape(4,4)
a[2,2] = np.nan
a[3,3] = np.nan

indices = np.where(np.isnan(a)) #returns a tuple of two arrays: row indices and column indices
for row, col in zip(*indices):
    a[row,col] = np.mean(a[~np.isnan(a[:,col]), col])
Hammer
2

To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

In [231]: a
Out[231]: 
array([[0, 3, 6],
       [2, 0, 0]])


In [232]: col_mean = np.nanmean(a, axis=0)
Out[232]: array([ 1. ,  1.5,  3. ])

In [228]: np.where(np.equal(a, 0), col_mean, a)
Out[228]: 
array([[ 1. ,  3. ,  6. ],
       [ 2. ,  1.5,  3. ]])
LetsPlayYahtzee
0

Using simple functions with loops:

import numpy as np

a=[[0.93230948, np.nan, 0.47773439, 0.76998063],
  [0.94460779, 0.87882456, 0.79615838, 0.56282885],
  [0.94272934, 0.48615268, 0.06196785, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan]]

print("------- original array -----")
for aa in a:
    print(aa)

# GET COLUMN MEANS: 
ta = np.array(a).T.tolist()                         # transpose the array; 
col_means = list(map(lambda x: np.nanmean(x), ta))  # get means; 
print("column means:", col_means)

# REPLACE NAN ENTRIES WITH COLUMN MEANS: 
nrows = len(a); ncols = len(a[0]) # get number of rows & columns; 
for r in range(nrows):
    for c in range(ncols):
        if np.isnan(a[r][c]):
            a[r][c] = col_means[c]

print("------- means added -----")
for aa in a:
    print(aa)

Output:

------- original array -----
[0.93230948, nan, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, nan]
[0.64940216, 0.74414127, nan, nan]
[0.64940216, 0.74414127, nan, nan]

column means: [0.82369018599999999, 0.71331494500000003, 0.44528687333333333, 0.66640474000000005]

------- means added -----
[0.93230948, 0.71331494500000003, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]

The for loops can also be written with list comprehension:

new_a = [[col_means[c] if np.isnan(a[r][c]) else a[r][c] 
            for c in range(ncols) ]
        for r in range(nrows) ]
rnso
-3

You might want to try this built-in function:

x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])
ifryed