Replacing nan with mean

Question

I would like to replace missing data points in a text file with the mean of each column, using Python.

So, my idea was:

1. Read each column from the text file
2. Calculate the mean of each column
3. Replace nan with the calculated mean in each column
4. Write the columns back to a new text file

I think I am OK up to step 2, but I am having trouble with steps 3 and 4. My code is as follows:

for columns in ( raw.strip().split() for raw in f ):
    a.append(columns[c])
    x = np.array(a, float)
    y = np.ma.masked_array(x,np.isnan(x))
    y1 = np.mean(y)
    a1 = ' '.join(a)
    a1.replace("nan", "y1")
    f1 = open("practice.txt", "w")
    f1.write(a1)

As you can see, the problem is in replacing nan with the mean using the 'replace' method, because it only works with strings. I would really appreciate any help or suggestions. Part of my data looks like this:

1.60566 nan 2.00755 2.32407
1.502   nan 1.36522 1.555
0.63333 nan 1.56102 2.08929
nan nan 0.87451 1.06667
2.5 nan 1.88889 1.0661
3.88197 nan 3.0875  2.75909
4.02692 nan 3.36154 3.92895
5.9907  nan 5.29535 5.82245
6.16111 2.67317 6.04074 6.25588
6.88269 2.62241 5.43958 6.07
5.92    2.48627 5.91818 6.75862
6.93429 6.17333 7.34    7.76538
8.25143 7.925   7.8087  8.725
8.1025  8.19429 8.11563 8.80937
8.12105 8.145   7.83889 8.37576
7.47292 8.65    8.35536 8.61081
8.10392 8.66032 8.74082 9.65484
10.03036    10.74727    10.634  10.50961

I want to replace those nans with the mean value of each column.

Yes, you are right, Antimony. I build a string so that I can use 'replace', but it doesn't work. – Isaac Apr 09 '13 at 21:44

cmd · Answer 1 · 2013-04-09T22:21:08.430

2

Is your problem that y1 is not a string? You can just use: a1.replace("nan", str(y1))

edited Apr 09 '13 at 22:21

answered Apr 09 '13 at 21:44

score 2 · Accepted Answer · answered Apr 09 '13 at 21:49

Remember that replace does not modify the string in place; it returns a new string, so you have to assign the result, like this:

a1 = a1.replace("nan", str(y1))
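A minimal sketch of why the assignment matters. The values of a1 and y1 here are made up for illustration; they stand in for the joined column string and the column mean from the question.

```python
# Illustrative values only: a1 and y1 stand in for the joined column
# string and the column mean from the question.
a1 = "1.5 nan 2.5 nan"
y1 = 2.0

# Calling replace without assigning discards the result; a1 is unchanged.
a1.replace("nan", str(y1))
unchanged = a1

# Rebinding a1 to the returned string keeps the substitution.
a1 = a1.replace("nan", str(y1))

print(unchanged)  # 1.5 nan 2.5 nan
print(a1)         # 1.5 2.0 2.5 2.0
```

Note also that the question's code passes the literal string "y1" rather than the value of the variable y1, which is why str(y1) is needed.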

Óscar López

@Isaac you're welcome! If this or any other answer was helpful for you, please consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it by clicking on the check mark to its left - that's the way you say "thank you" in Stack Overflow ;) – Óscar López Apr 09 '13 at 21:55
Do you have any idea how to write the resulting string a1 as a column in a new text file, rather than as a row? – Isaac Apr 09 '13 at 22:16
Not without knowing the rest of the elements of the column. What you can do is store the whole thing (all rows and columns) in a matrix (a list of sublists, each sublist being a row), then transpose that matrix and write the result row by row. – Óscar López Apr 09 '13 at 22:21
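The list-of-sublists idea from the comment above could be sketched roughly as follows, in plain Python without NumPy. The sample lines and the names rows, columns, and filled_rows are hypothetical; in practice the lines would come from the input file and the output string would be written to a new file.

```python
import math

# Hypothetical sample rows; in practice these would be read from the file.
lines = ["1.0 nan 3.0", "nan 4.0 5.0", "3.0 6.0 nan"]

# Parse into a list of rows; float("nan") handles the "nan" tokens.
rows = [[float(tok) for tok in line.split()] for line in lines]

# Transpose rows into columns so each column can be processed on its own.
columns = [list(col) for col in zip(*rows)]

filled_columns = []
for col in columns:
    present = [v for v in col if not math.isnan(v)]
    # A column with no valid values has no mean, so it stays nan.
    mean = sum(present) / len(present) if present else float("nan")
    filled_columns.append([mean if math.isnan(v) else v for v in col])

# Transpose back to rows and format one row per output line.
filled_rows = [list(row) for row in zip(*filled_columns)]
output = "\n".join(" ".join(f"{v:.5f}" for v in row) for row in filled_rows)
print(output)
```

Transposing twice keeps the output in the same row and column layout as the input, which addresses the concern about a column being written out as a row.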

unutbu · Answer 3 · 2013-04-09T23:37:54.773

2

You could use the masked array filled method:

import numpy as np

filename = '/tmp/data'
with open(filename, 'w') as f:
    f.write('''
1 2 nan
2 nan 3
nan 3 4
nan nan nan
''')

arr = np.genfromtxt(filename)
print(arr)
# [[  1.   2.  nan]
#  [  2.  nan   3.]
#  [ nan   3.   4.]
#  [ nan  nan  nan]]

mask = np.isnan(arr)
masked_arr = np.ma.masked_array(arr, mask)
means = np.mean(masked_arr, axis=0)

print(means)
# [1.5 2.5 3.5]

With the above setup,

print(masked_arr.filled(means))

yields

[[ 1.   2.   3.5]
 [ 2.   2.5  3. ]
 [ 1.5  3.   4. ]
 [ 1.5  2.5  3.5]]

Then, to write the array to a file, you could use np.savetxt:

np.savetxt(filename, masked_arr.filled(means), fmt='%0.2f')
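As an alternative sketch (not the method shown above), NumPy releases that provide np.nanmean allow the same fill without a masked array. The in-memory text below is hypothetical and stands in for the data file.

```python
import io

import numpy as np

# Hypothetical in-memory data standing in for the text file.
text = "1 2 nan\n2 nan 3\nnan 3 4\n"
arr = np.genfromtxt(io.StringIO(text))

# Column means computed while ignoring nan entries.
col_means = np.nanmean(arr, axis=0)

# Where arr is nan, take the column mean (broadcast across rows);
# elsewhere keep the original value.
filled = np.where(np.isnan(arr), col_means, arr)
print(filled)
```

The resulting array can be passed to np.savetxt in the same way as masked_arr.filled(means).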

edited Apr 09 '13 at 23:37

answered Apr 09 '13 at 22:07

It looks pretty useful! Thanks unutbu! I will try now. – Isaac Apr 09 '13 at 22:18
I have a question about this method. You seem to pass the rows as a string inside f.write(). The data in my text file is too big for that. Is there any way to use my whole text file with this approach? – Isaac Apr 10 '13 at 02:53
The `f.write` was used just to create some data in a file. You already have your data in a file. So you can skip the `f.write` part. Is your data so large that `arr = np.genfromtxt(filename)` fails? – unutbu Apr 10 '13 at 06:18
