I would like to replace missing data points (nan) in a text file with the mean of their column, using Python.

So, my idea was:

  1. Read each column from text file
  2. Calculate a mean of each column
  3. Replace nan with calculated mean in each column
  4. Write them back to a new text file

I think I am OK up to step 2, but I am having trouble with steps 3 and 4. My code is as follows:

for columns in ( raw.strip().split() for raw in f ):
    a.append(columns[c])
    x = np.array(a, float)
    y = np.ma.masked_array(x,np.isnan(x))
    y1 = np.mean(y)
    a1 = ' '.join(a)
    a1.replace("nan", "y1")
    f1 = open("practice.txt", "w")
    f1.write(a1)

As you can see, the problem is in replacing nan with the mean using replace, because replace only works with strings. I would really appreciate any help or suggestions. Part of my data looks like this:

1.60566 nan 2.00755 2.32407
1.502   nan 1.36522 1.555
0.63333 nan 1.56102 2.08929
nan nan 0.87451 1.06667
2.5 nan 1.88889 1.0661
3.88197 nan 3.0875  2.75909
4.02692 nan 3.36154 3.92895
5.9907  nan 5.29535 5.82245
6.16111 2.67317 6.04074 6.25588
6.88269 2.62241 5.43958 6.07
5.92    2.48627 5.91818 6.75862
6.93429 6.17333 7.34    7.76538
8.25143 7.925   7.8087  8.725
8.1025  8.19429 8.11563 8.80937
8.12105 8.145   7.83889 8.37576
7.47292 8.65    8.35536 8.61081
8.10392 8.66032 8.74082 9.65484
10.03036    10.74727    10.634  10.50961

I want to replace those nans with the mean value of each column.

unutbu
Isaac

3 Answers

Is your problem that `y1` is not a string? You can just use: a1.replace("nan", str(y1))

cmd

Remember that replace does not modify the string in place. It returns a new string, so you have to assign the result, like this:

a1 = a1.replace("nan", str(y1))
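
A minimal sketch of the difference, assuming made-up values (the names `a1` and `y1` follow the question; the numbers are purely illustrative):

```python
# Illustrative values only; a1 and y1 mirror the names in the question.
a1 = "1.5 nan 2.5 nan"
y1 = 2.0  # stands in for the column mean

# str.replace returns a new string; the original is left untouched.
a1.replace("nan", str(y1))
unchanged = a1

# Assigning the result back keeps the replacement.
a1 = a1.replace("nan", str(y1))
print(unchanged)  # 1.5 nan 2.5 nan
print(a1)         # 1.5 2.0 2.5 2.0
```

Note that the mean has to go through `str()`; a quoted `"y1"` would insert the literal text y1 rather than the value.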
Óscar López
  • @Isaac you're welcome! If this or any other answer was helpful for you, please consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it by clicking on the check mark to its left - that's the way you say "thank you" in Stack Overflow ;) – Óscar López Apr 09 '13 at 21:55
  • Do you have any idea how to write the resulting string a1 as a column in a new text file, rather than as a row? – Isaac Apr 09 '13 at 22:16
  • Not without knowing the rest of the elements of the column. What you can do is store the whole thing (all rows and columns) in a matrix (a list of sublists, each sublist being a row), then transpose that matrix and write the result row by row – Óscar López Apr 09 '13 at 22:21
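
The transpose-then-write idea from the last comment could look roughly like the sketch below. It assumes each sublist holds one already-processed column as strings; the data, the variable names and the output path are all hypothetical.

```python
import os
import tempfile

# Hypothetical data: each sublist is one processed column (as strings).
columns = [
    ["1.0", "2.0", "3.0"],   # first column
    ["4.0", "5.0", "6.0"],   # second column
]

# zip(*columns) transposes the list of columns into a sequence of rows.
rows = list(zip(*columns))

# Write one row per line; the path is a temporary file for illustration.
out_path = os.path.join(tempfile.mkdtemp(), "columns_as_rows.txt")
with open(out_path, "w") as out:
    for row in rows:
        out.write(" ".join(row) + "\n")

# Read the file back to show what was written.
with open(out_path) as check:
    text = check.read()
print(text)
```

Collecting all columns first is what makes this work: a line in a text file can only be written once every value on it is known.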

You could use the masked array filled method:

import numpy as np

filename = '/tmp/data'
with open(filename, 'w') as f:
    f.write('''
1 2 nan
2 nan 3
nan 3 4
nan nan nan
''')

arr = np.genfromtxt(filename)
print(arr)
# [[  1.   2.  nan]
#  [  2.  nan   3.]
#  [ nan   3.   4.]
#  [ nan  nan  nan]]

mask = np.isnan(arr)
masked_arr = np.ma.masked_array(arr, mask)
means = np.mean(masked_arr, axis=0)

print(means)
# [1.5 2.5 3.5]

With the above setup,

print(masked_arr.filled(means))

yields

[[ 1.   2.   3.5]
 [ 2.   2.5  3. ]
 [ 1.5  3.   4. ]
 [ 1.5  2.5  3.5]]

Then, to write the array to a file, you could use np.savetxt:

np.savetxt(filename, masked_arr.filled(means), fmt='%0.2f')
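
As an alternative that is not part of the approach above, the same fill can probably be done without a masked array by combining `np.nanmean` with `np.where`. This is a sketch on the same small example array, not a drop-in replacement:

```python
import numpy as np

arr = np.array([[1.0, 2.0, np.nan],
                [2.0, np.nan, 3.0],
                [np.nan, 3.0, 4.0],
                [np.nan, np.nan, np.nan]])

# Column means that ignore NaN entries.
col_means = np.nanmean(arr, axis=0)

# Where an entry is NaN take the column mean, otherwise keep the entry.
# col_means has one value per column and broadcasts across the rows.
filled = np.where(np.isnan(arr), col_means, arr)
print(filled)
```

One caveat: a column that contains only NaN has no mean, so `np.nanmean` emits a warning and that column stays NaN.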
unutbu
  • It looks pretty useful! Thanks unutbu! I will try now. – Isaac Apr 09 '13 at 22:18
  • I have a question about this method. You seem to put the rows of data as a string inside f.write(). The data in my text file is too big for that. Is there any way to use my whole text file instead of this f.write()? – Isaac Apr 10 '13 at 02:53
  • The `f.write` was used just to create some data in a file. You already have your data in a file. So you can skip the `f.write` part. Is your data so large that `arr = np.genfromtxt(filename)` fails? – unutbu Apr 10 '13 at 06:18
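
To illustrate the point about skipping `f.write`, here is a sketch that applies the same calls to a file that already exists and saves the result under a different name. The file names are hypothetical, and the first few lines only create a stand-in input (four rows copied from the question) so that the example runs on its own.

```python
import os
import tempfile

import numpy as np

# Stand-in input file so the sketch runs on its own; the rows are copied
# from the question. With real data, skip this and set in_path directly.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")    # hypothetical file name
out_path = os.path.join(workdir, "filled.txt")  # hypothetical file name
with open(in_path, "w") as f:
    f.write("1.60566 nan 2.00755 2.32407\n"
            "nan nan 0.87451 1.06667\n"
            "6.16111 2.67317 6.04074 6.25588\n"
            "6.88269 2.62241 5.43958 6.07\n")

# Same calls as in the answer, applied to the existing file.
arr = np.genfromtxt(in_path)
masked_arr = np.ma.masked_array(arr, np.isnan(arr))
means = np.mean(masked_arr, axis=0)
filled = masked_arr.filled(means)

# Write to a separate file so the original data is not overwritten.
np.savetxt(out_path, filled, fmt="%0.5f")
result = np.loadtxt(out_path)
print(result)
```

Writing to a second path is a design choice here: the answer's `np.savetxt(filename, ...)` overwrites the input, which may or may not be wanted.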