
I am working with some of the UCI data sets, and some of them contain "?" entries in certain lines. For example:

56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0

I first used numpy.loadtxt() to load the file and tried to delete the lines containing "?" with line.contains('?'), but got a type error.

Then I switched to pandas.read_csv, but I still don't see an easy way to drop all lines that contain the character "?".

Is there an easy way to clean the data? I need a float-type data file without any "?" in it. Thanks!

flyingmouse

4 Answers


You can do this with Pandas.

import numpy as np
import pandas as pd

# the UCI file has no header row, so read it with header=None
df = pd.read_csv('file.csv', header=None)
df = df.replace('?', np.nan)  # turn the '?' markers into NaN
df = df.dropna()              # drop every row that contains a NaN
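
Note that because the column containing '?' was read in as strings, the cleaned frame may still have object dtype. A small follow-up sketch (assuming the df from above) to get the all-float data the question asks for:

df = df.astype(float)  # safe now: every remaining value parses as a number
print(df.dtypes)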
ComputerFellow

If all the other values are numeric, you can coerce the whole DataFrame with pd.to_numeric, which turns anything unparseable (such as '?') into NaN, and then use the dropna method. Note that pd.to_numeric is only available in pandas >= 0.17.0:

df.apply(pd.to_numeric, args=('coerce',)).dropna()

In [247]: df
Out[247]:
   0   1   2    3    4   5   6    7   8    9   10   11  12  13
0  56   1   2  130  221   0   2  163   0  0.0   1  0.0   7   0
1  58   1   2  125  220   0   0  144   0  0.4   2    ?   7   0
2  57   0   2  130  236   0   2  174   0  0.0   2  1.0   3   1
3  38   1   3  138  175   0   0  173   0  0.0   1    ?   3   0

In [248]: df.apply(pd.to_numeric, args=('coerce',)).dropna()
Out[248]:
   0   1   2    3    4   5   6    7   8   9   10  11  12  13
0  56   1   2  130  221   0   2  163   0   0   1   0   7   0
2  57   0   2  130  236   0   2  174   0   0   2   1   3   1
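
Put together for the file in the question, this is a minimal sketch (not part of the original answer) using the equivalent keyword form errors='coerce' and the dirty.txt/clean.txt names used elsewhere on this page:

import pandas as pd

# read the raw file without a header, coerce every column to numbers
# ('?' becomes NaN), and drop the rows that contain NaN
df = pd.read_csv('dirty.txt', header=None)
clean = df.apply(pd.to_numeric, errors='coerce').dropna()
clean.to_csv('clean.txt', header=False, index=False)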
Anton Protopopov

Create a small script to read the file line by line, and output "desirable" lines to a new file. Then operate on the cleaned data.

The way I would do it is with this:

import fileinput

for line in fileinput.input():
    if '?' not in line:
        print(line.strip())  # strip line's own newline so print doesn't add a second one

and run it in bash as follows

python script.py < dirty.txt > clean.txt

This reads from stdin and writes to stdout, and the bash redirects wire those streams to the input and output files.

An alternative pure python solution:

input_file = 'dirty.txt'
output_file = 'clean.txt'

with open(input_file) as dirty, open(output_file, 'w') as clean:
    for line in dirty:
        if '?' not in line:
            clean.write(line)  # line keeps its trailing newline, so no strip() needed here
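
Once clean.txt exists, it should load straight into a float array with the numpy.loadtxt call mentioned in the question; a short follow-on sketch (not part of the original answer):

import numpy as np

# clean.txt now contains only fully numeric, comma-separated rows
data = np.loadtxt('clean.txt', delimiter=',')
print(data.dtype)  # float64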
mleyfman
  • If bash can be used as opposed to python, the job could be accomplished with just: `grep -v "?" < dirty.txt > clean.txt` – Sean Johnson Nov 30 '15 at 07:35
  • I directly used `outfile.write(line)` after your `if '?' not in line:` check, and it is done. Thank you. – flyingmouse Nov 30 '15 at 07:39
  • Another concern: when I use `line.strip()`, I get output without `\n`. So is it necessary to use `strip()` here? – flyingmouse Nov 30 '15 at 07:40
  • @flyingmouse, when you print, it automatically adds `\n`. When writing to file, you should remove the `.strip()` as `write` doesn't add `\n` for you. – mleyfman Nov 30 '15 at 07:43

If you use pandas' read_csv you can tell it, via na_values, to treat ? entries as NaN and then call dropna:

In [9]:
import io
import pandas as pd

t = """56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""

df = pd.read_csv(io.StringIO(t), na_values=['?'], header=None)
df

Out[9]:
   0   1   2    3    4   5   6    7   8    9   10  11  12  13
0  56   1   2  130  221   0   2  163   0  0.0   1   0   7   0
1  58   1   2  125  220   0   0  144   0  0.4   2 NaN   7   0
2  57   0   2  130  236   0   2  174   0  0.0   2   1   3   1
3  38   1   3  138  175   0   0  173   0  0.0   1 NaN   3   0

In [10]:    
df.dropna()

Out[10]:
   0   1   2    3    4   5   6    7   8   9   10  11  12  13
0  56   1   2  130  221   0   2  163   0   0   1   0   7   0
2  57   0   2  130  236   0   2  174   0   0   2   1   3   1

So in your case the following should work:

import pandas as pd

# read without a header, treat '?' as NaN, and drop the incomplete rows
df = pd.read_csv('dirty.txt', header=None, na_values=['?']).dropna()
# write the cleaned numeric data back out
df.to_csv('clean.txt', header=False, index=False)
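
If what you actually need is a float array in memory rather than another file, the cleaned DataFrame converts directly. A follow-on sketch (not from the original answer; DataFrame.to_numpy needs pandas >= 0.24, on older versions use df.values.astype(float)):

import pandas as pd

df = pd.read_csv('dirty.txt', header=None, na_values=['?']).dropna()
X = df.to_numpy(dtype=float)  # plain float64 ndarray, ready for NumPy/scikit-learn
print(X.shape, X.dtype)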
EdChum