I am trying to clean a dataset in pandas. The information is stored in a CSV file and is imported using:

tester = pd.read_csv('date.csv')

Every column contains a '?' wherever a value is missing. For example, the age column contains 9 question marks (?).

I am trying to set all the question marks to NaN. I have tried:

tester = pd.read_csv('date.csv', na_values=["?"])

tester['age'].replace("?", np.NaN)

tester.replace('?', np.NaN)


for col in tester:
    print(tester[col].value_counts(dropna=False))

This still reports 0 missing values for age, when I know there are 9 question marks. I assume the check is failing because the value is never recognised as being exactly '?'.

I have looked at the CSV file in Notepad and there are no spaces or other characters around the '?'.
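
A programmatic version of that check, as a rough sketch: the inline CSV text below is a made-up stand-in for date.csv (the real file is not shown), and `raw` and `hidden` are illustrative names. Printing `repr()` of the distinct values makes stray whitespace visible, and stripping before comparing counts cells that are a '?' once whitespace is ignored.

```python
import io
import pandas as pd

# Hypothetical stand-in for date.csv; note the spaces around some values.
raw = "age, bp\n48, 80\n?, 100\n ?, 70\n"
tester = pd.read_csv(io.StringIO(raw))

# repr() shows leading/trailing whitespace in column names and values.
for col in tester.columns:
    print(repr(col), [repr(v) for v in tester[col].unique()])

# Count cells that are a lone "?" after surrounding whitespace is removed.
hidden = {
    col: int(tester[col].astype(str).str.strip().eq("?").sum())
    for col in tester.columns
}
print(hidden)
```

If the count here is larger than the number of exact '?' matches, whitespace (or another invisible character) is the likely reason the replacement is not taking effect.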

Is there any way of forcing this so that it is recognised?

Sample data: [image]

TJ15
  • If `tester = pd.read_csv('date.csv', na_values=["?"])` is not working, it seems to be a data-related issue – jezrael Nov 12 '18 at 15:24
  • If there are spaces after the commas, `tester = pd.read_csv('date.csv', na_values=["?"], skipinitialspace=True)` should help... It is best to check the data in an editor like Notepad++ – jezrael Nov 12 '18 at 15:25
  • Since there can be multiple '??', use `data['A'].apply(lambda x: x if '?' not in x else 'NaN')` – iamklaus Nov 12 '18 at 15:28
  • @jezrael I have opened this in Notepad++ and there are no extra characters. In record 11, age is the first column: ?,100,1.015,2. I have tried skipinitialspace=True but it made no difference – TJ15 Nov 12 '18 at 15:34
  • Then no idea... If the data are not confidential, you can share them – jezrael Nov 12 '18 at 15:35
  • @TJ15, can you try `pd.read_csv('file', sep=', ', engine='python')`? There look to be spaces after the commas in the file. I faced a similar problem and that worked – Karn Kumar Nov 12 '18 at 15:44
  • @pygo Thanks but when I try that I get "Expected 1 fields in line 32, saw 2. Error could possibly be due to quotes being ignored when a multi-char delimiter is used." – TJ15 Nov 12 '18 at 15:49
  • @jezrael Don't think I can upload the csv to here but tried to dump some into the description – TJ15 Nov 12 '18 at 15:51
  • @TJ15 - I think the problem cannot be solved without the data. You need to upload it to gdocs, Dropbox or similar, because the problem cannot be seen from the picture. – jezrael Nov 12 '18 at 15:53
  • What if you try `tester = tester[~(tester == '?').any(axis=1)]`, which drops the rows where any value is '?' – Karn Kumar Nov 12 '18 at 15:54

2 Answers

0

`read_csv` has an `na_values` parameter; see the pandas documentation for `read_csv`.

df = pd.read_csv('date.csv', na_values='?')
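
As a small sketch of how `na_values` interacts with spaces after the delimiter: the CSV text below is invented for illustration (it is not the asker's file), and `plain` and `spaced` are hypothetical names. The point is that `na_values` is compared against the parsed cell text, so `skipinitialspace=True` (suggested in the comments) may be needed when cells look like ' ?'.

```python
import io
import pandas as pd

# Made-up sample: a space follows every comma, including before one "?".
raw = "age, bp\n48, 80\n?, 100\n51, ?\n"

# Read once as-is and once with whitespace after the delimiter skipped.
plain = pd.read_csv(io.StringIO(raw), na_values="?")
spaced = pd.read_csv(io.StringIO(raw), na_values="?", skipinitialspace=True)

# Compare how many cells each read treated as missing.
print(plain.isna().sum())
print(spaced.isna().sum())
```

With `skipinitialspace=True` the column names are also read without the leading space, which avoids having to refer to a column as `' bp'`.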
ALollz
Alex
0

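You are very close:

# It looks like the file has a space after each comma, so pass that as `sep`
# (a multi-character separator needs the Python engine)
tester = pd.read_csv('date.csv', sep=', ', engine='python')

# replace() returns a new Series, so assign the result back
tester['age'] = tester['age'].replace('?', np.nan)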

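There seems to be a problem somewhere in the data, so to debug:

# Skip malformed lines (error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there)
pd.read_csv('file', error_bad_lines=False)

# Drop every row that contains a '?' in any column
tester = tester[~(tester == '?').any(axis=1)]

OR

pd.read_csv('file', sep='delimiter', header=None)

OR

pd.read_csv('file', header=None, sep=', ', engine='python')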
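
If changing the separator raises parsing errors (as reported in the comments), another option is to clean up after reading. This is only a sketch under assumptions: the sample text stands in for the real file, and `stripped`, `cleaned` and `numeric` are illustrative names. It reads everything as text, strips whitespace, replaces '?' with NaN and then converts the columns to numbers.

```python
import io
import numpy as np
import pandas as pd

# Made-up sample standing in for date.csv.
raw = "age, bp\n48, 80\n?, 100\n51, ?\n"

# Read every column as text so nothing is converted prematurely.
tester = pd.read_csv(io.StringIO(raw), dtype=str)

# Strip whitespace from the column names and from every cell.
tester.columns = tester.columns.str.strip()
stripped = tester.apply(lambda s: s.str.strip())

# Replace the placeholder and assign the result (replace is not in-place).
cleaned = stripped.replace("?", np.nan)

# Convert to numbers; anything that still cannot be parsed becomes NaN.
numeric = cleaned.apply(pd.to_numeric, errors="coerce")
print(numeric.isna().sum())
```

Note that `errors="coerce"` silently turns any other non-numeric text into NaN as well, so it is worth checking the counts against what is expected.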
Karn Kumar