
I have been working with the Adult Census dataset, available at: https://archive.ics.uci.edu/ml/datasets/census+income

From what I have read, it contains some missing values marked with "?". I am building a classifier, so I want to replace those values with the mode, but I have run into some problems. My source code is below, with comments on the issues I have encountered:

import pandas as pd
from sklearn import preprocessing
import numpy as np

def open(fileR):
    head=["gt lt 50","age","workclass","fnlwgt","edu","edu-num","mar-sta","occ","rela","race","sex","cap-gain","cap-loss","country","hpw"]
    f=pd.read_csv(fileR,sep=',')
    f.columns=head
    f.replace('?',np.nan)   #I want to replace the ? values with nan 
    f = f.fillna(f.mode().iloc[:,1])        #replace the nan values with the mode
    print (f.iloc[:,1])

but the values I get still contain the ? sign, for example:

25                 Private
26                       ?
27                 Private
28                 Private
29               Local-gov

I want to replace all the ? values in the categorical variables of my f dataframe with the mode. Is there some step that I am missing?

P.S.

I have also tried the following for checking just one column:

    f.replace('?',np.nan,inplace=True)
    f = f.fillna(f.mode().iloc[:,1])
    print (f.iloc[:,1])

but it still prints the ? values.

Thanks

Little
  • assign it back `f=f.replace('?',np.nan)` ? – anky Aug 06 '19 at 14:27
  • or f.replace('?',np.nan,inplace=True) – BENY Aug 06 '19 at 14:27
  • @anky_91 I have tried that and with inplace also and still when I print the values of f the ? still appears – Little Aug 06 '19 at 14:31
  • @WeNYoBen thanks, but still it does not work – Little Aug 06 '19 at 14:36
  • @Little f=f.replace({'?':np.nan},regex=True) – BENY Aug 06 '19 at 14:36
  • it raises the error "nothing to repeat", which is weird – Little Aug 06 '19 at 14:38
  • f=f.apply(lambda x : x.str.strip()).replace('?',np.nan) – BENY Aug 06 '19 at 14:39
  • @Little This is not in fact a duplicate. Thanks for linking to the data. I had a look and the files do not use the standard comma delimiter; rather, they use a comma with a space after. So when you load the data, use `sep=', '` instead. The reason you couldn't replace the `'?'` strings is because they didn't exist in the data. They were actually `' ?'`. In fact, _all_ of the values had a space prefixing them. This will fix that issue. – brentertainer Aug 07 '19 at 00:39
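
The diagnosis in the last comment can be checked with a small sketch. The rows and the shortened column list below are made up for illustration and only mimic the "comma plus space" layout; `load_and_impute` is a hypothetical helper name. It uses `skipinitialspace=True` and `na_values="?"`, which should behave much like the suggested `sep=', '` followed by a replace, and it assumes the file has no header row. It also takes the first row of `mode()` rather than its second column, since `fillna` matches a Series against column names:

```python
import io

import pandas as pd

# Made-up rows in the same "comma plus space" layout as the data file
# (illustrative values only, not copied from the real dataset).
SAMPLE = (
    "39, State-gov, Bachelors\n"
    "50, ?, Bachelors\n"
    "38, Private, HS-grad\n"
    "53, Private, ?\n"
)
COLS = ["age", "workclass", "edu"]  # hypothetical, shortened column list


def load_and_impute(source):
    # header=None keeps the first data row from being consumed as a header.
    # skipinitialspace=True drops the blank after each comma, so the
    # placeholder is read as "?" and na_values can turn it into NaN.
    df = pd.read_csv(source, header=None, names=COLS,
                     skipinitialspace=True, na_values="?")
    # mode() returns a DataFrame; its first row holds one mode per column,
    # and fillna aligns that Series with the columns by name.
    return df.fillna(df.mode().iloc[0])


# Default parsing keeps the leading space, so an exact match on "?" fails.
naive = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLS)
print(repr(naive.loc[1, "workclass"]))

clean = load_and_impute(io.StringIO(SAMPLE))
print(clean)
```

With a real file, the same function could be called as `load_and_impute("adult.data")` after swapping in the full column list. Note that this fills every column with its mode, numeric ones included, which may not be what you want for a classifier.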
