
I have a dataset in which missing values are marked with ? (not NaN). I want to replace each of them with the mean of its column. For example, my dataset looks like this:

0,1,2,3 
1,2,5,1.2 
2,4,8,2.3 
3,5,?,1 

I want to replace ? with (2+5+8)/3 = 5, so the data will look like this:

0,1,2,3 
1,2,5,1.2 
2,4,8,2.3 
3,5,5,1 

I wrote this code based on this page and this question.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
dataset_dataframe = pd.read_csv(DATASET_PATH, header = None)
for i in range(0 , len(dataset_dataframe.columns)-1):
    if dataset_dataframe[i].dtype != np.number:
        dataset_dataframe[i] = dataset_dataframe[i].replace('?' , np.nan)
        print("%s -\n %s" %(i , dataset_dataframe[i]))
        imputer_miss_data = SimpleImputer(missing_values=np.nan, strategy='mean')
        corrected_column = imputer_miss_data.fit_transform(dataset_dataframe[i])
        dataset_dataframe[i]=corrected_column
        print(dataset_dataframe[i])

But it doesn't work. How can I use SimpleImputer to replace the missing data, which is shown as ? in the dataset, with the mean of its column?

Atefeh Rashidi

1 Answer


I am not sure which error you get; most likely it is a ValueError, because SimpleImputer requires 2D input. Here is a working example based on your code (note the reshape):

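import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data with '?' marking the missing values
df = pd.DataFrame(data=[[0, 1, 2, 3],
                        [1, 2, 5, 1.2],
                        [2, 4, 8, 2.3],
                        ['?', 5, '?', 1]],
                  columns=['a', 'b', 'c', 'd'])
df = df.replace('?', np.nan)

for col in df.columns:
    if df[col].isna().any():
        imputer_miss_data = SimpleImputer(missing_values=np.nan, strategy='mean')
        # SimpleImputer expects 2D input, so reshape the column to (n_samples, 1)
        corrected_column = imputer_miss_data.fit_transform(df[col].values.reshape(-1, 1))
        # Flatten back to 1D before assigning to the column
        df[col] = corrected_column.ravel()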
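
To see why the reshape matters, here is a minimal sketch that passes the same kind of column to SimpleImputer once as a 1D Series and once as a 2D array. The names `col`, `raised` and `filled` are only for illustration, and the values are column c from the example with ? already converted to NaN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column 'c' from the example, with the missing value already stored as NaN
col = pd.Series([2.0, 5.0, 8.0, np.nan])
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

try:
    imputer.fit_transform(col)  # 1D input
    raised = False
except ValueError:
    raised = True  # scikit-learn rejects 1D input here

# Reshape to a single column of shape (n_samples, 1), impute, then flatten back
filled = imputer.fit_transform(col.to_numpy().reshape(-1, 1)).ravel()
print(raised, filled)
```

The 1D call should fail with a ValueError, while the reshaped call fills the gap with the mean of 2, 5 and 8.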

However, there is no need to iterate over the columns. Just apply the imputer to the whole DataFrame:

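imputer_miss_data = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a NumPy array, so wrap it to keep the column labels
df = pd.DataFrame(imputer_miss_data.fit_transform(df), columns=df.columns)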
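
If the data comes from a CSV file as in the question, one option is to let pandas turn ? into NaN while reading, via the `na_values` parameter of `read_csv`. The sketch below assumes the file looks like the sample in the question; `csv_text` and `io.StringIO` are stand-ins for the real DATASET_PATH, and the variable names are illustrative.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for the real file; the contents mirror the sample in the question
csv_text = "0,1,2,3\n1,2,5,1.2\n2,4,8,2.3\n3,5,?,1\n"

# na_values='?' makes pandas parse '?' as NaN, so every column stays numeric
data = pd.read_csv(io.StringIO(csv_text), header=None, na_values='?')

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Wrap the returned NumPy array to keep the original column labels
filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(filled)
```

Reading ? as NaN up front avoids the separate replace step and the object-dtype columns it would otherwise leave behind.
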
Poe Dator