Method for missing data in rows?

Question

I am starting with programming. I would like to treat missing data by replacing it for the mean value of its column instead of its row-mean-value.

I'm trying to pre-process some data, from an Excel datasheet, similar to the one written as follows (the first column, the one with no title, is just the numbering/index; is not a real column):

    A    B    C    D   E  F
0  100  NaN  5.0  1.0  5  a
1  200  5.0  NaN  3.0  3  a
2  300  4.0  NaN  5.0  6  a
3  400  5.0  4.0  7.0  9  b
4  500  5.0  2.0  NaN  2  b
5  600  4.0  3.0  0.0  4  b

The point is that I would like to replace NaN for the mean value of its column. I am using pandas to extract the values from the datasheet. Then I try with sklearn.impute.SimpleImputer to treat these missing data but it just gives me the mean value of the rows.

dataset = pd.read_excel(io=file_name, sheet_name=sheet)
y = dataset.iloc[:, 0].values
X = dataset.iloc[:, 1:-1].values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, :])
X[:, :] = imputer.transform(X[:, :])

Does anyone know how could I do that in an efficient way? Is there any method of a class that treats this case? (I do not mind getting into another library).

Thank you very much

Before converting to `y` and `X` it's as simple as `dataset.fillna(dataset.mean())`. The index in `pandas` is a very powerful tool. — ALollz, Oct 11 '19 at 15:49

score 0 · Accepted Answer · answered Oct 11 '19 at 15:50

If you want to fill NaN values with the mean of the column, you can simply use:

data = {'A':[100,200,300,400,500,600],'B':[np.nan,5,4,5,5,4],'C':[5,np.nan,np.nan,4,2,3],'D':[1,3,5,7,np.nan,0]}
df = pd.DataFrame(data)
df = df.fillna(df.mean())
print(df)

Output:

     A    B    C    D
0  100  4.6  5.0  1.0
1  200  5.0  3.5  3.0
2  300  4.0  3.5  5.0
3  400  5.0  4.0  7.0
4  500  5.0  2.0  3.2
5  600  4.0  3.0  0.0

Method for missing data in rows?

1 Answers1