I am starting with programming. I would like to treat missing data by replacing it for the mean value of its column instead of its row-mean-value.
I'm trying to pre-process some data, from an Excel datasheet, similar to the one written as follows (the first column, the one with no title, is just the numbering/index; is not a real column):
A B C D E F
0 100 NaN 5.0 1.0 5 a
1 200 5.0 NaN 3.0 3 a
2 300 4.0 NaN 5.0 6 a
3 400 5.0 4.0 7.0 9 b
4 500 5.0 2.0 NaN 2 b
5 600 4.0 3.0 0.0 4 b
The point is that I would like to replace NaN for the mean value of its column. I am using pandas to extract the values from the datasheet. Then I try with sklearn.impute.SimpleImputer to treat these missing data but it just gives me the mean value of the rows.
dataset = pd.read_excel(io=file_name, sheet_name=sheet)
y = dataset.iloc[:, 0].values
X = dataset.iloc[:, 1:-1].values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, :])
X[:, :] = imputer.transform(X[:, :])
Does anyone know how could I do that in an efficient way? Is there any method of a class that treats this case? (I do not mind getting into another library).
Thank you very much