I am trying to replace missing values in a specific column of a DataFrame, but I am running into issues. I have tried:
from sklearn.impute import SimpleImputer
fill_0_with_mean = SimpleImputer(missing_values=0, strategy='mean')
X_train['Age'] = fill_0_with_mean.fit_transform(X_train['Age'])
and
X_train[:,15] = fill_0_with_mean.fit_transform(X_train[:,15])
and
X_train[:,15:16] = fill_0_with_mean.fit_transform(X_train[:,15:16])
and
X_train['Age'] = fill_0_with_mean.fit_transform(X_train['Age'].values)
and
X_train[:,15:16] = fill_0_with_mean.fit_transform(X_train[:,15:16].values)
But I keep getting errors such as
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
or
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
I have both zero and missing (NaN) values in my data. Can the imputer only handle one of the two at a time? How do I go about doing this? I have also tried casting my Age column to an integer type:
X_train['Age'] = X_train['Age'].as_type('int32')
But this just gives me other errors.
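For reference, here is a minimal sketch of one possible way to handle both cases, assuming zeros should be treated as missing. `SimpleImputer` takes a single `missing_values` marker, so the sketch first turns zeros into NaN and then imputes NaN. The toy `df` is hypothetical and only stands in for the real Age column; the double brackets keep the input 2-D, which the imputer expects.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the real Age column.
df = pd.DataFrame({"Age": [31.0, 79.0, 0.0, 40.0, np.nan]})

# Convert zeros to NaN so that only one missing-value marker remains.
# Double brackets select a one-column DataFrame (2-D), not a Series (1-D).
age = df[["Age"]].replace(0, np.nan)

# Impute NaN with the mean of the remaining observed values.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns an array of shape (n_rows, 1); flatten it
# before assigning it back to the single column.
df["Age"] = imputer.fit_transform(age).ravel()
```

With this toy data the observed values are 31, 79 and 40, so both the zero and the NaN are replaced by their mean.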
My data looks like this (the Age column):
Index | Age |
---|---|
0 | 31.0 |
1 | 79.0 |
2 | 53.0 |
3 | 40.0 |
4 | 55.0 |
... | ... |
44872 | NaN |
44873 | NaN |
44874 | NaN |
44875 | NaN |
44876 | NaN |
Is it possible that numpy and pandas are getting mixed up? I used this to split my data into training and testing:
from sklearn.model_selection import train_test_split
dep_var = ['is_overdue']
features = model_data2.columns
features = features.drop(dep_var)
print(features)
X = model_data2[features].values
Y = model_data2[dep_var].values
split_test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=split_test_size, random_state=42)
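To illustrate the numpy/pandas question, here is a rough sketch of the same split without `.values`, so that `X_train` stays a DataFrame and label-based indexing such as `X_train['Age']` remains available. The small `model_data2` below is a hypothetical stand-in for the real data, and `age_pos` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real model_data2.
model_data2 = pd.DataFrame({
    "Age": [31.0, 79.0, 53.0, 40.0, 55.0, np.nan, 0.0, 62.0, 28.0, 47.0],
    "is_overdue": [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

dep_var = ["is_overdue"]
features = model_data2.columns.drop(dep_var)

# Without .values, X and Y stay DataFrames, and so do the splits.
X = model_data2[features]
Y = model_data2[dep_var]

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

# Label-based access works on a DataFrame.
age_by_label = X_train["Age"]

# A plain NumPy array has no column labels, so it needs a positional
# slice; age_pos:age_pos + 1 keeps the result 2-D.
age_pos = X_train.columns.get_loc("Age")
age_by_position = X_train.values[:, age_pos:age_pos + 1]
```

Once `.values` is applied, the data is a NumPy array, so string labels like `'Age'` no longer work as indices and only positional slices do.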
I'd greatly appreciate the help.