I have a dataframe, and have been asked to replace the outliers in the dataframe with the theoretical min/max. However, I'm not exactly sure what that means.
I think I have calculated the theoretical min/max with the following code:
import pandas as pd

outliers = pd.DataFrame(columns=['min', 'count below', 'max', 'count above'])
for col in df:
    # Skip non-numeric columns, all-NaN columns, and binary 0/1 columns
    if pd.api.types.is_numeric_dtype(df[col]) and (len(df[col].value_counts()) > 0) and not all(df[col].value_counts().index.isin([0, 1])):
        q1 = df[col].quantile(.25)
        q3 = df[col].quantile(.75)
        # Theoretical min/max: the 1.5 * IQR fences (renamed to avoid shadowing the built-ins min/max)
        lower = q1 - (1.5 * (q3 - q1))
        upper = q3 + (1.5 * (q3 - q1))
        outliers.loc[col] = (lower, df[col][df[col] < lower].count(), upper, df[col][df[col] > upper].count())
These are a few rows of my dataframe:
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0   28    1   2       130   132    0        2      185      0      0.0    NaN  NaN   NaN    0
1   29    1   2       120   243    0        0      160      0      0.0    NaN  NaN   NaN    0
2   29    1   2       140   NaN    0        0      170      0      0.0    NaN  NaN   NaN    0
3   30    0   1       170   237    0        1      170      0      0.0    NaN  NaN     6    0
4   31    0   2       100   219    0        1      150      0      0.0    NaN  NaN   NaN    0
5   32    0   2       105   198    0        0      165      0      0.0    NaN  NaN   NaN    0
...
fbs also contains 1 for a few values.
exang also contains 1 for a few values.
oldpeak also contains a few floats between 0 and 3.
slope is mostly NaN but also contains 1 and 2 for some values.
thal is mostly NaN but also contains 3, 6, and 7 for some values.
num also contains 1 for almost half of the values.
So, now I'm not sure how to replace the outliers with the theoretical min/max.
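One common reading of "replace the outliers with the theoretical min/max" is clipping: any value below the lower fence (Q1 - 1.5 * IQR) is set to that fence, and any value above the upper fence (Q3 + 1.5 * IQR) is set to that fence. Whether this is what was actually asked for is an assumption. Below is a minimal sketch of that reading; the helper name clip_to_iqr_fences, the parameter k, and the small demo frame are hypothetical and only for illustration. It applies the same column filter as the loop above (numeric, not all NaN, not purely 0/1).

```python
import pandas as pd

def clip_to_iqr_fences(df, k=1.5):
    """Return a copy of df with outliers replaced by the k * IQR fences."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Only numeric columns can have IQR-based outliers
        if not pd.api.types.is_numeric_dtype(s):
            continue
        values = s.dropna()
        # Skip all-NaN columns and binary 0/1 indicator columns
        if values.empty or values.isin([0, 1]).all():
            continue
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        # clip() replaces values outside [lower, upper] with the bounds; NaN stays NaN
        out[col] = s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out

# Hypothetical demo data: one artificially extreme chol value (900) and a binary column
demo = pd.DataFrame({
    "chol": [132.0, 243.0, None, 237.0, 219.0, 198.0, 900.0],
    "fbs": [0, 0, 0, 0, 0, 1, 0],
})
clipped = clip_to_iqr_fences(demo)
print(clipped)
```

In the demo, the extreme chol value is pulled down to the upper fence, the missing value stays NaN, and the binary fbs column is left untouched. Note that clipping an integer column with fractional fences may change its dtype to float.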