
I have a dataframe, and have been asked to replace the outliers in the dataframe with the theoretical min/max. However, I'm not exactly sure what that means.

I think I have calculated the theoretical min/max:

import pandas as pd

# One row per numeric, non-binary column: the IQR fences and the number of values outside them
outliers = pd.DataFrame(columns=['min', 'count below', 'max', 'count above'])

for col in df:
  if pd.api.types.is_numeric_dtype(df[col]) and (len(df[col].value_counts()) > 0) and not all(df[col].value_counts().index.isin([0, 1])):

    q1 = df[col].quantile(.25)
    q3 = df[col].quantile(.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # named lower/upper to avoid shadowing the built-in min/max
    upper = q3 + 1.5 * iqr

    outliers.loc[col] = (lower, (df[col] < lower).sum(), upper, (df[col] > upper).sum())

These are a few rows of my dataframe:

    age sex cp  trestbps    chol    fbs restecg thalach exang   oldpeak slope   ca  thal    num
  0 28  1   2        130    132       0       2 185         0   0.0       NaN   NaN  NaN    0
  1 29  1   2        120    243       0       0 160         0   0.0       NaN   NaN  NaN    0
  2 29  1   2        140    NaN       0       0 170         0   0.0       NaN   NaN  NaN    0
  3 30  0   1        170    237       0       1 170         0   0.0       NaN   NaN    6    0
  4 31  0   2        100    219       0       1 150         0   0.0       NaN   NaN  NaN    0
  5 32  0   2        105    198       0       0 165         0   0.0       NaN   NaN  NaN    0
  .
  .
  .

fbs also contains 1 for a few values

exang also contains 1 for a few values

oldpeak also contains a few floats between 0 and 3

slope is mostly NaN but also contains 1 and 2 for some values

thal is mostly NaN but also contains 3, 6, and 7 for some values

num also contains 1 for almost half of the values

So, now I'm not sure how to replace the outliers with the theoretical min/max.

mathmajor
  • What's the "theoretical min/max"? That depends on the "theory" and requires knowledge of what the variables mean and of the subject area. What's the max age? The min age is 0, but if those are car drivers then it might be higher. – Josef Apr 14 '20 at 17:33

1 Answer


You're going to have to figure out what constitutes an outlier for your purposes. I'm a programmer not a statistician, but I suspect anything that falls outside the theoretical min/max fits the bill.

As for actually replacing the outliers... you may want to check out the answers to the post "Conditional Replace Pandas".

Having said that, the code below might get you going.

# Inside the loop over columns: the second argument to .loc is the column label, not the column's values
df.loc[df[col] > outliers.loc[col, 'max'], col] = outliers.loc[col, 'max']
df.loc[df[col] < outliers.loc[col, 'min'], col] = outliers.loc[col, 'min']
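For what it's worth, here is a self-contained sketch of the same idea using `Series.clip`, which caps values at a lower and upper bound in one call. It assumes that "theoretical min/max" means the 1.5 × IQR fences from the question, which may not be what whoever set the task had in mind. The helper names `iqr_fences` and `clip_outliers` and the small `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def iqr_fences(s, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR. NaN is ignored by quantile()."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(df, k=1.5):
    """Return a copy of df with numeric, non-binary columns capped at their IQR fences."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if not pd.api.types.is_numeric_dtype(s):
            continue
        values = s.dropna().unique()
        # Skip all-NaN columns and 0/1 indicator columns, as in the question.
        if len(values) == 0 or set(values) <= {0, 1}:
            continue
        lower, upper = iqr_fences(s, k)
        # clip() leaves NaN untouched and replaces anything outside [lower, upper].
        out[col] = s.clip(lower=lower, upper=upper)
    return out


# Hypothetical data loosely shaped like the question's columns.
demo = pd.DataFrame({
    "chol": [132.0, 243.0, np.nan, 237.0, 219.0, 198.0, 600.0],
    "fbs": [0, 0, 0, 0, 0, 0, 1],
})
clipped = clip_outliers(demo)
print(clipped)
```

In this toy frame the 600 in `chol` is pulled down to the upper fence and the 132 is pulled up to the lower fence, while the missing value and the binary `fbs` column are left alone. Whether capping is the right treatment at all is a statistics question, not a pandas one.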

Re-reading the question, it sounds like you may be looking for more information on what constitutes an outlier, and on when you have enough data for the result to be statistically significant. If that's the case, please consider adding some additional tags to your question.