How to replace scalar outliers (> X * standard deviation from mean) in numerical features of mixed Pandas dataframe with numpy not a number np.NaNs?

Question

how can I get this to work while keeping the entire rows intact?

I only want to replace the individual scalar outlier cell values.

Please help. My current code doesn't seem to have any effect?

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [1, 1, 42000]})
# expected to replace 42000 by np.NaN

def remove_outliers(df):
    from scipy import stats
    for col in df.select_dtypes(include=['number']).columns:
        outliers = df.loc[stats.zscore(df[col]) > remove_outliers_above_absolute_standard_deviation_of, col]
        display("Outliers found: " + str(len(outliers.index)))
        df.loc[stats.zscore(df[col]) > remove_outliers_above_absolute_standard_deviation_of, col] = np.nan
    display("Dataframe after outlier removal: ")
    display(df)
    return df

remove_outliers(df)

Returns identical df without any filtering of outliers:

Thank you.

I already checked https://stackoverflow.com/questions/67884748/how-to-replace-scalar-outliers-x-standard-deviation-from-mean-in-numerical and https://stackoverflow.com/questions/55801017/how-to-replace-outliers-with-nan-while-keeping-row-intact-using-pandas-in-python but it didn't help me. — ABC, Jun 08 '21 at 09:14
Please add a data example (for [tips](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) — nocibambi, Jun 08 '21 at 09:29

score 0 · Answer 1 · answered Jun 08 '21 at 09:38

0

# get the numeric columns as a copy to work on
numerics = df.select_dtypes("number").copy()

# get their means & stds
means, stds = numerics.mean(), numerics.std()

# determine the lower and upper bounds for "outlier"s
factor = 3
lower, upper = means - factor * stds, means + factor * stds

# mask those that are out of (lower, upper) as `np.nan`
numerics[~(lower.lt(numerics) & upper.gt(numerics))] = np.nan

# put the numerics back
df[numerics.columns] = numerics

answered Jun 08 '21 at 09:38

Mustafa Aydın

17,645
4
15
38

Hi Mustafa, thanks for the swift reply. I tried this solution on my example df but it doesn't change the df outliers. I found that already the numerics variable is an empty dataframe when I `display(numerics)` even though I set the dtype via `df.astype({'income': 'float64'}).dtypes` – ABC Jun 08 '21 at 10:40
Hi @BenJordan, If we look at the zcores of the `income`, we get `[-0.70710678, -0.70710678, 1.41421356]`. Therefore for `42_000` to be removed, the `factor` in the above code i.e., threshold could be 1. It is 3 now and with this criterion, it doesn't consider it an outlier. As for the empty `numerics` issue: I can't reproduce. If I copy your `df` and define `numerics` over it, it shows the `income` column. What does `pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 'income': [1, 1, 42000]}).select_dtypes("number")` give? – Mustafa Aydın Jun 08 '21 at 11:33

nocibambi · Answer 2 · 2021-06-08T11:38:41.680

Based on your example, this works.

import pandas as pd
import numpy as np

df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"], "income": [1, 1, 42000]})
display(df)
>>>
    user    income
0   Bob     1
1   Jane    1
2   Alice   42000
>>>
for col in df.select_dtypes('number'):
    df.loc[
        (df[col] > df[col].mean() + df[col].std())
        | (df[col] < df[col].mean() - df[col].std()),
        col,
    ] = np.NaN

display(df)
>>>
    user    income
0   Bob     1.0
1   Jane    1.0
2   Alice   NaN

Although it is not that different than @Mustafa Aydin's answer. So I suspect that either your data has some type issue or zscore does not produce what you expect it should.

With zscore, this produces the same result:

for col in df.select_dtypes('number'):
    df.loc[stats.zscore(df[col]) > 1, col] = np.NaN

score 0 · Answer 3 · answered Jun 08 '21 at 16:24

This should work:

def remove_outliers(df, max_outlier_std):
    for column in df.select_dtypes('number'):
        outlier_condition = (df[column] > df[column].mean() + max_outlier_std*df[column].std()) | (df[column] < df[column].mean() - max_outlier_std*df[column].std())
        outliers_snapshot = pd.DataFrame([df.loc[
            outlier_condition,
            column
        ].copy().rename(f"{column} outliers")]).transpose()
        df.loc[
            outlier_condition,
            column
        ] = np.nan
        display(outliers_snapshot)
        return df

How to replace scalar outliers (> X * standard deviation from mean) in numerical features of mixed Pandas dataframe with numpy not a number np.NaNs?

3 Answers3