Remove outliers (+/- 3 std) and replace with np.nan in Python/pandas

Question

I have seen several solutions that come close to solving my problem

but they have not helped me succeed thus far.

I believe that the following solution is what I need, but continue to get an error (and I don't have the reputation points to comment/question on it): link

(I get the following error, but I don't understand where to .copy() or add an "inplace=True" when administering the following command df2=df.groupby('install_site').transform(replace):

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: link

SO, I have attempted to come up with my own version, but I keep getting stuck. Here goes.

I have a data frame indexed by time with columns for site (string values for many different sites) and float values.

time_index            site       val

I would like to go through the 'val' column, grouped by site, and replace any outliers (those +/- 3 standard deviations from the mean) with a NaN (for each group).

When I use the following function, I cannot index the data frame with my vector of True/Falses:

def replace_outliers_with_nan(df, stdvs):
    dfnew=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        dftmp = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(dftmp-dftmp.mean())<=(stdvs*dftmp.std())] #boolean vector of T/F's
        dftmp[idx==False]=np.nan  #this is where the problem lies, I believe
        dfnew[col] = dftmp
    return dfnew

In addition, I fear the above function will take a very long time on 7 million+ rows, which is why I was hoping to use the groupby function option.

That error you get is just a warning. It seems to sometimes come up even when the operation succeeds. Did you check whether using that method actually works despite the message? — BrenBarn, Apr 20 '15 at 06:02

score 6 · Accepted Answer · answered Apr 20 '15 at 06:53

6

If I have understood you right, there is no need to iterate over the columns. This solution replaces all values which deviates more than three group standard deviations with NaN.

def replace(group, stds):
    group[np.abs(group - group.mean()) > stds * group.std()] = np.nan
    return group

# df is your DataFrame
df.loc[:, df.columns != group_column] = df.groupby(group_column).transform(lambda g: replace(g, 3))

answered Apr 20 '15 at 06:53

RickardSjogren

4,070
3
17
26

Thanks so much @RickardSjogren. This works great! I do get the indexing-view-versus-copy error, but it works in the end. Much appreciated!! – Nicole Goebel Apr 21 '15 at 19:52
could you suggest a way to apply this independently to each column of a dataframe (i.e., replace values that are > 3std from the mean for each column)? I am struggling to get it working and I keep ending up with an empty dataframe. – Nicole Goebel Jun 12 '15 at 17:58
I was hoping you could help me with the above, @RickardSjogren – Nicole Goebel Jun 12 '15 at 18:54
This solution should perform the transformation column-wise. A dataframe filled with NaN:s might be due to missing values in your original data which messes up the `mean` and `std`-calls. Without seeing your data I'm not able to troubleshoot though. – RickardSjogren Jun 15 '15 at 08:31
2

How is `group_column` defined? – Clayton Jun 23 '15 at 15:05
It's the identifier of the column with group belongings which you want to use, e.g. "class". – RickardSjogren Jun 23 '15 at 16:56

Remove outliers (+/- 3 std) and replace with np.nan in Python/pandas

1 Answers1

Linked