I'm trying to remove outliers from a column in a pandas DataFrame.

Here's what my variable originally looks like (with the obvious outliers):

enter image description here

I then decided to delete any value that differs from the previous one by more than 3 (since I know it shouldn't be possible to vary that much):

This works, and gives me NaN to replace the spikes:

enter image description here

But whenever I try to replace the now-missing values with the previous observations, I somehow get some spikes back!

enter image description here

Would anyone know what I'm doing wrong?

Here's the whole code (in a while loop that runs indefinitely):

import numpy as np
import pandas as pd
df = pd.DataFrame({'soc': [38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 127.0, 127.0, 66.48, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 127.0, 55.8, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0]})
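# Repeat until no value differs from its predecessor by more than 3
while (abs(df['soc'].diff()) > 3).any():
    # Blank out every value that jumps by more than 3 relative to the previous row
    df['soc'] = np.where(abs(df['soc'].diff()) > 3, np.nan, df['soc'])
    # Fill each NaN with the last valid observation
    df['soc'] = df['soc'].ffill()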
Laurent
  • Could you please add the code, so we can see the order in which it is executed? Not single lines, but the whole block of code, please. Also, in the second picture the outlier is not removed: I can see a blue spot with y ~= 120+ – Celius Stingher Nov 10 '20 at 16:28
  • Please edit your question to include a minimal reproducible example, including code, sample input data and expected output. It's nearly impossible to tell what's actually happening from a picture of a plot. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Nov 10 '20 at 16:32
  • Made it more reproducible with sample data. – Laurent Nov 10 '20 at 16:53

1 Answer

I believe you are not deleting the values with a deviation of more than 3, because in the second plot I can still see a dot that shouldn't show up. Maybe you are also assigning to the wrong column. Here is a generic, working example of what you intend to do:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[100,110,105,104,103,102,101]})
# Replace values that differ from the previous row by more than 3 with NaN
df['A'] = np.where(abs(df['A'].diff()) > 3,np.nan,df['A'])
df['A'] = df['A'].ffill()

In this example, 110 and 105 are removed because each differs from the value before it by more than 3, and both are replaced with 100. The output:

       A
0  100.0
1  100.0
2  100.0
3  104.0
4  103.0
5  102.0
6  101.0
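
One caveat worth checking, sketched here on made-up data (the series `s` below is an assumption, not the asker's column): when two identical outliers sit next to each other, the second has a difference of 0 from the first, so a single pass keeps it, and the forward fill then copies it over the next value that was flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical series: flat at 38 with two consecutive spikes at 127.
s = pd.Series([38.0, 38.0, 127.0, 127.0, 38.0, 38.0])

# Same rule as above: blank out values that jump by more than 3
# relative to the previous row, then forward-fill the gaps.
flagged = s.diff().abs() > 3
one_pass = pd.Series(np.where(flagged, np.nan, s)).ffill()

print(flagged.tolist())   # [False, False, True, False, True, False]
print(one_pass.tolist())  # [38.0, 38.0, 38.0, 127.0, 127.0, 38.0]
```

Only the first 127 and the first 38 after the spike are flagged, so after the fill the spike has moved one row to the right instead of disappearing.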
Celius Stingher
  • Thanks, I edited my question to add data and make it reproducible. I adopted your `np.where(abs(df['A'].diff()) > 3,np.nan,df['A'])` bit because it was nicer than my original code, but it still doesn't fix the issue. – Laurent Nov 10 '20 at 16:52
  • I believe the problem might be that there are consecutive outliers (indexes 88 & 89), so one gives me a variation above 3, the next one doesn't. That's the reason why I included this in a loop, so the 2nd outlier should be deleted on the 2nd iteration – Laurent Nov 10 '20 at 17:07
  • You are right. How about normalizing the column and treating values more than 3 standard deviations from the mean as outliers? – Celius Stingher Nov 10 '20 at 17:18
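
A minimal sketch of the normalization idea from the last comment, on made-up data (the 50-row column, the `soc_clean` name and the threshold of 3 standard deviations are all assumptions for illustration). Each value is compared with the column as a whole rather than with its neighbour, so consecutive spikes are flagged together.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a flat signal at 38 with two consecutive spikes at 127.
values = [38.0] * 24 + [127.0, 127.0] + [38.0] * 24
df = pd.DataFrame({'soc': values})

# Z-score: distance from the column mean in units of standard deviation.
z = (df['soc'] - df['soc'].mean()) / df['soc'].std()

# Flag values more than 3 standard deviations from the mean (assumed threshold).
is_outlier = z.abs() > 3

# Blank out the flagged values, then carry the last good observation forward.
df['soc_clean'] = df['soc'].mask(is_outlier).ffill()
```

This is not a drop-in fix: smaller excursions can fall below the threshold, and large or frequent spikes inflate the standard deviation itself, so the cutoff would need checking against the real data.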