Deleting entire rows of a dataset for outliers found in a single column

Question

I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.

To do so, I followed the instructions given by @tanemaki in Detect and exclude outliers in Pandas data frame, which enables the deletion of entire rows that hold at least one outlier value.

In the post I linked, the same question was asked, but was not answered.

The problem is that I only want the outliers to be searched in a single column.

So, for example, my dataframe looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0        
    4        40.31        9.3        
    5        26.21       15.6
    6        26.59       17.9

For example, there are two anomalies in the data:

The Temperature value in row [4]
The Date value in row [5]

So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.

The outlier code I am using is:

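import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_excel(r'/home/.../myfile.xlsx')
df[pd.isnull(df)] = 0  # replace missing values with 0
# keep only rows where every column has |z-score| < 4 (approach from the linked answer)
dfn = df[(np.abs(stats.zscore(df)) < 4).all(axis=1)]
print(dfn)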

And my resulting data frame currently looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0               
    6        26.59       17.9

In case I am not getting my message across, the desired output would be:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0  
    5        26.21       15.6         
    6        26.59       17.9

Any pointers would be of great help. Thanks!

score 3 · Accepted Answer · answered May 15 '20 at 08:52

You can always limit the stats.zscore operation on only the Temperature column instead of the whole df. Like this maybe:

In [573]: dfn = df[(np.abs(stats.zscore(df['Temperature']))<4)]

In [574]: dfn
Out[574]: 
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
5        26.21  15.6
6        26.59  17.9
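
The question mentions the median absolute deviation (MAD), while the snippet above filters on z-scores. A minimal sketch of a MAD-based filter restricted to the Temperature column might look like the following. The sample frame is rebuilt from the table in the question, the name `dfn_mad` is made up for illustration, and the 0.6745 scaling with a 3.5 cutoff is a commonly used convention for the modified z-score, not something taken from the original post.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "Temperature": [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
        "Date": [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
    },
    index=[1, 2, 3, 4, 5, 6],
)

temp = df["Temperature"]
median = temp.median()
mad = (temp - median).abs().median()  # median absolute deviation

# Modified z-score; 0.6745 and the 3.5 cutoff are assumed conventions.
modified_z = 0.6745 * (temp - median) / mad

# Only the Temperature column decides which rows are kept.
dfn_mad = df[modified_z.abs() < 3.5]
print(dfn_mad)
```

If the MAD of the column is zero (more than half of the values identical), the division above breaks down, so that case would need separate handling.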

answered May 15 '20 at 08:52

Mayank Porwal


How would you go about doing the same, but finding outliers on the basis of two columns instead of one? I tried passing (['Temperature'],['Date']) and had no luck there. Thanks! – enricw May 25 '20 at 06:13
You can do this: `df[(df['Temperature'] < 4) & (df['Date'] == '2020-05-20')]`. This will filter the dataframe based on both conditions. – Mayank Porwal May 25 '20 at 07:05
Is the solution in your original answer specific to this situation? I am trying to use it for another spreadsheet which has a higher number of columns and it does not work (I edited the post with the problem) – enricw Jun 22 '20 at 08:07
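
For the two-column case raised in the comments, one option is to pass a subset of columns to `stats.zscore` and keep the `.all(axis=1)` reduction from the question's code. This is only a sketch: the sample frame is rebuilt from the question's table, and the `cols` list, the name `dfn_two`, and the threshold of 2 are illustrative choices rather than anything from the original thread.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "Temperature": [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
        "Date": [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
    },
    index=[1, 2, 3, 4, 5, 6],
)

cols = ["Temperature", "Date"]  # hypothetical choice of columns to check
threshold = 2                   # illustrative cutoff for this tiny sample

# Column-wise z-scores for the selected columns only.
z = np.abs(stats.zscore(df[cols].to_numpy(), axis=0))

# Keep a row only if every selected column is below the threshold.
dfn_two = df[(z < threshold).all(axis=1)]
print(dfn_two)
```

Because the mask is built only from `cols`, any other columns in a wider spreadsheet are carried along without affecting which rows are kept. Note that a NaN inside one of the selected columns makes every z-score in that column NaN under the default settings, so every row would fail the comparison.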

score 3 · Answer 2 · answered May 15 '20 at 08:54

At the moment, you're calculating the zscores for the whole dataframe and then filtering the dataframe with those calculated scores; what you want to do is just apply the same idea to one column.

Instead of

dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)]

You want to have

df[np.abs(stats.zscore(df["Temperature"])) < 4]

As a side note, I found that I was unable to get your example results by comparing the zscores to 4; I had to switch it down to 2.
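
The threshold observation can be checked on the six sample rows. With the population standard deviation that `stats.zscore` uses by default (`ddof=0`), no |z| in a sample of n values can exceed sqrt(n - 1), which is about 2.24 for six rows, so a cutoff of 4 cannot remove anything here. The sketch below rebuilds the frame from the question's table; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "Temperature": [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
        "Date": [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
    },
    index=[1, 2, 3, 4, 5, 6],
)

# Absolute z-scores of the Temperature column only.
z_temp = np.abs(stats.zscore(df["Temperature"].to_numpy()))
print(z_temp.round(2))

# Count how many rows a cutoff of 4 would drop on this sample.
n_dropped_at_4 = int((z_temp >= 4).sum())
print(n_dropped_at_4)

# A cutoff of 2 is low enough to flag the 40.31 reading.
dfn = df[z_temp < 2]
print(dfn)
```

On a larger dataset the bound is much looser, so a cutoff of 4 may well be reasonable there.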

answered May 15 '20 at 08:54

Peritract


Is the solution in your answer specific to this situation? I am trying to use it for another spreadsheet which has a higher number of columns and it does not work (I edited the post with the problem) – enricw Jun 22 '20 at 08:11
