I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.
To do so, I followed the instructions given by @tanemaki in "Detect and exclude outliers in Pandas data frame", which delete every row that holds at least one outlier value in any column.
The problem is that I only want to search for outliers in a single column.
The same question was asked in the post I linked, but it was not answered.
So, for example, my dataframe looks like:
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
4        40.31   9.3
5        26.21  15.6
6        26.59  17.9
There are two anomalies in this data:
- The Temperature value in row [4]
- The Date value in row [5]
So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.
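To make that concrete, here is a minimal sketch of the behaviour I am after: the z-score is computed for the Temperature column only, and that mask is used to filter whole rows. The sample frame is rebuilt by hand, and the names `df_demo`, `threshold` and `z_temp` are placeholders of mine. The cutoff of 2 is also an assumption: with only six rows, the largest possible |z-score| is about 2.24, so a cutoff of 4 could never flag anything in this sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data from this question, rebuilt by hand (index 1..6).
df_demo = pd.DataFrame(
    {
        "Temperature": [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
        "Date": [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
    },
    index=range(1, 7),
)

threshold = 2.0  # assumed cutoff, chosen for this tiny sample

# Absolute z-scores of the Temperature column only; Date is never inspected.
z_temp = np.abs(stats.zscore(df_demo["Temperature"].to_numpy()))

# Keep the rows whose Temperature z-score is below the cutoff.
df_filtered = df_demo[z_temp < threshold]
print(df_filtered)
```

Because the mask is built from one column, the jump in Date at row 5 cannot cause that row to be dropped.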
The outlier code I am using is:
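import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_excel(r'/home/.../myfile.xlsx')
df = df.fillna(0)  # replace missing values with 0 before computing z-scores
# Keep only rows where every column has |z-score| < 4 (approach from @tanemaki's answer)
dfn = df[(np.abs(stats.zscore(df)) < 4).all(axis=1)]
print(dfn)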
And my resulting data frame currently looks like:
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
6        26.59  17.9
In case I am not getting my message across, the desired output would be:
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
5        26.21  15.6
6        26.59  17.9
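Since I mentioned the median absolute deviation at the top, here is a rough sketch of how a MAD-based filter on a single column might produce that output. It is only an illustration under my own assumptions: the sample frame is typed in by hand, the names `df_demo`, `modified_z` and `cutoff` are made up, and the 0.6745 scaling with a cutoff of 3.5 is the commonly quoted "modified z-score" convention rather than anything taken from the linked answer.

```python
import pandas as pd

# Sample data from this question, rebuilt by hand (index 1..6).
df_demo = pd.DataFrame(
    {
        "Temperature": [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
        "Date": [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
    },
    index=range(1, 7),
)

temp = df_demo["Temperature"]
median = temp.median()

# Median absolute deviation of the Temperature column.
mad = (temp - median).abs().median()

# Modified z-score: 0.6745 rescales the MAD so that it is comparable to a
# standard deviation when the data are roughly normal.
modified_z = 0.6745 * (temp - median) / mad

cutoff = 3.5  # assumed cutoff for the modified z-score

# Filter rows using the Temperature column only; Date is left untouched.
df_mad = df_demo[modified_z.abs() <= cutoff]
print(df_mad)
```

On this sample only row 4 exceeds the cutoff, so rows 1, 2, 3, 5 and 6 are kept. Note that the sketch would divide by zero if more than half of the Temperature values were identical, since the MAD would then be 0.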
Any pointers would be of great help. Thanks!