Create custom parameter to find outliers in pandas dataframe

Question

I have two dataframes that I built using pandas. If you look at the graph below, you can see that both dataframes follow pretty much the same pattern. I want pandas to tell me when my data falls outside a certain range. For example, say I want to know where on the x axis the data falls below 2 or above 4 on the y axis. I know that I can get pandas to eliminate outliers using a standard deviation curve, and I'm also able to write the outliers out to an Excel file. But that won't work for this data: I don't want to remove any data, I just want to know where all of the outliers are. I've tried creating a Boolean index like this, df4[(df4 < 2) | (df4 > 4)], but this just erases the data values below 2 and above 4. My question is this: how can I set up my own parameter to determine outliers using pandas without removing data?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
plt.style.use("dark_background")
plt.style.use("seaborn-bright")



# e and t are data objects loaded elsewhere; their definitions are not shown here.
x4 = e[0].time[:47172]
y4 = e[0].data.f[:47172]

x6 = t[0].time[:47211]
y6 = t[0].data.f[:47211]

df4 = pd.DataFrame({'Time': x4, 'Data': y4})
df6 = pd.DataFrame({'Time': x6, 'Data': y6})
plt.xlabel('Relative Time in Seconds', fontsize=12)
plt.ylabel('Data', fontsize=12)
plt.grid(linestyle = 'dashed')

plt.plot(x4, y4)
plt.plot(x6, y6)
plt.show()

Oleg Medvedyev · Accepted Answer · 2017-08-07T14:24:49.940

0

You actually did it already. When you do df4[(df4 < 2) | (df4 > 4)], it does not "erase" data: the expression returns a new object and leaves df4 itself unchanged, so you are only looking at the values that satisfy the criteria (with a whole-dataframe mask like this one, the cells that do not match show up as NaN). If you want to see the whole dataframe, you can just add a new column:

df4['outlier'] = (df4['Data'] < 2) | (df4['Data'] > 4)

Then you can see the whole dataframe by simply evaluating df4, and the outlier column will be True for outliers. If you want to look only at outliers, use df4[df4.outlier]; for non-outliers, use df4[~df4.outlier]. Likewise, you can even color-code outliers on your plot, using the outlier column to choose the color.
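As a rough, self-contained sketch of the approach above: the numbers below are invented stand-ins for the asker's data, and the names lower, upper, masked, outliers and non_outliers are illustrative only. It also contrasts the whole-dataframe mask from the question with the single-column mask.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the asker's data (illustrative only).
df4 = pd.DataFrame({
    "Time": np.arange(6, dtype=float),
    "Data": [3.0, 1.5, 2.5, 4.2, 3.9, 0.7],
})

# Bounds from the question: below 2 or above 4 counts as an outlier.
lower, upper = 2, 4

# A whole-frame mask has one boolean per cell; indexing with it keeps every
# row and shows NaN wherever the condition is False.
masked = df4[(df4 < lower) | (df4 > upper)]

# A single-column mask has one boolean per row, so it fits in a new column.
df4["outlier"] = (df4["Data"] < lower) | (df4["Data"] > upper)

outliers = df4[df4["outlier"]]        # flagged rows only
non_outliers = df4[~df4["outlier"]]   # everything else

print(df4)
```

Keeping the bounds in variables makes it easy to try other ranges without touching the rest of the code; df4 itself keeps every row either way.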

edited Aug 07 '17 at 14:24

answered Aug 07 '17 at 14:11

Oleg Medvedyev


When I type `df4['outlier'] = (df4 < 2 ) | (df4 > 4)` after `df4 = pd.DataFrame({'Time': x4, 'Data': y4})` I get an error: ValueError: Wrong number of items passed 2, placement implies 1 – eliza.b Aug 07 '17 at 14:22
Sure. You have to specify what column should be used in logical comparison. So if you want it to be based on "Data" then it should be `(df4["Data"] < 2 ) | (df4["Data"] > 4)` – Oleg Medvedyev Aug 07 '17 at 14:26
That works now! Your last sentence is interesting to me though. I hadn't even considered highlighting them on the graph. How can I use the outlier column as a color indicator? – eliza.b Aug 07 '17 at 14:27
There are several different ways to do it. One would be to use scatter plots and plot outliers and non-outliers separately, like in this example: https://stackoverflow.com/questions/40333033/how-to-change-outliers-to-some-other-colors-in-a-scatter-plot – Oleg Medvedyev Aug 07 '17 at 16:44
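One possible way to use the outlier column for highlighting, sketched under some assumptions: the sine-shaped signal is synthetic, not the asker's data, and drawing the full series as a line with only the flagged points overlaid as a scatter is just one variation on the "plot them separately" idea, not necessarily what the linked example does.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; remove it to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic signal standing in for the asker's data (illustrative only).
time = np.linspace(0, 10, 200)
df4 = pd.DataFrame({"Time": time, "Data": 3 + 1.5 * np.sin(time)})
df4["outlier"] = (df4["Data"] < 2) | (df4["Data"] > 4)

fig, ax = plt.subplots()

# Draw the whole series first, then overlay only the flagged points in red.
ax.plot(df4["Time"], df4["Data"], color="gray", linewidth=1, label="data")
flagged = df4[df4["outlier"]]
ax.scatter(flagged["Time"], flagged["Data"], color="red", s=12, zorder=3, label="outlier")

# Dashed guides at the two bounds make the cutoffs visible.
ax.axhline(2, linestyle="dashed", color="gray")
ax.axhline(4, linestyle="dashed", color="gray")

ax.set_xlabel("Relative Time in Seconds")
ax.set_ylabel("Data")
ax.legend()
```

To view the result, call plt.show() with an interactive backend or write it to disk with fig.savefig(...).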
