Filtering outliers from DataFrame

Question

I have a big problem filtering my data. I've read a lot here on Stack Overflow and on other pages and tutorials, but I could not solve my specific problem. The first part of my code, where I load my data into Python, looks as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model


spotmarket = pd.read_excel("./data/external/Spotmarket_dhp.xlsx", index=True)

r = spotmarket['Price'].pct_change().dropna()

returns = 100 * r
df = pd.DataFrame(returns)

The Excel table has 43,000 values in one column and contains the hourly prices. I use this data to calculate the percentage change from hour to hour, and the problem is that there are sometimes big changes of between 1,000% and 40,000%. The DataFrame looks as follows:

df
Out[12]: 
              Price
1         20.608229
2         -2.046870
3          6.147789
4         16.519258
             ...
43827    -16.079874
43828     -0.438322
43829    -40.314465
43830   -100.105374
43831    700.000000
43832    -62.500000
43833 -40400.000000
43834      1.240695
43835     52.124183
43836     12.996778
43837    -17.157795
43838    -30.349971
43839      6.177924
43840     45.073701
43841     76.470588
43842      2.363636
43843     -2.161042
43844     -6.444781
43845    -14.877102
43846      6.762918
43847    -38.790036
[43847 rows x 1 columns]

I want to exclude these outliers. I've tried different approaches, such as calculating the mean and the std and excluding all values that are more than three times the std away from the mean. It works for a small part of the data, but for the complete data, the mean and std are both NaN. Does anyone have an idea how I can filter my DataFrame?

Did you try [this](https://stackoverflow.com/a/23200666/2901002)? — jezrael, Jun 09 '18 at 14:34
@jezrael Yes, I've tried this, but it doesn't work. I'm not sure, but it could be that I had a mistake in my references. Could you give me a code example of that approach? — Pyrmon55, Jun 09 '18 at 14:43
Hmm, it looks like a data-dependent issue. Is it possible to share your data, if it is not confidential? Only the Price column is needed; the other columns can be removed. — jezrael, Jun 09 '18 at 14:46
Yes, I can share it with you. Where can I provide the data? — Pyrmon55, Jun 09 '18 at 14:51

score 2 · Accepted Answer · answered Jun 09 '18 at 15:36

I think you need to filter by percentiles, using quantile:

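# hourly percentage change of the price
r = spotmarket['Price'].pct_change() * 100

# first and third quartiles of the returns
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
# lower and upper fences: 1.5 * IQR beyond the quartiles
q1 = Q1-1.5*(Q3-Q1)
q3 = Q3+1.5*(Q3-Q1)

# keep only the rows whose return lies between the fences
df = spotmarket[r.between(q1, q3)]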
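
For illustration, here is a small self-contained sketch of the same quantile approach. The price values are made up (they are not the asker's data), and the names `lower`, `upper` and `filtered` are only chosen for this example.

```python
import pandas as pd

# Hypothetical hourly prices standing in for the 'Price' column.
spotmarket = pd.DataFrame(
    {"Price": [50.0, 51.0, 49.5, 50.5, 0.1, 50.0, 52.0, 51.5, 50.8, 51.2]}
)

# Percentage change from hour to hour; the first value is NaN.
r = spotmarket["Price"].pct_change() * 100

# Quartiles and the usual 1.5 * IQR fences.
q_low = r.quantile(0.25)
q_high = r.quantile(0.75)
iqr = q_high - q_low
lower = q_low - 1.5 * iqr
upper = q_high + 1.5 * iqr

# between() is False for NaN, so the first row is dropped as well.
filtered = spotmarket[r.between(lower, upper)]
print(filtered)
```

Note that the row after a price spike is removed too: the price of 50.0 following 0.1 is ordinary, but the return leading to it is extreme.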

jezrael


score 0 · Answer 2 · answered Jun 09 '18 at 16:57

Maybe you should first discard all the values that cause those fluctuations and then create the DataFrame. One way is to use filter().
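
One possible reading of this suggestion, sketched under assumptions: if some prices are zero, `pct_change()` produces infinite values, which would be one plausible (unconfirmed) reason for the mean and std being NaN on the full data. The prices below are made up, and because pandas' `DataFrame.filter` selects by label rather than by value, the sketch uses a boolean mask instead.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly prices (not the asker's data): a smooth series with
# two zero prices and one negative price injected.
prices = pd.Series(50.0 + np.sin(np.arange(200) / 5.0), name="Price")
prices.iloc[50] = 0.0
prices.iloc[120] = 0.0
prices.iloc[121] = -5.0

returns = prices.pct_change() * 100

# Dividing by a zero price yields +inf or -inf; with both present,
# the mean is no longer a finite number.
print(returns.mean(), returns.std())

# Discard the non-finite values first, then compute the statistics.
clean = returns.replace([np.inf, -np.inf], np.nan).dropna()
mean, std = clean.mean(), clean.std()

# Keep only returns within three standard deviations of the mean.
filtered = clean[(clean - mean).abs() <= 3 * std]
print(len(clean), len(filtered))
```

Since the mean and std are themselves pulled around by extreme values, the quantile-based fences from the accepted answer are usually less sensitive to them.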

Tayyab

