1

I am having the below data frame which is a time-series data and I process this information to input to my prediction models.

df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None)],
                   "value":[5.4,5.1,100.8,20.12,21.5,80.08,150.09,160.12,20.06]

                  })

From this, I take the mean of the value for each timestamp and will send the value as the input to the predictor. But currently, I am using just thresholds to filter out the outliers,but those seem to filter out real vales and also not filter some outliers .

For example, I kept

df[(df['value']>3 )& (df['value']<120 )]

and then this does not filter out

2019-01-01 01:00:00 100.8

which is an outlier for that timestamp and does filter out

2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12

which are not outliers for that timestamp.

So how do I filter out outliers for each timestamp based on which does not fit that group?

Any help is appreciated.

Ricky
  • 2,662
  • 5
  • 25
  • 57
  • 1
    what is an outlier for you? I mean what is your statisical measurement? confidence interval? give us a key word for it – PV8 Jun 17 '20 at 13:00
  • whether a value differs much from the rest of the group is considered as an outlier in this case. For example `2019-01-01 01:00:00` most values are around 5 but one is round 100 so that is an outlier for me – Ricky Jun 18 '20 at 04:04
  • that is still a broad question, but lets save it as confidence interval. – PV8 Jun 18 '20 at 12:14

1 Answers1

1

Ok, let's assume you are searching for the confidence interval to detect outlier.

Then you have to get the mean and the confidence intervals for each timestamp group. Therefore you can run:

stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])
ci95_hi = []
ci95_lo = []
import math
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
df = pd.merge(df, stats, how='left', on='timestamp')

which leads to the following output: enter image description here

then you can adjust a filter column:

import numpy as np
df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1, np.where(df['value']<= df['ci95_lo'], 1, 0))

then everythign with a 1 in the column outlier is an outlier. You can adjust the values with 1.96 to play a little with it.

The outcome looks like: enter image description here

PV8
  • 5,799
  • 7
  • 43
  • 87