Modify outliers caused by sensor-failures in timeseries data

Question

I am working with timeseries data collected from a sensor at 5-minute intervals. Unfortunately, there are cases when the measured value (PV yield in watts) is suddenly 0 or very high, while the values before and after are correct.

My goal is to identify these 'outliers' and (in a second step) calculate the mean of the previous and next value to fix the measured value. I've experimented with two approaches so far, but am receiving many 'outliers' which are not measurement-errors. Hence, I am looking for better approaches.

Try 1: Classic outlier detection with IQR Source

def updateOutliersIQR(group):
  # 'yield' is a Python keyword, so the column needs bracket access
  Q1 = group['yield'].quantile(0.25)
  Q3 = group['yield'].quantile(0.75)
  IQR = Q3 - Q1
  outliers = (group['yield'] < (Q1 - 1.5 * IQR)) | (group['yield'] > (Q3 + 1.5 * IQR))
  print(outliers[outliers])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersIQR)

Try 2: kernel density estimation Source

def updateOutliersKDE(group):
  a = 0.9
  # 'yield' is a Python keyword, so the column needs bracket access;
  # win_type='parzen' requires SciPy to be installed
  r = group['yield'].rolling(3, min_periods=1, win_type='parzen').sum()
  n = r.max()
  outliers = (r > n*a)
  print(outliers[outliers])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersKDE)

Try 3: Median Filter Source (As suggested by Jonnor)

import numpy as np

def median_filter(num_std=3):
  def _median_filter(x):
    _median = np.median(x)
    _std = np.std(x)
    # centre sample of the window (same as x[-3] for a window of 5)
    s = x[len(x) // 2]
    if (s >= _median - num_std * _std and s <= _median + num_std * _std):
      return s
    else:
      return _median
  return _median_filter

# calling the function ('yield' is a Python keyword, so use bracket access)
df['yield'].rolling(5, center=True).apply(median_filter(2), raw=True)

Edit: with try 3 and a window of 5 and std of 3, it finally catches the massive outlier, but it also loses accuracy on the other (non-faulty) sensor measurements.

Are there any better approaches to detect the described 'outliers' or perform smoothing in timeseries data with the occasional sensor measurement issue?

Have you tried a median filter? If there is just a couple of time-steps with errors, try a very short length, like 3 or 5. This can be done with pandas.rolling_median https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_median.html — Jon Nordby, Jun 19 '20 at 15:50
Also, can you plot an example of your errors? Might be easier for people to see what could help in that case — Jon Nordby, Jun 19 '20 at 15:52
You can give a look at [IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) and this [paper](https://www.researchgate.net/publication/224384174_Isolation_Forest). I've used it in many projects. Works even better if you have multiple features — Hugolmn, Jun 21 '20 at 20:19
Thank you for pointing me toward the median filter. Just looking at the median with a window of 5 and std of 2 actually helps to smooth-out most of the erroneous sensor measurements. Is it correct that I now need to basically play around with the window-size and std to find the best fit? Or is there something more systematic I can do? I've updated my question accordingly — casaout, Jun 23 '20 at 13:15

Jon Nordby · Accepted Answer · 2020-06-23T20:44:37.517

Your abnormal values are abnormal in the sense that:

- the value deviates a lot from the values around it
- the value changes very quickly from one time-step to the next

Thus what is needed is a filter that looks at a short time-context to filter these out.

One of the simplest and most effective is the median filter.

# pandas.rolling_median is no longer available in current pandas; use the .rolling() API
filtered = df.rolling(window=5).median()

The longer the window, the stronger the filter.
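As a minimal sketch of the median filter on made-up data (the readings, timestamps and the window of 3 are assumptions for illustration, not values from the question), a centered rolling median removes both a sudden 0 and a sudden spike:

```python
import pandas as pd

# Hypothetical 5-minute PV readings in watts, with two sensor glitches:
# a sudden 0 (position 2) and a sudden very high value (position 5).
index = pd.date_range("2020-06-19 10:00", periods=9, freq="5min")
pv = pd.Series(
    [500.0, 510.0, 0.0, 530.0, 540.0, 9000.0, 560.0, 570.0, 580.0],
    index=index,
)

# Centered rolling median over 3 samples.
# min_periods=1 keeps the first and last sample defined instead of NaN.
filtered = pv.rolling(window=3, center=True, min_periods=1).median()

print(pd.DataFrame({"raw": pv, "filtered": filtered}))
```

Note that the filter rewrites every sample, not just the faulty ones: here the good readings next to each glitch shift by about 10 W, which is the loss of accuracy mentioned in the question.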

An alternative would be a low-pass filter. Though setting an appropriate cutoff frequency can be harder, and it will impose a smoothness onto the signal.
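To illustrate the smoothing side effect, here is a rough sketch using an exponentially weighted moving average as a very simple first-order low-pass filter. The data and `alpha=0.3` are assumptions for illustration; a properly designed filter with a chosen cutoff frequency would behave differently in detail.

```python
import pandas as pd

# Hypothetical 5-minute PV readings in watts with a 0 and a spike.
index = pd.date_range("2020-06-19 10:00", periods=9, freq="5min")
pv = pd.Series(
    [500.0, 510.0, 0.0, 530.0, 540.0, 9000.0, 560.0, 570.0, 580.0],
    index=index,
)

# First-order low-pass: y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
# alpha is an assumed smoothing factor; smaller means stronger smoothing.
smoothed = pv.ewm(alpha=0.3, adjust=False).mean()

print(pd.DataFrame({"raw": pv, "smoothed": smoothed}))
```

The spike is attenuated but not removed, and it leaks into the good samples that follow it, which is why the median filter is usually the better fit for isolated glitches.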

One can of course create more custom filters as well. For example, compute the first-order difference, and reject changes higher than a certain threshold. You can plot a histogram of the differences to determine a threshold. Mark the rejected values as missing (NaN), and then impute the missing values using the median/mean.
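A possible sketch of such a difference-based filter is below. The data and the threshold of 200 W are hypothetical. The two-sided check (a large jump relative to both the previous and the next sample, in the same direction) is an added assumption, so that the good reading right after a glitch is not flagged as well.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute PV readings in watts with a 0 and a spike.
index = pd.date_range("2020-06-19 10:00", periods=9, freq="5min")
pv = pd.Series(
    [500.0, 510.0, 0.0, 530.0, 540.0, 9000.0, 560.0, 570.0, 580.0],
    index=index,
)

# Assumed threshold; in practice pick it from a histogram of pv.diff().abs().
threshold = 200.0

step_in = pv.diff()     # pv[t] - pv[t-1]
step_out = pv.diff(-1)  # pv[t] - pv[t+1]

# A glitch jumps away from both neighbours in the same direction.
is_spike = (
    (step_in.abs() > threshold)
    & (step_out.abs() > threshold)
    & (np.sign(step_in) == np.sign(step_out))
)

# Mark glitches as missing, then fill them from the neighbours.
# For an isolated NaN, linear interpolation equals the mean of the
# previous and next value.
cleaned = pv.mask(is_spike).interpolate(method="linear")

print(pd.DataFrame({"raw": pv, "is_spike": is_spike, "cleaned": cleaned}))
```

Unlike the rolling median, this leaves all unflagged samples untouched, at the cost of having to choose a threshold.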

If your goal is Anomaly Detection, you can also use an Autoencoder. I would expect PV output to have a very strong daily pattern. So training it on daily sequences should work quite well (provided you have enough data). This is much more complicated than a simple filter, but has the advantage of being able to detect many other kinds of anomalies as well, not just the pattern identified here.

I fully agree with what jonnor stated. Regarding the detection, you can simply apply the `pandas.Series.diff()` function. Customized convolutional filters are also helpful. Both depend on thresholds for the decision and reduce the dynamics in the time-series. — Oliver Prislan, Sep 22 '20 at 21:45
