I have data coming from a sensor that I store in a time series.

When I graph it, I get:

raw data chart

This data is supposed to be "continuous", like temperatures, and should not go up and down so fast.

After searching the web for similar issues (I think "smoothen curve" gave me the most relevant results), I applied a "convolution" to the data, using the code provided in this answer.

I obtain:

with convolution chart

This is not satisfying, as I guess that some data points are just "wrong" and should be removed, not averaged.

Doing it by hand is quite easy as we can guess the curve:

by hand fixed chart

Here are the data and code to produce the second chart:

def smooth(y, box_pts):
    import numpy as np
    box = np.ones(box_pts)/box_pts
    return np.convolve(y, box, mode='same')


def load_data(f):
    from datetime import datetime as dt
    with open(f, "rt") as fd:
        X = []
        Y = []
        for line in fd.readlines():
            (x,y)=line.strip().split(" ")
            X.append(dt.fromtimestamp(int(x)))
            Y.append(float(y))
        return (X, Y)


import sys
(X,Y) = load_data(sys.argv[1])

from matplotlib.pyplot import plot, show
plot(X, Y,'b-')
plot(X, smooth(Y,19), 'g-', lw=2)
show()

I'm looking for an algorithm that would remove the "bad" values. Any ideas?

Setop
  • I think this is off-topic, unfortunately. Try Statistics Stack Exchange? You also haven't shared much about the data itself, which will certainly be important. – AMC Dec 22 '19 at 22:08
  • @AMC, I fixed the link to the data so the chart can be reproduced. I cross-posted to "Cross Validated"; we will see if it brings an answer... – Setop Dec 23 '19 at 11:51

1 Answer

Warning: this is a quick and dirty approach rather than one grounded in statistics. Looking at your data, the "bad" points vary a lot compared to the rest of the data. Therefore, if you look at the data in chunks of, say, 10 data points and take the standard deviation of each chunk, the chunks containing "bad" data should have a much higher standard deviation than the good data, which marks them for removal. NumPy provides a quick way of calculating the standard deviation with np.std.


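import numpy as np

bad = []  # indices of points marked for removal
for i in range(len(Y)):
    # standard deviation of the 10-point window around i, clipped at the edges
    std = np.std([Y[i + j] for j in range(-5, 5) if 0 <= i + j < len(Y)])
    if std > 5:
        bad.append(i)  # only mark here; removing inside the loop would shift the indices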
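
Once the points are marked, they still have to be dropped from both X and Y. The following is a minimal sketch of one way to do that, assuming the same 10-point window and threshold of 5 as above. The function name drop_noisy_points and its parameters half_window and threshold are made up for illustration and would need tuning for real sensor data.

```python
import numpy as np


def drop_noisy_points(x, y, half_window=5, threshold=5.0):
    """Return copies of x and y without the points whose local std exceeds threshold."""
    y_arr = np.asarray(y, dtype=float)
    n = len(y_arr)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Same window as range(-5, 5) above: indices i-5 .. i+4, clipped at the edges.
        lo = max(0, i - half_window)
        hi = min(n, i + half_window)
        if np.std(y_arr[lo:hi]) > threshold:
            keep[i] = False
    # Filter x with the same mask so timestamps and values stay aligned.
    x_kept = [xi for xi, k in zip(x, keep) if k]
    y_kept = y_arr[keep].tolist()
    return x_kept, y_kept
```

It could then be plotted with plot(*drop_noisy_points(X, Y), 'r-'). Note that every point whose window contains a spike is dropped, so the good neighbours of a "bad" value are removed along with it. That is the price of this quick approach.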