Remove peak lines

Question

I have a data file in which I save datetime, pH, and Temperature. Once in a while, the Temperature misses one digit as shown below:

12-08-2017_14:52:21 Temp: 28.9 pH: 7.670
12-08-2017_14:52:42 Temp: 28.9 pH: 7.672
12-08-2017_14:53:03 Temp: 28.9 pH: 7.672
12-08-2017_14:53:24 Temp: 8.91 pH: 7.667
12-08-2017_14:53:45 Temp: 28.9 pH: 7.667
12-08-2017_14:54:06 Temp: 28.9 pH: 7.669
12-08-2017_14:54:27 Temp: 28.9 pH: 7.671

I'd like to remove the whole line with the error. I've found some solutions like this, but I don't understand how to implement them in Python. Is there any specific way I should do it, either in Python or bash?

You can use some outlier detection procedure, like this https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list — DYZ, Aug 15 '17 at 01:06
Can you describe the "error" case any more precisely? Will good temperatures always have two leading digits? Never greater than 99? Negative? — Jeff Schaller, Aug 15 '17 at 10:22

score 1 · Answer 1 · answered Aug 15 '17 at 01:29

It depends a lot on the behavior you want and how complex a solution you are willing to accept. From the data you posted, I would suggest computing the difference between each measurement and the previous one, and rejecting any measurement that differs from it by more than some threshold. A quick and dirty example:

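import sys

THRESHOLD = 10
lastTemp = None

# Read lines from standard input until end of file
for raw in sys.stdin:
    line = raw.split()
    if len(line) < 3:
        continue  # skip blank or malformed lines
    temp = float(line[2])

    # Reject a reading that jumps too far from the last accepted one
    if lastTemp is not None and abs(temp - lastTemp) > THRESHOLD:
        continue

    # Remember this reading so the next line is compared against it
    lastTemp = temp

    # Process the line here
    print(raw.rstrip("\n"))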

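This skips any line whose temperature differs by more than 10 degrees from the reading it is compared against. It is suitable if measurements are taken at small enough intervals and no large temperature changes are expected. Note that the filter trusts the first reading, so a bad first line would cause the good lines after it to be rejected.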

A small improvement would be to consider the last few measurements, compute a prediction for the next one, and reject the new value if it is too far from the prediction. With only two previous points, the prediction is a linear extrapolation from the slope between them; with more points, somewhat more complicated approaches are required. Alternatively, more complex statistical approaches can be used.
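As a rough illustration of the two-point prediction idea, here is a minimal sketch. It assumes the readings are equally spaced in time, so the slope is just the difference between the last two accepted temperatures. The function name `filter_by_prediction`, the `threshold` parameter, and the `SAMPLE` list are made up for this example.

```python
def filter_by_prediction(lines, threshold=10.0):
    """Keep lines whose temperature is close to a linear prediction.

    Assumes equally spaced samples and that the temperature is the
    third whitespace-separated field, as in the question's data.
    """
    history = []  # up to the last two accepted temperatures
    kept = []
    for raw in lines:
        fields = raw.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        temp = float(fields[2])

        if len(history) == 2:
            # Extrapolate: last value plus the last observed change
            predicted = history[1] + (history[1] - history[0])
        elif len(history) == 1:
            predicted = history[0]
        else:
            predicted = temp  # nothing to compare against yet

        if abs(temp - predicted) > threshold:
            continue  # too far from the prediction, drop the line

        history = (history + [temp])[-2:]
        kept.append(raw)
    return kept


# Sample lines copied from the question
SAMPLE = [
    "12-08-2017_14:52:21 Temp: 28.9 pH: 7.670",
    "12-08-2017_14:52:42 Temp: 28.9 pH: 7.672",
    "12-08-2017_14:53:03 Temp: 28.9 pH: 7.672",
    "12-08-2017_14:53:24 Temp: 8.91 pH: 7.667",
    "12-08-2017_14:53:45 Temp: 28.9 pH: 7.667",
    "12-08-2017_14:54:06 Temp: 28.9 pH: 7.669",
    "12-08-2017_14:54:27 Temp: 28.9 pH: 7.671",
]

for line in filter_by_prediction(SAMPLE):
    print(line)
```

Only accepted readings enter `history`, so a single glitch does not distort the prediction for the lines that follow it.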

score 0 · Answer 2 · answered Aug 15 '17 at 10:57

Depending on how complex your actual data is, a simple awk solution could be:

awk '$3 >= 10 {print}' data

which, on your sample data, returns:

12-08-2017_14:52:21 Temp: 28.9 pH: 7.670
12-08-2017_14:52:42 Temp: 28.9 pH: 7.672
12-08-2017_14:53:03 Temp: 28.9 pH: 7.672
12-08-2017_14:53:45 Temp: 28.9 pH: 7.667
12-08-2017_14:54:06 Temp: 28.9 pH: 7.669
12-08-2017_14:54:27 Temp: 28.9 pH: 7.671

If your temperatures could be negative, such as these sample additions:

12-08-2017_14:54:27 Temp: -28.9 pH: 7.671
12-08-2017_14:54:27 Temp: -2.9 pH: 7.671

Then broaden the awk test:

awk '$3 >= 10 || $3 <= -10 {print}' data
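Since the question also asks about Python, the same magnitude test could be sketched as below. This is only an illustration under the same assumption as the awk version: that every valid temperature has at least two digits before the decimal point. The function name `keep_two_digit_temps` and the `minimum` parameter are hypothetical.

```python
def keep_two_digit_temps(lines, minimum=10.0):
    """Keep lines whose temperature magnitude is at least `minimum`.

    Mirrors the awk test `$3 >= 10 || $3 <= -10`: the temperature is
    taken from the third whitespace-separated field.
    """
    kept = []
    for raw in lines:
        fields = raw.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            temp = float(fields[2])
        except ValueError:
            continue  # third field is not a number
        if abs(temp) >= minimum:
            kept.append(raw)
    return kept


# Lines taken from the question and from the negative examples above
lines = [
    "12-08-2017_14:53:03 Temp: 28.9 pH: 7.672",
    "12-08-2017_14:53:24 Temp: 8.91 pH: 7.667",
    "12-08-2017_14:54:27 Temp: -28.9 pH: 7.671",
    "12-08-2017_14:54:27 Temp: -2.9 pH: 7.671",
]

for line in keep_two_digit_temps(lines):
    print(line)
```

Like the awk one-liner, this would wrongly drop genuine single-digit temperatures, so it only fits data where those cannot occur.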
