
I have a dataframe where each column represents a geographic point, and each row represents a minute in a day. The value of each cell is the flow of water at that point in CFS. Below is a graph of one of these time-flow series.

Basically, I need to calculate the absolute value of the max flow at each of these locations during the day, which in this case would be that hump of 187 cfs. However, there are instabilities, so DF.abs().max() returns 1197 cfs. I need to somehow remove the outliers from the calculation. As you can see, there is no pattern to the outliers, but if you look at the graph, no two consecutive points in time should differ by more than some x% change in flow. I should mention that there are 15K of these points, so the fastest solution is best.

Does anyone know how I can accomplish this in Python, or at least the statistical term for what I want to do? Thanks!
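For what it's worth, the "no two consecutive points should change by more than x%" rule I described can be sketched directly with pandas (the threshold x and the sample numbers below are made up for illustration; a spike also disqualifies the point right after it, since that point "jumps back"):

```python
import pandas as pd

# Hypothetical minute-by-minute flow: a smooth ramp to a true max of 187 cfs,
# plus one 1197-cfs spike outlier
flow = pd.Series([100, 110, 121, 1197, 133, 146, 160, 176, 187, 176, 160])

x = 0.5  # assumed threshold: flag any point that changes >50% vs. the previous minute
change = flow.pct_change().abs()
good = flow[change.fillna(0) <= x]  # note: also drops the point right after a spike

print(good.abs().max())  # the 1197 spike is excluded
```

This is vectorized, so it should stay fast even across 15K columns (e.g. via `df.pct_change().abs() <= x` as a mask on the whole DataFrame).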

[Graph: minute-by-minute flow at one location, showing a smooth hump peaking near 187 cfs plus scattered spike outliers up to 1197 cfs]

  • Does this answer your question? [Detect and exclude outliers in Pandas data frame](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame) – ScootCork Jul 06 '20 at 22:01
  • No, all those answers rely on Z score, standard deviation, or IQR. If you look at the graph above, while the true max is 187, there is an outlier with a value of 200, another of 150, etc, so any sensitivity cutoff wouldn't work. The solution needs to consider the points IN RELATION to their neighboring points. – openSourcerer Jul 06 '20 at 22:14
  • I'm sure there's some statistical word for it I've never heard and some scipy function will sort it out. – openSourcerer Jul 06 '20 at 22:15
  • Can You share sample data and manually annotate which points are outliers? I can see on the graph group of points that are far from main line. Do You consider it as a group of outliers? – ipj Jul 06 '20 at 22:39
  • OP, this is a really interesting and somewhat subtle problem if you look at it carefully. Essentially what you need to do is build a model of how the data should look when the sensor is working right, and then use that to classify points as being OK or strange in one or more ways. (And when you do classify points as strange, my advice is to output a separate report about them -- timestamps and values at which you found strange values, and how many there were.) First you need to sort out how to model OK/strange points, then think about calculations. Try stats.stackexchange.com about the model. – Robert Dodier Jul 07 '20 at 03:05

1 Answer


In my opinion, the statistical term you are looking for is *smoothing* or *denoising* the data.

Here is my try:

# Importing packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Creating a curve with a local maximum to simulate "ideal data"
x = np.arange(start=-1, stop=1, step=0.001)
y_ideal = 10**-(x**2)

# Adding some randomly distributed outliers to simulate "real data"
y_real = y_ideal.copy()
np.random.seed(0)
for i in range(50):
    x_index = np.random.choice(len(x))
    y_real[x_index] = np.random.randint(-3, 5)

# Denoising with Savitzky-Golay (window size = 501, polynomial order = 3)
y_denoised = savgol_filter(y_real, window_length=501, polyorder=3)
# You should optimize these values to fit your needs

# Getting the index of the maximum value from the "denoised data"
max_index = np.where(y_denoised == np.amax(y_denoised))[0]

# Recovering the maximum value and reporting
max_value = y_real[max_index][0]
print(f'The maximum value is around {max_value:.5f}')


Please, keep in mind that:

  1. This solution is approximate.

  2. You should tune the window_length and polyorder arguments passed to the savgol_filter() function for your data.

  3. If the region where your maximum is located is noisy, you can use max_value = y_denoised[max_index][0] instead of max_value = y_real[max_index][0].
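Applied to a DataFrame shaped like the one in the question (one column per location, one row per minute), the same filter can run on every column at once via the axis argument. A sketch, where the column names, curve shapes, and window size are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Hypothetical frame: 1440 minutes x a few locations (the question has ~15K columns)
minutes = np.arange(1440)
df = pd.DataFrame({
    'site_a': 187 * np.exp(-((minutes - 700) / 200) ** 2),
    'site_b': 90 * np.exp(-((minutes - 400) / 150) ** 2),
})
df.iloc[100, 0] = 1197  # inject a spike outlier into site_a

# Smooth every column in one vectorized call, then take the absolute max per location
smoothed = pd.DataFrame(
    savgol_filter(df.to_numpy(), window_length=101, polyorder=3, axis=0),
    index=df.index, columns=df.columns,
)
print(smoothed.abs().max())  # per-column daily max, with the 1197 spike flattened out
```

Because the filtering happens in a single NumPy call rather than a Python loop over columns, it should scale reasonably to 15K series.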

Note: This solution is based on this other Stack Overflow answer
