I have some data points (cluster sizes) and would like to calculate their average. However, some instantaneous peaks need to be removed first. These peaks are usually two or three times the normal value, but not always. Any suggestions would be appreciated. Thank you.

Some instant peaks because of coalescence:

eiche
  • Please provide a [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). What data type are your values? – Vladimir Fokow Aug 15 '22 at 04:47
  • E.g., you could remove a data point if it is greater than twice the rolling average of, let's say, 10 observations. Related: [Outlier detection based on the moving mean in Python](https://stackoverflow.com/q/62692771/14627505) – Vladimir Fokow Aug 15 '22 at 04:49
  • Does this answer your question? [Detect and exclude outliers in a pandas DataFrame](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe) – Vladimir Fokow Aug 15 '22 at 05:48
  • Please provide enough code so others can better understand or reproduce the problem. – Community Aug 15 '22 at 06:27

1 Answer

Assuming you have your data in a dataframe with two columns: 'time' and 'size', and that there are around 500 observations in total (so the window size 10 is sensible):

Calculate the median of a moving window.

If a value's 'size' is >= (the rolling median at that point * multiplier_thresh), consider this value an outlier and remove it:

wind_size = 10
multiplier_thresh = 1.5

# Calculate the rolling median, centered on each point
# (min_periods=1 avoids NaN values at the edges)
rolling_median = df['size'].rolling(window=wind_size, center=True, min_periods=1).median()

# Drop outliers
to_stay = df['size'] < rolling_median * multiplier_thresh
df_no_outliers = df[to_stay]

Mean of the values without the outliers:

df_no_outliers['size'].mean()
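
To try the moving-median filter end to end, here is a minimal sketch on synthetic data. The data itself is an assumption made up for illustration (sizes near 10, with three artificial peaks at hypothetical positions 50, 200 and 350), and `center=True` with `min_periods=1` is one possible way to center the window; tune `wind_size` and `multiplier_thresh` for your real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: cluster sizes near 10 with three injected peaks
rng = np.random.default_rng(0)
size = rng.normal(loc=10.0, scale=0.5, size=500)
peak_positions = [50, 200, 350]  # made-up peak locations
size[peak_positions] *= 2.5      # peaks are about 2.5x the normal value
df = pd.DataFrame({"time": np.arange(500), "size": size})

wind_size = 10
multiplier_thresh = 1.5

# Rolling median centered on each point; min_periods=1 avoids NaN at the edges
rolling_median = (
    df["size"].rolling(window=wind_size, center=True, min_periods=1).median()
)

# Keep only the values below the local threshold
to_stay = df["size"] < rolling_median * multiplier_thresh
df_no_outliers = df[to_stay]

raw_mean = df["size"].mean()
filtered_mean = df_no_outliers["size"].mean()
print(f"mean with peaks:    {raw_mean:.3f}")
print(f"mean without peaks: {filtered_mean:.3f}")
```

Because the median of a window barely moves when a single peak falls inside it, the threshold stays close to the normal level and only the peaks are dropped.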

A simpler approach:

Just remove the outliers from the full set of 'size' values, without using a moving window.

You can use a variety of methods to detect and remove the outliers.

Here is a simple one:

q1 = df["size"].quantile(0.25)
q3 = df["size"].quantile(0.75)
iqr = q3 - q1  # Interquartile range

df_no_outliers = df[df["size"] < q3 + 1.5 * iqr]
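
As a rough illustration of the IQR rule, the sketch below wraps it in a small helper and applies it to a tiny made-up sample. The helper name `remove_upper_outliers`, its parameter `k`, and the sample values are hypothetical and not part of pandas. Only the upper fence is checked, since the peaks in question are always too high.

```python
import pandas as pd


def remove_upper_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` is at or above q3 + k * IQR (hypothetical helper)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    return df[df[column] < q3 + k * iqr]


# Made-up sample: normal sizes around 10, with two peaks (30 and 25)
df = pd.DataFrame({"size": [10, 11, 9, 10, 12, 30, 10, 11, 9, 25]})

df_no_outliers = remove_upper_outliers(df, "size")
mean_without_peaks = df_no_outliers["size"].mean()
print(df_no_outliers["size"].tolist())
print(mean_without_peaks)
```

Note that this uses one global threshold, so it works best when the normal level does not drift much over time; if it does drift, the moving-window approach is likely the better fit.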
Vladimir Fokow