
Let's say I have the following dataframe, which contains values over time (by date):

import pandas as pd

df = pd.DataFrame(data={'date':['2020-10-16','2020-10-17','2020-10-18','2020-10-19','2020-10-20','2020-10-21','2020-10-22','2020-10-23','2020-10-24','2020-10-25','2020-10-26','2020-10-27','2020-10-28','2020-10-29','2020-10-30','2020-10-31','2020-11-01','2020-11-02','2020-11-03','2020-11-04','2020-11-05','2020-11-06','2020-11-07','2020-11-08','2020-11-09','2020-11-10','2020-11-11','2020-11-12','2020-11-13','2020-11-14','2020-11-15'],
                        'value':[161967, 161270, 148508, 152442, 157504, 157118, 155674, 134522, 213384, 163242, 217415, 221502, 146267, 143621, 145875, 139488, 104466, 94825, 143686, 151952, 161074, 161417, 135042, 148768, 131428, 127816, 151905, 180498, 177899, 193950, 12]})
df

Inspired by this answer, I tried to detect peaks and valleys with the code below:

from scipy.signal import find_peaks
import numpy as np
import matplotlib.pyplot as plt

# Input signal
t = df.date
x = df.value

# Threshold value (for height of peaks and valleys)
thresh = 0.95

# Find indices of peaks
peak_idx, _ = find_peaks(x, height=thresh, distance=10)

# Find indices of valleys (from inverting the signal)
valley_idx, _ = find_peaks(-x, height=thresh, distance=10 )

# Plot signal
plt.figure(figsize=(14,12))
plt.plot(t, x   , color='b', label='data')
plt.scatter(t, x, s=10,c='b',label='value')

# Plot threshold
plt.plot([min(t), max(t)], [thresh, thresh],   '--',  color='r', label='peaks-threshold')
plt.plot([min(t), max(t)], [-thresh, -thresh], '--',  color='g', label='valleys-threshold')

# Plot peaks (red) and valleys (green)
plt.plot(t[peak_idx], x[peak_idx],     "x", color='r', label='peaks')
plt.plot(t[valley_idx], x[valley_idx], "x", color='g', label='valleys')

plt.xticks(rotation=45)
plt.ylabel('value')
plt.xlabel('timestamp')
plt.title(f'data over time for username=target')
plt.legend( loc='upper left')
plt.gcf().autofmt_xdate()
plt.show()

This is the output:

[image: plot produced by the code above]

The problems:

  • I can't figure out how to configure find_peaks() (see its documentation) so that it picks out only meaningful, drastic peaks and valleys, i.e. global outliers relative to a threshold. I also checked this post, but it didn't help me find a cheap solution, and neither did the other libraries offered here.
  • The upper threshold (red dashed line) is missing from the plot!
Mario

1 Answer

  1. You need to specify height in the same domain (scale) as your data.
  2. The upper threshold is not missing. It is on the plot, but both threshold lines sit close to 0 and get lost at the bottom of the figure.
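Point 1 can be checked quickly. This is a minimal sketch on a short made-up series whose values are only on the same scale as the question's data (they are illustrative, not the real column), and the `height=200_000` cut-off is likewise an assumption chosen for the demonstration:

```python
import numpy as np
from scipy.signal import find_peaks

# Illustrative values on the same scale as the question's data (not the real series)
x = np.array([150_000, 160_000, 148_000, 213_000, 163_000,
              104_000, 95_000, 144_000, 161_000, 135_000])

# height=0.95 lies far below every sample, so every local maximum qualifies
peaks_low, _ = find_peaks(x, height=0.95)

# On the inverted signal every sample is negative, so none reaches +0.95
valleys_low, _ = find_peaks(-x, height=0.95)

# A height on the data's own scale keeps only the large maximum
peaks_scaled, _ = find_peaks(x, height=200_000)

print(peaks_low, valleys_low, peaks_scaled)
```

With a threshold near 0, the peak search accepts everything and the valley search accepts nothing, which is why the thresholds need to be derived from the data itself.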
thresh_top = np.median(x) + 1 * np.std(x)
thresh_bottom = np.median(x) - 1 * np.std(x)
# (you may want to use std calculated on 10-90 percentile data, without outliers)

# Find indices of peaks
peak_idx, _ = find_peaks(x, height=thresh_top)

# Find indices of valleys (from inverting the signal)
valley_idx, _ = find_peaks(-x, height=-thresh_bottom)

# Plot signal
plt.figure(figsize=(14,12))
plt.plot(t, x   , color='b', label='data')
plt.scatter(t, x, s=10,c='b',label='value')

# Plot threshold
plt.plot([min(t), max(t)], [thresh_top, thresh_top],   '--',  color='r', label='peaks-threshold')
plt.plot([min(t), max(t)], [thresh_bottom, thresh_bottom], '--',  color='g', label='valleys-threshold')

# Plot peaks (red) and valleys (green)
plt.plot(t[peak_idx], x[peak_idx],     "x", color='r', label='peaks')
plt.plot(t[valley_idx], x[valley_idx], "x", color='g', label='valleys')

plt.xticks(rotation=45)
plt.ylabel('value')
plt.xlabel('timestamp')
plt.title(f'data over time for username=target')
plt.legend( loc='upper left')
plt.gcf().autofmt_xdate()
plt.show()

[image: plot produced by the code above]
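The note in the code about computing the standard deviation on 10-90 percentile data can be sketched roughly as follows. The helper `robust_thresholds` and its parameters `k`, `lo` and `hi` are hypothetical names introduced here for illustration; they are not part of SciPy or of the code above. The array holds the `value` column from the question.

```python
import numpy as np
from scipy.signal import find_peaks

def robust_thresholds(values, k=1.0, lo=10, hi=90):
    """Hypothetical helper: return (bottom, top) thresholds as
    median -/+ k * std, with the std taken only over the values that
    lie between the lo-th and hi-th percentiles."""
    values = np.asarray(values, dtype=float)
    p_lo, p_hi = np.percentile(values, [lo, hi])
    core = values[(values >= p_lo) & (values <= p_hi)]  # drop the extremes
    center = np.median(values)
    spread = np.std(core)
    return center - k * spread, center + k * spread

# The 'value' column from the question
x = np.array([161967, 161270, 148508, 152442, 157504, 157118, 155674, 134522,
              213384, 163242, 217415, 221502, 146267, 143621, 145875, 139488,
              104466, 94825, 143686, 151952, 161074, 161417, 135042, 148768,
              131428, 127816, 151905, 180498, 177899, 193950, 12])

thresh_bottom, thresh_top = robust_thresholds(x, k=1.0, lo=10, hi=90)

# Peaks at or above the top threshold
peak_idx, _ = find_peaks(x, height=thresh_top)

# Valleys at or below the bottom threshold (found on the inverted signal)
valley_idx, _ = find_peaks(-x, height=-thresh_bottom)

print(thresh_bottom, thresh_top, peak_idx, valley_idx)
```

Widening the band (larger `k`) or trimming more aggressively (for example `lo=25, hi=75`) are the two knobs to tune for other data sets. Because the trimmed spread is usually smaller than the plain `np.std(x)`, `k` typically needs adjusting by eye.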

dankal444
  • Amazing! It's a very good idea to set the top/bottom thresholds with `np.median(x) -/+ 1 * np.std(x)` on the 10-90 percentile range of the target data. Is there any way to adjust this further, e.g. using the 25-75 percentile range? May I use `2*np.std(x)`? Your configuration is perfect for this case; I'm asking in case I need to tune the parameters to get the best-fitting thresholds in other cases. – Mario Nov 17 '21 at 00:01
  • As it is, the calculation of the standard deviation (`np.std(x)`) *may* (or may not) suffer from outliers. I prefer to limit the data to the 10-90 percentile range, calculate the std on that, and adjust the threshold to it. It will not be exactly the standard deviation then, but something a bit smaller; I do not care, since I adjust the threshold anyway. – dankal444 Nov 17 '21 at 00:11
  • One last question: is there a special reason you set `width=0` for `peak_idx, _` but not for `valley_idx, _`? Or is `0` the default? The above-mentioned documentation says *it is used for calculating the CWT matrix. In general, this range should cover the expected width of peaks of interest.* So what does that mean in my case? – Mario Nov 17 '21 at 00:29
  • Oh, I was playing with parameters and just forgot about it, removing.. – dankal444 Nov 17 '21 at 01:02
  • May I kindly draw your attention to another problem I faced in this context [here](https://stackoverflow.com/questions/70010260/problem-bug-with-list-values-reading-from-spark-dataframe-during-plotting-spikes). I couldn't figure out how to solve it. I also provided a Google [Colab Notebook](https://colab.research.google.com/drive/13Fz__TUJSWpwVVTDVPRJeyvPsremKnfM?usp=sharing) for quick debugging, so feel free to check it out and run/test/edit it. – Mario Nov 17 '21 at 20:20
  • Vlad gave you a good answer there – dankal444 Nov 17 '21 at 21:52