395

Update: The best performing algorithm so far is this one.


This question explores robust algorithms for detecting sudden peaks in real-time timeseries data.

Consider the following example data:

Plot of data

Example of this data is in Matlab format (but this question is not about the language but about the algorithm):

p = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1 1 1 1.1 0.9 1 1.1 1 1 0.9, ...
     1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1 1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1 1, ... 
     3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];

You can clearly see that there are three large peaks and some small peaks. This dataset is a specific example of the class of timeseries datasets that the question is about. This class of datasets has two general features:

  1. There is basic noise with a general mean
  2. There are large 'peaks' or 'higher data points' that significantly deviate from the noise.

Let's also assume the following:

  • The width of the peaks cannot be determined beforehand
  • The height of the peaks significantly deviates from the other values
  • The algorithm updates in realtime (so updates with each new datapoint)

For such a situation, a boundary value needs to be constructed which triggers signals. However, the boundary value cannot be static and must be determined realtime based on an algorithm.


My Question: what is a good algorithm to calculate such thresholds in realtime? Are there specific algorithms for such situations? What are the most well-known algorithms?


Robust algorithms or useful insights are all highly appreciated. (can answer in any language: it's about the algorithm)

Jean-Paul
  • 19,910
  • 9
  • 62
  • 88

40 Answers40

614

Robust peak detection algorithm (using z-scores)

I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from a moving mean, the algorithm gives a signal. The algorithm is very robust because it constructs a separate moving mean and deviation, such that previous signals do not corrupt the signalling threshold for future signals. The sensitivity of the algorithm is therefore robust to previous signals.

The algorithm takes 3 inputs:

  1. Lag. The lag of the moving window that calculates the mean and standard deviation of historical data. A longer window takes more historical data in account. A shorter window is more adaptive, such that the algorithm will adapt to new information more quickly. For example, a lag of 5 will use the last 5 observations to smooth the data.
  2. Threshold. The "z-score" at which the algorithm signals. Simply put, if the distance between a new datapoint and the moving mean is larger than the threshold multiplied with the moving standard deviation of the data, the algorithm provides a signal. For example, a threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean.
  3. Influence. The influence (between 0 and 1) of new signals on the calculation of the moving mean and moving standard deviation. For example, an influence parameter of 0.5 gives new signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold. An influence of 0 is therefore the most robust option (but assumes stationarity); putting the influence option at 1 is least robust. For non-stationary data, the influence option should therefore be put between 0 and 1.

It works as follows:

Pseudocode

# Let y be a vector of timeseries data of at least length lag+2
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviaton
# Let absolute() be the absolute value function

# Settings (these are examples: choose what is best for your data!)
set lag to 5;          # average and std. are based on past 5 observations
set threshold to 3.5;  # signal when data point is 3.5 std. away from average
set influence to 0.5;  # between 0 (no influence) and 1 (full influence)

# Initialize variables
set signals to vector 0,...,0 of length of y;   # Initialize signal results
set filteredY to y(1),...,y(lag)                # Initialize filtered series
set avgFilter to null;                          # Initialize average filter
set stdFilter to null;                          # Initialize std. filter
set avgFilter(lag) to mean(y(1),...,y(lag));    # Initialize first value average
set stdFilter(lag) to std(y(1),...,y(lag));     # Initialize first value std.

for i=lag+1,...,t do
  if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
    if y(i) > avgFilter(i-1) then
      set signals(i) to +1;                     # Positive signal
    else
      set signals(i) to -1;                     # Negative signal
    end
    set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
  else
    set signals(i) to 0;                        # No signal
    set filteredY(i) to y(i);
  end
  set avgFilter(i) to mean(filteredY(i-lag+1),...,filteredY(i));
  set stdFilter(i) to std(filteredY(i-lag+1),...,filteredY(i));
end

Rules of thumb for selecting good parameters for your data can be found below.


Demo

Demonstration of robust thresholding algorithm

The Matlab code for this demo can be found here. To use the demo, simply run it and create a time series yourself by clicking on the upper chart. The algorithm starts working after drawing lag number of observations.


Result

For the original question, this algorithm will give the following output when using the following settings: lag = 30, threshold = 5, influence = 0:

Thresholding algorithm example


Implementations in different programming languages:


Rules of thumb for configuring the algorithm

Lag. The lag parameter determines how much your data will be smoothed and how adaptive the algorithm is to changes in the long-term average of the data. The more stationary your data is, the more lags you should include (this should improve the robustness of the algorithm). If your data contains time-varying trends, you should consider how quickly you want the algorithm to adapt to these trends. I.e., if you put lag at 10, it takes 10 'periods' before the algorithm's treshold is adjusted to any systematic changes in the long-term average. So choose the lag parameter based on the trending behavior of your data and how adaptive you want the algorithm to be.

Influence. This parameter determines the influence of signals on the algorithm's detection threshold. If put at 0, signals have no influence on the threshold, such that future signals are detected based on a threshold that is calculated with a mean and standard deviation that is not influenced by past signals. If put at 0.5, signals have half the influence of normal data points. Another way to think about this is that if you put the influence at 0, you implicitly assume stationarity (i.e. no matter how many signals there are, you always expect the time series to return to the same average over the long term). If this is not the case, you should put the influence parameter somewhere between 0 and 1, depending on the extent to which signals can systematically influence the time-varying trend of the data. E.g., if signals lead to a structural break of the long-term average of the time series, the influence parameter should be put high (close to 1) so the threshold can react to structural breaks quickly.

Threshold. The threshold parameter is the number of standard deviations from the moving mean above which the algorithm will classify a new datapoint as being a signal. For example, if a new datapoint is 4.0 standard deviations above the moving mean and the threshold parameter is set as 3.5, the algorithm will identify the datapoint as a signal. This parameter should be set based on how many signals you expect. For example, if your data is normally distributed, a threshold (or: z-score) of 3.5 corresponds to a signaling probability of 0.00047 (from this table), which implies that you expect a signal once every 2128 datapoints (1/0.00047). The threshold therefore directly influences how sensitive the algorithm is and thereby also determines how often the algorithm signals. Examine your own data and choose a sensible threshold that makes the algorithm signal when you want it to (some trial-and-error might be needed here to get to a good threshold for your purpose).


WARNING: The code above always loops over all datapoints everytime it runs. When implementing this code, make sure to split the calculation of the signal into a separate function (without the loop). Then when a new datapoint arrives, update filteredY, avgFilter and stdFilter once. Do not recalculate the signals for all data everytime there is a new datapoint (like in the example above), that would be extremely inefficient and slow in real-time applications.

Other ways to modify the algorithm (for potential improvements) are:

  1. Use median instead of mean
  2. Use a robust measure of scale, such as the median absolute deviation (MAD), instead of the standard deviation
  3. Use a signalling margin, so the signal doesn't switch too often
  4. Change the way the influence parameter works
  5. Treat up and down signals differently (asymmetric treatment)
  6. Create a separate influence parameter for the mean and std (as in this Swift translation)

(Known) academic citations to this StackOverflow answer:

Other works using the algorithm from this answer

Other applications of the algorithm from this answer

Links to other peak detection algorithms


How to reference this algorithm:

Brakel, J.P.G. van (2014). "Robust peak detection algorithm using z-scores". Stack Overflow. Available at: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 (version: 2020-11-08).

Bibtex @misc{brakel2014, author = {Brakel, J.P.G van}, title = {Robust peak detection algorithm using z-scores}, url = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362}, language = {en}, year = {2014}, urldate = {2022-04-12}, journal = {Stack Overflow}, howpublished = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362}}


If you use this function somewhere, please credit me by using above reference. If you have any questions about the algorithm, post them in the comments below or contact me on LinkedIn.


Jean-Paul
  • 19,910
  • 9
  • 62
  • 88
  • The link to movingstd is broken, but you can find a description of it [here](http://www.mathworks.com/matlabcentral/newsreader/view_thread/123849) – Phylliida Jun 09 '16 at 23:29
  • @reasra Turns out the function doesn't need a moving standard deviation after rewriting. It can now be used with simple built-in Matlab functions :) – Jean-Paul Oct 26 '16 at 07:52
  • 1
    I'm trying the Matlab code for some accelerometer data, but for some reason the `threshold` graph just becomes a flat green line after a big spike up to 20 in the data, and it stays like that for the rest of the graph... If I remove the sike, this doesn't happen, so it seems to be caused by the spike in the data. Any idea what could be going on? I'm a newbie in Matlab, so I can't figure it out... – Magnus Apr 28 '17 at 03:43
  • @BadCash Can you provide an example (with the data)? Perhaps ask your own question here on SO and tell me the link? – Jean-Paul Apr 28 '17 at 07:11
  • @Jean-Paul I did :) http://stackoverflow.com/q/43666284/930640 The data for the graphs in the question is about 2000 points, so a bit too big to post on SO... – Magnus Apr 28 '17 at 11:36
  • 2
    There are many ways to improve this algo, so be creative (different treatment up/ down; median instead of mean; robust std; writing the code as a memory-efficient function; threshold margin so the signal doesn't switch too often, etc.). – Jean-Paul Aug 01 '17 at 11:40
  • @TheDataScientician You can make it more robust yourself by setting e.g. the `influence` parameter to 0 or by using the MAD instead of a simple standard deviation. This algorithm is very robust to white noise because the threshold is specified as x number of standard deviations away from the mean, which is the same principle as, for example, t-statistics. In other words, you can specify the probability of rejection yourself by specifying the threshold. Assuming Gaussian white noise, a threshold of 6 would be equivalent to approximately a six sigma event. – Jean-Paul Jun 05 '18 at 13:29
  • @Jean-Paul what happens if the sigma is 0? Edit: I agree that some of the modifications you proposed make it more robust, yes. – The Data Scientician Jun 05 '18 at 13:36
  • @TheDataScientician Your argument is out of scope: check the question at the top of this page. The assumption stated there is: *"There is basic noise with a general mean"* – Jean-Paul Jun 05 '18 at 15:06
  • This algorithm might only work if the distribution of peak heights is Gaussian, or nearly so. It could fail (badly) if the distribution of peak heights is fat or heavy tailed. – hotpaw2 Jan 09 '19 at 01:54
  • @hotpaw2 Agreed, but the algorithm can easily be adjusted to use robust statistics instead (e.g. MAD estimates of Cauchy [t=1] vs Gaussian [N(0,1)] will be approximately the same). Additionally, for a higher frequency of signals one can also tune the `threshold` and `influence` parameters to be more strict (so even if a z-score of e.g. 2.5 corresponds to a signaling probability larger than 0.6%, one can simply adjust the `threshold` to be stricter) – Jean-Paul Jan 09 '19 at 08:50
  • Jean-Paul, I am trying to use this algorithm(C++) in my project to detect peaks(lower), but I have difficulty in choosing the optimal lag, and threshold value. I have a sequence of number which are median pixel values of an ecg paper(https://ibb.co/cDbCZh7) and I want to detect the grids which are basically the lower peaks links(https://ibb.co/T87kW9P). However, inside a gird there are also smaller grids which I am not interested(do not want to detect them). Using this algorithm how am I able to do that so that it would work for most ecg paper a user imported? Thanks in advance. – Konjit W Jul 06 '20 at 10:36
  • Is there a sort of canonical test set of expected inputs/outputs with chosen factors? I wanted to write a more-stream-y/less-array-ie implementation, but would love to have a test set to test against. – Travis Griggs May 23 '21 at 01:21
  • @Jean-Paul how do you recommend to treat cases where the peak/valley is at beginning of the time series before the lag (for example the first data point in the series is way lower than the rest of the series but cannot still be detected with IQR method)? – datapug Jul 07 '21 at 10:46
  • @datapug Just treat it exactly the same way. As soon as there are `lag` number of observations the algorithm should then trigger a signal that the data has a significant peak, telling you that the first data points were significantly lower. You can make this setup more adaptive by using a shorter `lag`. Otherwise, there is nothing you can do: any algorithm will need to have context before it can provide signals. – Jean-Paul Jul 07 '21 at 10:51
  • @Jean-Paul I was thinking to reverse the time-series and run the algo again. What do you think about this (beside not very efficient I know..)? Otherwise, it might not triggering always the peaks without being too sensitive in case like [1, 1, 100, 101, 100, 100, 101, 110, 103, 105] – datapug Jul 07 '21 at 10:59
  • 2
    @datapug The algorithm is specifically designed to work on real-time data, so future values are unknown at the moment of calculating the signal. Do you have ex-ante information about the entire time-series? In that case you can just reverse the data indeed. – Jean-Paul Jul 07 '21 at 13:10
  • @Jean-Paul The whole "influence" is the same as implementing Exponential moving average? I'm just making sure I understand it right. – HBZ Aug 19 '21 at 07:24
  • 1
    @HBZ Correct. This is directly related to an EMA filter indeed. Big difference to standard moving average filters of course is that it’s only applied if a signal has been identified so the user can control robustness. – Jean-Paul Aug 19 '21 at 08:48
  • @Jean-Paul when influence is set to zero the average remains the same - which is great, but the standard deviation is changing to zero after lag signals arrive. is this on purpose? doesn't feel right to me... – Yitzchak Sep 30 '21 at 10:22
  • @Yitzchak When influence is set to zero, the standard deviation is calculated over the actual data (ignoring signals). If your average data has a standard deviation larger than zero, the standard deviation threshold will also be larger than zero. See [demonstration here](https://i.imgur.com/IfPCFU1.gif). – Jean-Paul Sep 30 '21 at 11:05
  • 2
    @Jean-Paul, Yeah now I see.. my issue was I tried to simulate a peak which caused some issue which I can't explain.. See here: https://imgur.com/a/GFz59jl As you can see - after the signals are getting back to original values - the standard deviation stays 0 – Yitzchak Sep 30 '21 at 11:27
  • 2
    @Yitzchak I see. Indeed the algorithm assumes that the duration of the signals is less than the duration of the peak. In your case indeed the st.dev. will tend to zero (because every `filteredY(i) = filteredY(i-1)`). If you want to make the algorithm work for your case (`influence = 0`), a quick-and-dirty solution is to change the line `set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);` to `set filteredY(i) to filteredY(i-lag)`. That way the threshold will simply recycle values from before the peak occurred. See [demonstration here](https://i.imgur.com/A2ZJVgn.gif). – Jean-Paul Sep 30 '21 at 11:55
  • @jean-paul , if the entire dataset is known in advance, is there a way to optimize the algorithm? Can we more optically set the arguments? – Dmytro Bogatov Jan 03 '22 at 03:03
  • 1
    @DmytroBogatov If the entire dataset is known in advance you can use both historical and future values to calculate your moving average filter. Hence in that case you can center your moving average window around the observation you are trying to find the peak for. There are also plenty of other algorithm you can use if you know all the data in advance. See for example https://docs.scipy.org/doc/scipy/reference/signal.html#peak-finding – Jean-Paul Jan 03 '22 at 09:30
  • Imagine you have a gaussian background signal, where you know that no rare event occurred. You use it to find std and then you test a suspicious part of signal against this std. If any value exceeds 2*std, you signal a rare event. How do you estimate the rate of rare events? If it is gaussian then 5% of points should be above 2 * std. Then, whether a rare point is found depends on the number of points we test (and their autocorrelation), it is already quite likely in only 20 points. Seems to me like a complex first-passage problem, far from a simple 2 sigma - 5 % rule. Any thoughts on that ? – Maciek Feb 16 '22 at 19:11
  • @Maciek If the signal has gaussian background noise, you can directly calculate the signalling probability. For example, if you classify both positive and negative deviations as signals, a standard deviation higher than +/- 3.5 has a CDF probability of 0.00047. Which implies you'll get a signal approximately every 2128 datapoints. The probabilities/frequencies can be mapped to a standard deviation threshold using a standard two-tailed t-statistic table [like this one](https://i.imgur.com/vS08Kdt.png). – Jean-Paul Feb 22 '22 at 12:50
  • @Jean-Paul Thank you, I already do it like that. The problem, however, is that it is not only about the distribution of the signal values (whether they come from gaussian or not), it is also about their temporal correlation - this is why I mentioned first passage time. Note that for the white noise limit the solution depends on the number of points, thus sampling frequency, which is not physical. In the fist passage approach, it does not. – Maciek Feb 23 '22 at 13:09
  • @Maciek Agreed. The probability calculations only work if we assume the datapoints are independently distributed. Otherwise the calculation becomes more involved indeed. – Jean-Paul Feb 23 '22 at 15:46
  • @Jean-Paul Can this algorithm be used to slit velocity graph as show in this picture [Velocity graph](https://imgur.com/FregYek) I tried different settings for lag, influence, threshold but does not help. – Bhupiister singh Feb 24 '22 at 13:20
  • @Bhupiistersingh Velocity behaves quite nicely, so you don't even need this algorithm. Just check the first difference of the timeseries (new datapoint - old datapoint) and signal when the absolute value of the difference is larger than some predefined threshold. That should pick up all your signals. In addition, you may add a moving window to smooth out small differences. I hope that helps! Good luck – Jean-Paul Feb 24 '22 at 13:42
  • @Jean-Paul because of higher sampling rate(400 hz) should i compute mean of a fixed window and compare it with the next window? – Bhupiister singh Feb 24 '22 at 13:47
  • @Bhupiistersingh Yes, that is one way to do it. Unfortunately comments are not for extended discussions. Please feel free to post a question on https://dsp.stackexchange.com/ with your data and explanation of what you are trying to achieve. – Jean-Paul Feb 24 '22 at 13:52
  • 1
    If anyone wants a bibtex to reference this algorithm: ` @misc{69, author = {Brakel, J.P.G van}, title = {Robust peak detection algorithm using z-scores}, url = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362}, language = {en}, year = {2014}, urldate = {2022-04-12}, journal = {Stack Overflow}, howpublished = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362} } ` – user2906838 May 12 '22 at 09:00
  • 1
    There is something wrong with brad's C++ implementation and it should be removed from this list. I am also suspecting it is why I was unable to tune the algorithm properly. – Jason C Nov 12 '22 at 17:36
  • Yeah; with a correct implementation it works as intended. Unfortunately, the major weakness of this algorithm is when your dataset has significant peaks in < *lag* samples, since it skips the first *lag* data points. The algorithm could (and should) probably be modified to check for peaks in the first *lag* points using the first *lag* mean (or building up the mean). Alternatively, if you know you don't have meaningful peaks at the *end* of the data set, it can be run against a reversed data set (which is what I ended up doing -- dataset was power spectrum of vehicle engine noise). – Jason C Nov 12 '22 at 18:24
  • 1
    I've created a [C++ version of the demo](https://stackoverflow.com/a/74416006/616460) with interactive parameter adjustments. – Jason C Nov 12 '22 at 19:42
  • 1
    Reversing the dataset and setting the lag and influence high works *really* well for power spectra. – Jason C Nov 12 '22 at 20:28
  • @Jean-Paul I can't make this work with my dataset... I am trying also an FFT approach. You help would be very valuable: https://stackoverflow.com/questions/74836280/how-can-i-get-the-beats-per-minute-using-fft-on-the-amplitude-data – Zibri Dec 17 '22 at 18:10
63

Here is the Python / numpy implementation of the smoothed z-score algorithm (see answer above). You can find the gist here.

#!/usr/bin/env python
# Implementation of algorithm from https://stackoverflow.com/a/22640362/6029703
import numpy as np
import pylab

def thresholding_algo(y, lag, threshold, influence):
    signals = np.zeros(len(y))
    filteredY = np.array(y)
    avgFilter = [0]*len(y)
    stdFilter = [0]*len(y)
    avgFilter[lag - 1] = np.mean(y[0:lag])
    stdFilter[lag - 1] = np.std(y[0:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avgFilter[i-1]) > threshold * stdFilter [i-1]:
            if y[i] > avgFilter[i-1]:
                signals[i] = 1
            else:
                signals[i] = -1

            filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i-1]
            avgFilter[i] = np.mean(filteredY[(i-lag+1):i+1])
            stdFilter[i] = np.std(filteredY[(i-lag+1):i+1])
        else:
            signals[i] = 0
            filteredY[i] = y[i]
            avgFilter[i] = np.mean(filteredY[(i-lag+1):i+1])
            stdFilter[i] = np.std(filteredY[(i-lag+1):i+1])

    return dict(signals = np.asarray(signals),
                avgFilter = np.asarray(avgFilter),
                stdFilter = np.asarray(stdFilter))

Below is the test on the same dataset that yields the same plot as in the original answer for R/Matlab

# Data
y = np.array([1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1])

# Settings: lag = 30, threshold = 5, influence = 0
lag = 30
threshold = 5
influence = 0

# Run algo with settings from above
result = thresholding_algo(y, lag=lag, threshold=threshold, influence=influence)

# Plot result
pylab.subplot(211)
pylab.plot(np.arange(1, len(y)+1), y)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"], color="cyan", lw=2)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"] + threshold * result["stdFilter"], color="green", lw=2)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"] - threshold * result["stdFilter"], color="green", lw=2)

pylab.subplot(212)
pylab.step(np.arange(1, len(y)+1), result["signals"], color="red", lw=2)
pylab.ylim(-1.5, 1.5)
pylab.show()
R Kiselev
  • 1,124
  • 1
  • 9
  • 18
  • Over here 'y' is actually the signal and 'signals' is the set of data points, am I correct in understanding? – TheTank Mar 22 '19 at 17:17
  • 2
    @TheTank `y` is the data array you pass in, `signals` is the `+1` or `-1` output array that indicate for each datapoint `y[i]` whether that datapoint is a "significant peak" given the settings you use. – Jean-Paul Nov 15 '19 at 08:37
31

One approach is to detect peaks based on the following observation:

  • Time t is a peak if (y(t) > y(t-1)) && (y(t) > y(t+1))

It avoids false positives by waiting until the uptrend is over. It is not exactly "real-time" in the sense that it will miss the peak by one dt. sensitivity can be controlled by requiring a margin for comparison. There is a trade off between noisy detection and time delay of detection. You can enrich the model by adding more parameters:

  • peak if (y(t) - y(t-dt) > m) && (y(t) - y(t+dt) > m)

where dt and m are parameters to control sensitivity vs time-delay

This is what you get with the mentioned algorithm: enter image description here

here is the code to reproduce the plot in python:

import numpy as np
import matplotlib.pyplot as plt
input = np.array([ 1. ,  1. ,  1. ,  1. ,  1. ,  1. ,  1. ,  1.1,  1. ,  0.8,  0.9,
    1. ,  1.2,  0.9,  1. ,  1. ,  1.1,  1.2,  1. ,  1.5,  1. ,  3. ,
    2. ,  5. ,  3. ,  2. ,  1. ,  1. ,  1. ,  0.9,  1. ,  1. ,  3. ,
    2.6,  4. ,  3. ,  3.2,  2. ,  1. ,  1. ,  1. ,  1. ,  1. ])
signal = (input > np.roll(input,1)) & (input > np.roll(input,-1))
plt.plot(input)
plt.plot(signal.nonzero()[0], input[signal], 'ro')
plt.show()

By setting m = 0.5, you can get a cleaner signal with only one false positive: enter image description here

aha
  • 4,314
  • 4
  • 23
  • 27
  • How would I go about changing the sensitivity? – Jean-Paul Mar 23 '14 at 00:56
  • I can think of two approaches: 1: set **m** to a larger value so that only larger peaks are detected. 2: instead of calculating y(t) - y(t-dt) (and y(t) - y(t+dt)), you integrate from t-dt to t (and t to t+dt). – aha Mar 23 '14 at 01:16
  • 3
    By what criteria are you rejecting the other 7 peaks? – hotpaw2 Mar 23 '14 at 14:09
  • 6
    There is a problem with flat peaks, since what you do is basicly 1-D edge detection (like convoluting the signal with [1 0 -1]) – ben Mar 26 '14 at 07:19
24

In signal processing, peak detection is often done via wavelet transform. You basically do a discrete wavelet transform on your time series data. Zero-crossings in the detail coefficients that are returned will correspond to peaks in the time series signal. You get different peak amplitudes detected at different detail coefficient levels, which gives you multi-level resolution.

cklin
  • 900
  • 4
  • 16
  • 1
    Your answer let me to [this article](http://en.wikipedia.org/wiki/Change_detection) and [this answer](http://stackoverflow.com/questions/12851208/how-to-detect-significant-change-trend-in-a-time-series-data/13660660#13660660) which will help me construct a good algorithm for my implementation. Thanks! – Jean-Paul Mar 31 '14 at 21:32
20

Python version that works with real-time streams (doesn't recalculate all data points on arrival of each new data point). You may want to tweak what the class function returns - for my purposes I just needed the signals.

import numpy as np


class real_time_peak_detection():
    def __init__(self, array, lag, threshold, influence):
        self.y = list(array)
        self.length = len(self.y)
        self.lag = lag
        self.threshold = threshold
        self.influence = influence
        self.signals = [0] * len(self.y)
        self.filteredY = np.array(self.y).tolist()
        self.avgFilter = [0] * len(self.y)
        self.stdFilter = [0] * len(self.y)
        self.avgFilter[self.lag - 1] = np.mean(self.y[0:self.lag]).tolist()
        self.stdFilter[self.lag - 1] = np.std(self.y[0:self.lag]).tolist()

    def thresholding_algo(self, new_value):
        self.y.append(new_value)
        i = len(self.y) - 1
        self.length = len(self.y)
        if i < self.lag:
            return 0
        elif i == self.lag:
            self.signals = [0] * len(self.y)
            self.filteredY = np.array(self.y).tolist()
            self.avgFilter = [0] * len(self.y)
            self.stdFilter = [0] * len(self.y)
            self.avgFilter[self.lag] = np.mean(self.y[0:self.lag]).tolist()
            self.stdFilter[self.lag] = np.std(self.y[0:self.lag]).tolist()
            return 0

        self.signals += [0]
        self.filteredY += [0]
        self.avgFilter += [0]
        self.stdFilter += [0]

        if abs(self.y[i] - self.avgFilter[i - 1]) > (self.threshold * self.stdFilter[i - 1]):

            if self.y[i] > self.avgFilter[i - 1]:
                self.signals[i] = 1
            else:
                self.signals[i] = -1

            self.filteredY[i] = self.influence * self.y[i] + \
                (1 - self.influence) * self.filteredY[i - 1]
            self.avgFilter[i] = np.mean(self.filteredY[(i - self.lag):i])
            self.stdFilter[i] = np.std(self.filteredY[(i - self.lag):i])
        else:
            self.signals[i] = 0
            self.filteredY[i] = self.y[i]
            self.avgFilter[i] = np.mean(self.filteredY[(i - self.lag):i])
            self.stdFilter[i] = np.std(self.filteredY[(i - self.lag):i])

        return self.signals[i]
matan h
  • 900
  • 1
  • 10
  • 19
delica
  • 1,647
  • 13
  • 17
14

In computational topology the idea of persistent homology leads to an efficient – fast as sorting numbers – solution. It does not only detect peaks, it quantifies the "significance" of the peaks in a natural way that allows you to select the peaks that are significant for you.

Algorithm summary. In a 1-dimensional setting (time series, real-valued signal) the algorithm can be easily described by the following figure:

Most persistent peaks

Think of the function graph (or its sub-level set) as a landscape and consider a decreasing water level starting at level infinity (or 1.8 in this picture). While the level decreases, at local maxima islands pop up. At local minima these islands merge together. One detail in this idea is that the island that appeared later in time is merged into the island that is older. The "persistence" of an island is its birth time minus its death time. The lengths of the blue bars depict the persistence, which is the above mentioned "significance" of a peak.

Efficiency. It is not too hard to find an implementation that runs in linear time – in fact it is a single, simple loop – after the function values were sorted. So this implementation should be fast in practice and is easily implemented, too.

References. A write-up of the entire story and references to the motivation from persistent homology (a field in computatioal algebraic topology) can be found here: https://www.sthu.org/blog/13-perstopology-peakdetection/index.html

S. Huber
  • 933
  • 13
  • 25
  • This algorithim is much faster and more accurate than, for example, scipy.signal.find_peaks. For a "real" time-series with 1053896 data points, it detected 137516 peaks (13%). The order of the peaks (most significant first) allows the most significant peaks to be extracted. It provides the start, peak, and end of each peak. Works well with noisy data. – vinh Sep 02 '19 at 14:30
  • 1
    By real-time data you mean a so-called online algorithm, where data points are received time after time. The significance of a peak might be determined by values in the future. It would be nice to extend the algorithm to become online by modifying the past results without sacrificing the time complexity too much. – S. Huber Oct 09 '19 at 12:54
  • The length of the blue bars don't make sense to me. It looks like they always refer to the preceding local minimum, but never refer to the following one. They should refer to both in my opinion, which means that #1 and 3 would be shorter. – Tobias Knauss Mar 13 '22 at 09:18
  • First, it is true that the blue bars start at local minimum. However, it is not always the local minimum next left. In fact, it does even need to be the next right one neither. It is the one that causes the merge of the components during the watersheding process. In this particular real-world example it only seems that way because it is in the nature of frequency-response curves that they have a declining trend with vanishing oscillation. But if you look closely at #3 then a small local minimum to the left is actually skipped. – S. Huber Mar 14 '22 at 12:38
  • Moreover, we can generalize this actually to 2D and here "left" and "right" has lost its meaning. See https://www.sthu.org/code/codesnippets/imagepers.html and the two stackoverflow questions linked. – S. Huber Mar 14 '22 at 12:42
  • 1
    I implemented the same algorithm in C++ which is about 45x faster than the Python implementation linked above. The C++ implementation is available [here](https://gitlab.com/balping/find-peaks). Enjoy. – balping Apr 19 '22 at 00:23
  • @S.Huber: Even in 2D you could create blue bars appropriately: The length of the bar should be peak minus closest local minimum, where 'closest' relates to the linear distance in the XY plane. – Tobias Knauss Apr 23 '22 at 16:45
  • Still it is wrong. This is just not how peristent homology or watersheding works. (Btw, the usual notion for "linear distance" is "Euclidean distance".) – S. Huber Apr 24 '22 at 18:07
14

We’ve attempted to use the smoothed z-score algorithm on our dataset, which results in either oversensitivity or undersensitivity (depending on how the parameters are tuned), with little middle ground. In our site’s traffic signal, we’ve observed a low frequency baseline which represents the daily cycle and even with the best possible parameters (shown below), it still trailed off especially on the 4th day because most of the data points are recognized as anomaly.

Building on top of the original z-score algorithm, we came up a way to solve this problem by reverse filtering. The details of the modified algorithm and its application on TV commercial trafic attribution are posted on our team blog.

enter image description here

jbm
  • 2,575
  • 16
  • 15
  • 2
    Cool to see that the algorithm was a starting point for your more advanced version. Your data has a very particular pattern, so it would indeed make more sense to first remove the pattern using some other technique and then apply the algo on the residuals. Alternatively, you might want to use a centered instead of a lagging window to calculate the average/ st.dev. Another comment: your solution moves from the right to the left to identify spikes, but this is not possible in real time applications (that's why the original algo is so simplistic, because future information is inaccessible). – Jean-Paul Dec 12 '18 at 11:45
13

Appendix 1 to original answer: Matlab and R translations

Matlab code

function [signals,avgFilter,stdFilter] = ThresholdingAlgo(y,lag,threshold,influence)
% Initialise signal results
signals = zeros(length(y),1);
% Initialise filtered series
filteredY = y(1:lag+1);
% Initialise filters
avgFilter(lag+1,1) = mean(y(1:lag+1));
stdFilter(lag+1,1) = std(y(1:lag+1));
% Loop over all datapoints y(lag+2),...,y(t)
for i=lag+2:length(y)
    % If new value is a specified number of deviations away
    if abs(y(i)-avgFilter(i-1)) > threshold*stdFilter(i-1)
        if y(i) > avgFilter(i-1)
            % Positive signal
            signals(i) = 1;
        else
            % Negative signal
            signals(i) = -1;
        end
        % Make influence lower
        filteredY(i) = influence*y(i)+(1-influence)*filteredY(i-1);
    else
        % No signal
        signals(i) = 0;
        filteredY(i) = y(i);
    end
    % Adjust the filters
    avgFilter(i) = mean(filteredY(i-lag:i));
    stdFilter(i) = std(filteredY(i-lag:i));
end
% Done, now return results
end

Example:

% Data
y = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1,...
    1 1 1.1 0.9 1 1.1 1 1 0.9 1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1,...
    1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1,...
    1 3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];

% Settings
lag = 30;
threshold = 5;
influence = 0;

% Get results
[signals,avg,dev] = ThresholdingAlgo(y,lag,threshold,influence);

figure; subplot(2,1,1); hold on;
x = 1:length(y); ix = lag+1:length(y);
area(x(ix),avg(ix)+threshold*dev(ix),'FaceColor',[0.9 0.9 0.9],'EdgeColor','none');
area(x(ix),avg(ix)-threshold*dev(ix),'FaceColor',[1 1 1],'EdgeColor','none');
plot(x(ix),avg(ix),'LineWidth',1,'Color','cyan','LineWidth',1.5);
plot(x(ix),avg(ix)+threshold*dev(ix),'LineWidth',1,'Color','green','LineWidth',1.5);
plot(x(ix),avg(ix)-threshold*dev(ix),'LineWidth',1,'Color','green','LineWidth',1.5);
plot(1:length(y),y,'b');
subplot(2,1,2);
stairs(signals,'r','LineWidth',1.5); ylim([-1.5 1.5]);

R code

ThresholdingAlgo <- function(y,lag,threshold,influence) {
  signals <- rep(0,length(y))
  filteredY <- y[1:lag]
  avgFilter <- NULL
  stdFilter <- NULL
  avgFilter[lag] <- mean(y[1:lag], na.rm=TRUE)
  stdFilter[lag] <- sd(y[1:lag], na.rm=TRUE)
  for (i in (lag+1):length(y)){
    if (abs(y[i]-avgFilter[i-1]) > threshold*stdFilter[i-1]) {
      if (y[i] > avgFilter[i-1]) {
        signals[i] <- 1;
      } else {
        signals[i] <- -1;
      }
      filteredY[i] <- influence*y[i]+(1-influence)*filteredY[i-1]
    } else {
      signals[i] <- 0
      filteredY[i] <- y[i]
    }
    avgFilter[i] <- mean(filteredY[(i-lag):i], na.rm=TRUE)
    stdFilter[i] <- sd(filteredY[(i-lag):i], na.rm=TRUE)
  }
  return(list("signals"=signals,"avgFilter"=avgFilter,"stdFilter"=stdFilter))
}

Example:

# Data
y <- c(1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1)

lag       <- 30
threshold <- 5
influence <- 0

# Run algo with lag = 30, threshold = 5, influence = 0
result <- ThresholdingAlgo(y,lag,threshold,influence)

# Plot result
par(mfrow = c(2,1),oma = c(2,2,0,0) + 0.1,mar = c(0,0,2,1) + 0.2)
plot(1:length(y),y,type="l",ylab="",xlab="") 
lines(1:length(y),result$avgFilter,type="l",col="cyan",lwd=2)
lines(1:length(y),result$avgFilter+threshold*result$stdFilter,type="l",col="green",lwd=2)
lines(1:length(y),result$avgFilter-threshold*result$stdFilter,type="l",col="green",lwd=2)
plot(result$signals,type="S",col="red",ylab="",xlab="",ylim=c(-1.5,1.5),lwd=2)

This code (both languages) will yield the following result for the data of the original question:

Thresholding example from Matlab code


Appendix 2 to original answer: Matlab demonstration code

(click to create data)

Matlab demo

function [] = RobustThresholdingDemo()

%% SPECIFICATIONS
lag         = 5;       % lag for the smoothing
threshold   = 3.5;     % number of st.dev. away from the mean to signal
influence   = 0.3;     % when signal: how much influence for new data? (between 0 and 1)
                       % 1 is normal influence, 0.5 is half      
%% START DEMO
DemoScreen(30,lag,threshold,influence);

end

function [signals,avgFilter,stdFilter] = ThresholdingAlgo(y,lag,threshold,influence)
signals = zeros(length(y),1);
filteredY = y(1:lag+1);
avgFilter(lag+1,1) = mean(y(1:lag+1));
stdFilter(lag+1,1) = std(y(1:lag+1));
for i=lag+2:length(y)
    if abs(y(i)-avgFilter(i-1)) > threshold*stdFilter(i-1)
        if y(i) > avgFilter(i-1)
            signals(i) = 1;
        else
            signals(i) = -1;
        end
        filteredY(i) = influence*y(i)+(1-influence)*filteredY(i-1);
    else
        signals(i) = 0;
        filteredY(i) = y(i);
    end
    avgFilter(i) = mean(filteredY(i-lag:i));
    stdFilter(i) = std(filteredY(i-lag:i));
end
end

% Demo screen function
function [] = DemoScreen(n,lag,threshold,influence)
figure('Position',[200 100,1000,500]);
subplot(2,1,1);
title(sprintf(['Draw data points (%.0f max)      [settings: lag = %.0f, '...
    'threshold = %.2f, influence = %.2f]'],n,lag,threshold,influence));
ylim([0 5]); xlim([0 50]);
H = gca; subplot(2,1,1);
set(H, 'YLimMode', 'manual'); set(H, 'XLimMode', 'manual');
set(H, 'YLim', get(H,'YLim')); set(H, 'XLim', get(H,'XLim'));
xg = []; yg = [];
for i=1:n
    try
        [xi,yi] = ginput(1);
    catch
        return;
    end
    xg = [xg xi]; yg = [yg yi];
    if i == 1
        subplot(2,1,1); hold on;
        plot(H, xg(i),yg(i),'r.'); 
        text(xg(i),yg(i),num2str(i),'FontSize',7);
    end
    if length(xg) > lag
        [signals,avg,dev] = ...
            ThresholdingAlgo(yg,lag,threshold,influence);
        area(xg(lag+1:end),avg(lag+1:end)+threshold*dev(lag+1:end),...
            'FaceColor',[0.9 0.9 0.9],'EdgeColor','none');
        area(xg(lag+1:end),avg(lag+1:end)-threshold*dev(lag+1:end),...
            'FaceColor',[1 1 1],'EdgeColor','none');
        plot(xg(lag+1:end),avg(lag+1:end),'LineWidth',1,'Color','cyan');
        plot(xg(lag+1:end),avg(lag+1:end)+threshold*dev(lag+1:end),...
            'LineWidth',1,'Color','green');
        plot(xg(lag+1:end),avg(lag+1:end)-threshold*dev(lag+1:end),...
            'LineWidth',1,'Color','green');
        subplot(2,1,2); hold on; title('Signal output');
        stairs(xg(lag+1:end),signals(lag+1:end),'LineWidth',2,'Color','blue');
        ylim([-2 2]); xlim([0 50]); hold off;
    end
    subplot(2,1,1); hold on;
    for j=2:i
        plot(xg([j-1:j]),yg([j-1:j]),'r'); plot(H,xg(j),yg(j),'r.');
        text(xg(j),yg(j),num2str(j),'FontSize',7);
    end
end
end

gibbone
  • 2,300
  • 20
  • 20
Jean-Paul
  • 19,910
  • 9
  • 62
  • 88
11

Found another algorithm by Palshikar (2009) in:

Palshikar, G. (2009). Simple algorithms for peak detection in time-series. In Proc. 1st Int. Conf. Advanced Data Analysis, Business Analytics and Intelligence (Vol. 122).

Paper can be downloaded from here.

The algorithm goes like this:

algorithm peak1 // one peak detection algorithms that uses peak function S1 

input T = x1, x2, …, xN, N // input time-series of N points 
input k // window size around the peak 
input h // typically 1 <= h <= 3 
output O // set of peaks detected in T 

begin 
O = empty set // initially empty 

    for (i = 1; i < n; i++) do
        // compute peak function value for each of the N points in T 
        a[i] = S1(k,i,xi,T); 
    end for 

    Compute the mean m' and standard deviation s' of all positive values in array a; 

    for (i = 1; i < n; i++) do // remove local peaks which are “small” in global context 
        if (a[i] > 0 && (a[i] – m') >( h * s')) then O = O + {xi}; 
        end if 
    end for 

    Order peaks in O in terms of increasing index in T 

    // retain only one peak out of any set of peaks within distance k of each other 

    for every adjacent pair of peaks xi and xj in O do 
        if |j – i| <= k then remove the smaller value of {xi, xj} from O 
        end if 
    end for 
end

Advantages

  • The paper provides 5 different algorithms for peak detection
  • The algorithms work on the raw time-series data (no smoothing needed)

Disadvantages

  • Difficult to determine k and h beforehand
  • Peaks cannot be flat (like the third peak in my test data)

Example:

enter image description here

Jean-Paul
  • 19,910
  • 9
  • 62
  • 88
  • Actually interesting paper. S4 seems like a better function to use in his opinion. But more importantly is to clarify when k – daniels_pa Nov 15 '17 at 17:30
10

Here is a C++ implementation of the smoothed z-score algorithm from this answer

std::vector<int> smoothedZScore(std::vector<float> input)
{   
    //lag 5 for the smoothing functions
    int lag = 5;
    //3.5 standard deviations for signal
    float threshold = 3.5;
    //between 0 and 1, where 1 is normal influence, 0.5 is half
    float influence = .5;

    if (input.size() <= lag + 2)
    {
        std::vector<int> emptyVec;
        return emptyVec;
    }

    //Initialise variables
    std::vector<int> signals(input.size(), 0.0);
    std::vector<float> filteredY(input.size(), 0.0);
    std::vector<float> avgFilter(input.size(), 0.0);
    std::vector<float> stdFilter(input.size(), 0.0);
    std::vector<float> subVecStart(input.begin(), input.begin() + lag);
    avgFilter[lag] = mean(subVecStart);
    stdFilter[lag] = stdDev(subVecStart);

    for (size_t i = lag + 1; i < input.size(); i++)
    {
        if (std::abs(input[i] - avgFilter[i - 1]) > threshold * stdFilter[i - 1])
        {
            if (input[i] > avgFilter[i - 1])
            {
                signals[i] = 1; //# Positive signal
            }
            else
            {
                signals[i] = -1; //# Negative signal
            }
            //Make influence lower
            filteredY[i] = influence* input[i] + (1 - influence) * filteredY[i - 1];
        }
        else
        {
            signals[i] = 0; //# No signal
            filteredY[i] = input[i];
        }
        //Adjust the filters
        std::vector<float> subVec(filteredY.begin() + i - lag, filteredY.begin() + i);
        avgFilter[i] = mean(subVec);
        stdFilter[i] = stdDev(subVec);
    }
    return signals;
}
brad
  • 930
  • 9
  • 22
  • 3
    Caveat: This implementation does not actually provide a method to calculate the mean and standard deviation. For C++11, an easy method can be found here: https://stackoverflow.com/a/12405793/3250829 – rayryeng Nov 12 '17 at 05:24
  • 1
    Something is wrong with this implementation and it should not be used. I didn't do a full analysis but in porting the demo app to C++, this implementation yielded incorrect results. Removal of the intermediate subVec caused the results to be correct. So be warned. There is a fixed implementation snippet [here](https://gist.github.com/JC3/9f4fcb38b24a7cb5cfc71d44a09a3138) (Qt-based). – Jason C Nov 12 '22 at 17:33
10

Following on from @Jean-Paul's proposed solution, I have implemented his algorithm in C#

public class ZScoreOutput
{
    public List<double> input;
    public List<int> signals;
    public List<double> avgFilter;
    public List<double> filtered_stddev;
}

public static class ZScore
{
    public static ZScoreOutput StartAlgo(List<double> input, int lag, double threshold, double influence)
    {
        // init variables!
        int[] signals = new int[input.Count];
        double[] filteredY = new List<double>(input).ToArray();
        double[] avgFilter = new double[input.Count];
        double[] stdFilter = new double[input.Count];

        var initialWindow = new List<double>(filteredY).Skip(0).Take(lag).ToList();

        avgFilter[lag - 1] = Mean(initialWindow);
        stdFilter[lag - 1] = StdDev(initialWindow);

        for (int i = lag; i < input.Count; i++)
        {
            if (Math.Abs(input[i] - avgFilter[i - 1]) > threshold * stdFilter[i - 1])
            {
                signals[i] = (input[i] > avgFilter[i - 1]) ? 1 : -1;
                filteredY[i] = influence * input[i] + (1 - influence) * filteredY[i - 1];
            }
            else
            {
                signals[i] = 0;
                filteredY[i] = input[i];
            }

            // Update rolling average and deviation
            var slidingWindow = new List<double>(filteredY).Skip(i - lag).Take(lag+1).ToList();

            var tmpMean = Mean(slidingWindow);
            var tmpStdDev = StdDev(slidingWindow);

            avgFilter[i] = Mean(slidingWindow);
            stdFilter[i] = StdDev(slidingWindow);
        }

        // Copy to convenience class 
        var result = new ZScoreOutput();
        result.input = input;
        result.avgFilter       = new List<double>(avgFilter);
        result.signals         = new List<int>(signals);
        result.filtered_stddev = new List<double>(stdFilter);

        return result;
    }

    private static double Mean(List<double> list)
    {
        // Simple helper function! 
        return list.Average();
    }

    private static double StdDev(List<double> values)
    {
        double ret = 0;
        if (values.Count() > 0)
        {
            double avg = values.Average();
            double sum = values.Sum(d => Math.Pow(d - avg, 2));
            ret = Math.Sqrt((sum) / (values.Count() - 1));
        }
        return ret;
    }
}

Example usage:

var input = new List<double> {1.0, 1.0, 1.1, 1.0, 0.9, 1.0, 1.0, 1.1, 1.0, 0.9, 1.0,
    1.1, 1.0, 1.0, 0.9, 1.0, 1.0, 1.1, 1.0, 1.0, 1.0, 1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 1.0, 0.9,
    1.0, 1.1, 1.0, 1.0, 1.1, 1.0, 0.8, 0.9, 1.0, 1.2, 0.9, 1.0, 1.0, 1.1, 1.2, 1.0, 1.5, 1.0,
    3.0, 2.0, 5.0, 3.0, 2.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 3.0, 2.6, 4.0, 3.0, 3.2, 2.0, 1.0,
    1.0, 0.8, 4.0, 4.0, 2.0, 2.5, 1.0, 1.0, 1.0};

int lag = 30;
double threshold = 5.0;
double influence = 0.0;

var output = ZScore.StartAlgo(input, lag, threshold, influence);
user276648
  • 6,018
  • 6
  • 60
  • 86
Ocean Airdrop
  • 2,793
  • 1
  • 26
  • 31
  • 2
    Hi, I think there is an error in that code, in the method StdDev you take values.Count()-1, should there rely be -1? I think you would want the number of items and that is what you get from values.Count(). – Viktor Jul 12 '19 at 08:54
  • 3
    Hmm.. Good spot. Although I originally ported the algorithm to C#, I never ended up using it. I would probably replace that whole function with a call to the nuget library MathNet. "Install-Package MathNet.Numerics" It has prebuilt functions for PopulationStandardDeviation() and StandardDeviation(); eg. var populationStdDev = new List(1,2,3,4).PopulationStandardDeviation(); var sampleStdDev = new List(1,2,3,4).StandardDeviation(); – Ocean Airdrop Jul 12 '19 at 15:05
10

Here's a C implementation of @Jean-Paul's Smoothed Z-score for the Arduino microcontroller used to take accelerometer readings and decide whether the direction of an impact has come from the left or the right. This performs really well since this device returns a bounced signal. Here's this input to this peak detection algorithm from the device - showing an impact from the right followed by and impact from the left. You can see the initial spike then the oscillation of the sensor.

enter image description here

#include <stdio.h>
#include <math.h>
#include <string.h>


#define SAMPLE_LENGTH 1000

float stddev(float data[], int len);
float mean(float data[], int len);
void thresholding(float y[], int signals[], int lag, float threshold, float influence);


void thresholding(float y[], int signals[], int lag, float threshold, float influence) {
    memset(signals, 0, sizeof(int) * SAMPLE_LENGTH);
    float filteredY[SAMPLE_LENGTH];
    memcpy(filteredY, y, sizeof(float) * SAMPLE_LENGTH);
    float avgFilter[SAMPLE_LENGTH];
    float stdFilter[SAMPLE_LENGTH];

    avgFilter[lag - 1] = mean(y, lag);
    stdFilter[lag - 1] = stddev(y, lag);

    for (int i = lag; i < SAMPLE_LENGTH; i++) {
        if (fabsf(y[i] - avgFilter[i-1]) > threshold * stdFilter[i-1]) {
            if (y[i] > avgFilter[i-1]) {
                signals[i] = 1;
            } else {
                signals[i] = -1;
            }
            filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i-1];
        } else {
            signals[i] = 0;
        }
        avgFilter[i] = mean(filteredY + i-lag, lag);
        stdFilter[i] = stddev(filteredY + i-lag, lag);
    }
}

float mean(float data[], int len) {
    float sum = 0.0, mean = 0.0;

    int i;
    for(i=0; i<len; ++i) {
        sum += data[i];
    }

    mean = sum/len;
    return mean;


}

float stddev(float data[], int len) {
    float the_mean = mean(data, len);
    float standardDeviation = 0.0;

    int i;
    for(i=0; i<len; ++i) {
        standardDeviation += pow(data[i] - the_mean, 2);
    }

    return sqrt(standardDeviation/len);
}

int main() {
    printf("Hello, World!\n");
    int lag = 100;
    float threshold = 5;
    float influence = 0;
    float y[]=  {1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
  ....
1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1}

    int signal[SAMPLE_LENGTH];

    thresholding(y, signal,  lag, threshold, influence);

    return 0;
}

Hers's the result with influence = 0

enter image description here

Not great but here with influence = 1

enter image description here

which is very good.

DavidC
  • 1,842
  • 1
  • 18
  • 32
  • Hi, this is a comment I composed over a year ago, but didn't have enough points to post... I'm not still 100% familiar with my past observations, but here it goes. If I doesn't make much sense, I will re-test it. The comment was: "I suspect that the current implementation does not take into account the immediately prior value for the average and stddev filters. For example, with lag = 5, for i = 6, the average of [0,4] (inclusive) is used instead of [1,5] (or perhaps [0,5]?). I would suggest changing '(filteredY + i-lag, lag)' to '(filteredY + i-lag + 1, lag)'". – gkatev Jul 05 '20 at 12:03
  • In the first line of `thresholding` function, you should be considering the size of an int. So instead of `memset(signals, 0, sizeof(float) * SAMPLE_LENGTH)`, the correct code is `memset(signals, 0, sizeof(int) * SAMPLE_LENGTH)`. – Aleksai Losey Oct 14 '20 at 17:10
9

Here is an implementation of the Smoothed z-score algorithm (above) in Golang. It assumes a slice of []int16 (PCM 16bit samples). You can find a gist here.

/*
Settings (the ones below are examples: choose what is best for your data)
set lag to 5;          # lag 5 for the smoothing functions
set threshold to 3.5;  # 3.5 standard deviations for signal
set influence to 0.5;  # between 0 and 1, where 1 is normal influence, 0.5 is half
*/

// ZScore on 16bit WAV samples
func ZScore(samples []int16, lag int, threshold float64, influence float64) (signals []int16) {
    //lag := 20
    //threshold := 3.5
    //influence := 0.5

    signals = make([]int16, len(samples))
    filteredY := make([]int16, len(samples))
    for i, sample := range samples[0:lag] {
        filteredY[i] = sample
    }
    avgFilter := make([]int16, len(samples))
    stdFilter := make([]int16, len(samples))

    avgFilter[lag] = Average(samples[0:lag])
    stdFilter[lag] = Std(samples[0:lag])

    for i := lag + 1; i < len(samples); i++ {

        f := float64(samples[i])

        if float64(Abs(samples[i]-avgFilter[i-1])) > threshold*float64(stdFilter[i-1]) {
            if samples[i] > avgFilter[i-1] {
                signals[i] = 1
            } else {
                signals[i] = -1
            }
            filteredY[i] = int16(influence*f + (1-influence)*float64(filteredY[i-1]))
            avgFilter[i] = Average(filteredY[(i - lag):i])
            stdFilter[i] = Std(filteredY[(i - lag):i])
        } else {
            signals[i] = 0
            filteredY[i] = samples[i]
            avgFilter[i] = Average(filteredY[(i - lag):i])
            stdFilter[i] = Std(filteredY[(i - lag):i])
        }
    }

    return
}

// Average a chunk of values
func Average(chunk []int16) (avg int16) {
    var sum int64
    for _, sample := range chunk {
        if sample < 0 {
            sample *= -1
        }
        sum += int64(sample)
    }
    return int16(sum / int64(len(chunk)))
}
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
8

Here is an actual Java implementation based on the Groovy answer posted earlier. (I know there are already Groovy and Kotlin implementations posted, but for someone like me who's only done Java, it's a real hassle to figure out how to convert between other languages and Java).

(Results match with other people's graphs)

Algorithm implementation

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;

import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

public class SignalDetector {

    public HashMap<String, List> analyzeDataForSignals(List<Double> data, int lag, Double threshold, Double influence) {

        // init stats instance
        SummaryStatistics stats = new SummaryStatistics();

        // the results (peaks, 1 or -1) of our algorithm
        List<Integer> signals = new ArrayList<Integer>(Collections.nCopies(data.size(), 0));

        // filter out the signals (peaks) from our original list (using influence arg)
        List<Double> filteredData = new ArrayList<Double>(data);

        // the current average of the rolling window
        List<Double> avgFilter = new ArrayList<Double>(Collections.nCopies(data.size(), 0.0d));

        // the current standard deviation of the rolling window
        List<Double> stdFilter = new ArrayList<Double>(Collections.nCopies(data.size(), 0.0d));

        // init avgFilter and stdFilter
        for (int i = 0; i < lag; i++) {
            stats.addValue(data.get(i));
        }
        avgFilter.set(lag - 1, stats.getMean());
        stdFilter.set(lag - 1, Math.sqrt(stats.getPopulationVariance())); // getStandardDeviation() uses sample variance
        stats.clear();

        // loop input starting at end of rolling window
        for (int i = lag; i < data.size(); i++) {

            // if the distance between the current value and average is enough standard deviations (threshold) away
            if (Math.abs((data.get(i) - avgFilter.get(i - 1))) > threshold * stdFilter.get(i - 1)) {

                // this is a signal (i.e. peak), determine if it is a positive or negative signal
                if (data.get(i) > avgFilter.get(i - 1)) {
                    signals.set(i, 1);
                } else {
                    signals.set(i, -1);
                }

                // filter this signal out using influence
                filteredData.set(i, (influence * data.get(i)) + ((1 - influence) * filteredData.get(i - 1)));
            } else {
                // ensure this signal remains a zero
                signals.set(i, 0);
                // ensure this value is not filtered
                filteredData.set(i, data.get(i));
            }

            // update rolling average and deviation
            for (int j = i - lag; j < i; j++) {
                stats.addValue(filteredData.get(j));
            }
            avgFilter.set(i, stats.getMean());
            stdFilter.set(i, Math.sqrt(stats.getPopulationVariance()));
            stats.clear();
        }

        HashMap<String, List> returnMap = new HashMap<String, List>();
        returnMap.put("signals", signals);
        returnMap.put("filteredData", filteredData);
        returnMap.put("avgFilter", avgFilter);
        returnMap.put("stdFilter", stdFilter);

        return returnMap;

    } // end
}

Main method

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class Main {

    public static void main(String[] args) throws Exception {
        DecimalFormat df = new DecimalFormat("#0.000");

        ArrayList<Double> data = new ArrayList<Double>(Arrays.asList(1d, 1d, 1.1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 0.9d, 1d,
                1.1d, 1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 1d, 1d, 1d, 1.1d, 0.9d, 1d, 1.1d, 1d, 1d, 0.9d, 1d, 1.1d, 1d, 1d,
                1.1d, 1d, 0.8d, 0.9d, 1d, 1.2d, 0.9d, 1d, 1d, 1.1d, 1.2d, 1d, 1.5d, 1d, 3d, 2d, 5d, 3d, 2d, 1d, 1d, 1d,
                0.9d, 1d, 1d, 3d, 2.6d, 4d, 3d, 3.2d, 2d, 1d, 1d, 0.8d, 4d, 4d, 2d, 2.5d, 1d, 1d, 1d));

        SignalDetector signalDetector = new SignalDetector();
        int lag = 30;
        double threshold = 5;
        double influence = 0;

        HashMap<String, List> resultsMap = signalDetector.analyzeDataForSignals(data, lag, threshold, influence);
        // print algorithm params
        System.out.println("lag: " + lag + "\t\tthreshold: " + threshold + "\t\tinfluence: " + influence);

        System.out.println("Data size: " + data.size());
        System.out.println("Signals size: " + resultsMap.get("signals").size());

        // print data
        System.out.print("Data:\t\t");
        for (double d : data) {
            System.out.print(df.format(d) + "\t");
        }
        System.out.println();

        // print signals
        System.out.print("Signals:\t");
        List<Integer> signalsList = resultsMap.get("signals");
        for (int i : signalsList) {
            System.out.print(df.format(i) + "\t");
        }
        System.out.println();

        // print filtered data
        System.out.print("Filtered Data:\t");
        List<Double> filteredDataList = resultsMap.get("filteredData");
        for (double d : filteredDataList) {
            System.out.print(df.format(d) + "\t");
        }
        System.out.println();

        // print running average
        System.out.print("Avg Filter:\t");
        List<Double> avgFilterList = resultsMap.get("avgFilter");
        for (double d : avgFilterList) {
            System.out.print(df.format(d) + "\t");
        }
        System.out.println();

        // print running std
        System.out.print("Std filter:\t");
        List<Double> stdFilterList = resultsMap.get("stdFilter");
        for (double d : stdFilterList) {
            System.out.print(df.format(d) + "\t");
        }
        System.out.println();

        System.out.println();
        for (int i = 0; i < signalsList.size(); i++) {
            if (signalsList.get(i) != 0) {
                System.out.println("Point " + i + " gave signal " + signalsList.get(i));
            }
        }
    }
}

Results

lag: 30     threshold: 5.0      influence: 0.0
Data size: 74
Signals size: 74
Data:           1.000   1.000   1.100   1.000   0.900   1.000   1.000   1.100   1.000   0.900   1.000   1.100   1.000   1.000   0.900   1.000   1.000   1.100   1.000   1.000   1.000   1.000   1.100   0.900   1.000   1.100   1.000   1.000   0.900   1.000   1.100   1.000   1.000   1.100   1.000   0.800   0.900   1.000   1.200   0.900   1.000   1.000   1.100   1.200   1.000   1.500   1.000   3.000   2.000   5.000   3.000   2.000   1.000   1.000   1.000   0.900   1.000   1.000   3.000   2.600   4.000   3.000   3.200   2.000   1.000   1.000   0.800   4.000   4.000   2.000   2.500   1.000   1.000   1.000   
Signals:        0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   1.000   1.000   1.000   1.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000   1.000   1.000   1.000   1.000   1.000   0.000   0.000   0.000   1.000   1.000   1.000   1.000   0.000   0.000   0.000   
Filtered Data:  1.000   1.000   1.100   1.000   0.900   1.000   1.000   1.100   1.000   0.900   1.000   1.100   1.000   1.000   0.900   1.000   1.000   1.100   1.000   1.000   1.000   1.000   1.100   0.900   1.000   1.100   1.000   1.000   0.900   1.000   1.100   1.000   1.000   1.100   1.000   0.800   0.900   1.000   1.200   0.900   1.000   1.000   1.100   1.200   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   0.900   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   0.800   0.800   0.800   0.800   0.800   1.000   1.000   1.000   
Avg Filter:     0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.003   1.003   1.007   1.007   1.003   1.007   1.010   1.003   1.000   0.997   1.003   1.003   1.003   1.000   1.003   1.010   1.013   1.013   1.013   1.010   1.010   1.010   1.010   1.010   1.007   1.010   1.010   1.003   1.003   1.003   1.007   1.007   1.003   1.003   1.003   1.000   1.000   1.007   1.003   0.997   0.983   0.980   0.973   0.973   0.970   
Std filter:     0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.060   0.060   0.063   0.063   0.060   0.063   0.060   0.071   0.073   0.071   0.080   0.080   0.080   0.077   0.080   0.087   0.085   0.085   0.085   0.083   0.083   0.083   0.083   0.083   0.081   0.079   0.079   0.080   0.080   0.080   0.077   0.077   0.075   0.075   0.075   0.073   0.073   0.063   0.071   0.080   0.078   0.083   0.089   0.089   0.086   

Point 45 gave signal 1
Point 47 gave signal 1
Point 48 gave signal 1
Point 49 gave signal 1
Point 50 gave signal 1
Point 51 gave signal 1
Point 58 gave signal 1
Point 59 gave signal 1
Point 60 gave signal 1
Point 61 gave signal 1
Point 62 gave signal 1
Point 63 gave signal 1
Point 67 gave signal 1
Point 68 gave signal 1
Point 69 gave signal 1
Point 70 gave signal 1

Graphs showing data and results of java execution

takanuva15
  • 1,286
  • 1
  • 15
  • 27
  • 1
    What about when you add data not as a list just add one by one for streaming data? – C.T Apr 17 '21 at 19:07
  • 1
    @C.T I haven't tested it out, but it looks like you'll have to run the stuff in the `for (int i = lag...` loop each time you get a new value. You can see [delica's answer](https://stackoverflow.com/a/56451135/6854489) for an example of real-time streaming in Python for inspiration. – takanuva15 Apr 19 '21 at 12:31
7

C++ Implementation

#include <iostream>
#include <vector>
#include <algorithm>
#include <unordered_map>
#include <cmath>
#include <iterator>
#include <numeric>

using namespace std;

typedef long double ld;
typedef unsigned int uint;
typedef std::vector<ld>::iterator vec_iter_ld;

/**
 * Overriding the ostream operator for pretty printing vectors.
 */
template<typename T>
std::ostream &operator<<(std::ostream &os, std::vector<T> vec) {
    os << "[";
    if (vec.size() != 0) {
        std::copy(vec.begin(), vec.end() - 1, std::ostream_iterator<T>(os, " "));
        os << vec.back();
    }
    os << "]";
    return os;
}

/**
 * This class calculates mean and standard deviation of a subvector.
 * This is basically stats computation of a subvector of a window size qual to "lag".
 */
class VectorStats {
public:
    /**
     * Constructor for VectorStats class.
     *
     * @param start - This is the iterator position of the start of the window,
     * @param end   - This is the iterator position of the end of the window,
     */
    VectorStats(vec_iter_ld start, vec_iter_ld end) {
        this->start = start;
        this->end = end;
        this->compute();
    }

    /**
     * This method calculates the mean and standard deviation using STL function.
     * This is the Two-Pass implementation of the Mean & Variance calculation.
     */
    void compute() {
        ld sum = std::accumulate(start, end, 0.0);
        uint slice_size = std::distance(start, end);
        ld mean = sum / slice_size;
        std::vector<ld> diff(slice_size);
        std::transform(start, end, diff.begin(), [mean](ld x) { return x - mean; });
        ld sq_sum = std::inner_product(diff.begin(), diff.end(), diff.begin(), 0.0);
        ld std_dev = std::sqrt(sq_sum / slice_size);

        this->m1 = mean;
        this->m2 = std_dev;
    }

    ld mean() {
        return m1;
    }

    ld standard_deviation() {
        return m2;
    }

private:
    vec_iter_ld start;
    vec_iter_ld end;
    ld m1;
    ld m2;
};

/**
 * This is the implementation of the Smoothed Z-Score Algorithm.
 * This is direction translation of https://stackoverflow.com/a/22640362/1461896.
 *
 * @param input - input signal
 * @param lag - the lag of the moving window
 * @param threshold - the z-score at which the algorithm signals
 * @param influence - the influence (between 0 and 1) of new signals on the mean and standard deviation
 * @return a hashmap containing the filtered signal and corresponding mean and standard deviation.
 */
unordered_map<string, vector<ld>> z_score_thresholding(vector<ld> input, int lag, ld threshold, ld influence) {
    unordered_map<string, vector<ld>> output;

    uint n = (uint) input.size();
    vector<ld> signals(input.size());
    vector<ld> filtered_input(input.begin(), input.end());
    vector<ld> filtered_mean(input.size());
    vector<ld> filtered_stddev(input.size());

    VectorStats lag_subvector_stats(input.begin(), input.begin() + lag);
    filtered_mean[lag - 1] = lag_subvector_stats.mean();
    filtered_stddev[lag - 1] = lag_subvector_stats.standard_deviation();

    for (int i = lag; i < n; i++) {
        if (abs(input[i] - filtered_mean[i - 1]) > threshold * filtered_stddev[i - 1]) {
            signals[i] = (input[i] > filtered_mean[i - 1]) ? 1.0 : -1.0;
            filtered_input[i] = influence * input[i] + (1 - influence) * filtered_input[i - 1];
        } else {
            signals[i] = 0.0;
            filtered_input[i] = input[i];
        }
        VectorStats lag_subvector_stats(filtered_input.begin() + (i - lag), filtered_input.begin() + i);
        filtered_mean[i] = lag_subvector_stats.mean();
        filtered_stddev[i] = lag_subvector_stats.standard_deviation();
    }

    output["signals"] = signals;
    output["filtered_mean"] = filtered_mean;
    output["filtered_stddev"] = filtered_stddev;

    return output;
};

int main() {
    vector<ld> input = {1.0, 1.0, 1.1, 1.0, 0.9, 1.0, 1.0, 1.1, 1.0, 0.9, 1.0, 1.1, 1.0, 1.0, 0.9, 1.0, 1.0, 1.1, 1.0,
                        1.0, 1.0, 1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 1.0, 0.9, 1.0, 1.1, 1.0, 1.0, 1.1, 1.0, 0.8, 0.9, 1.0,
                        1.2, 0.9, 1.0, 1.0, 1.1, 1.2, 1.0, 1.5, 1.0, 3.0, 2.0, 5.0, 3.0, 2.0, 1.0, 1.0, 1.0, 0.9, 1.0,
                        1.0, 3.0, 2.6, 4.0, 3.0, 3.2, 2.0, 1.0, 1.0, 0.8, 4.0, 4.0, 2.0, 2.5, 1.0, 1.0, 1.0};

    int lag = 30;
    ld threshold = 5.0;
    ld influence = 0.0;
    unordered_map<string, vector<ld>> output = z_score_thresholding(input, lag, threshold, influence);
    cout << output["signals"] << endl;
}
Lance
  • 399
  • 1
  • 5
  • 14
Animesh Pandey
  • 5,900
  • 13
  • 64
  • 130
6

This problem looks similar to one I encountered in a hybrid/embedded systems course, but that was related to detecting faults when the input from a sensor is noisy. We used a Kalman filter to estimate/predict the hidden state of the system, then used statistical analysis to determine the likelihood that a fault had occurred. We were working with linear systems, but nonlinear variants exist. I remember the approach being surprisingly adaptive, but it required a model of the dynamics of the system.

Peter G
  • 1,613
  • 10
  • 10
  • The Kalman filter is interesting, but I can't seem to find an applicable algorithm for my purpose. I highly appreciate the answer though and I will look into some peak detection papers [like this one](http://cdn.intechopen.com/pdfs-wm/9742.pdf) to see if I can learn from any of the algorithms. Thanks! – Jean-Paul Mar 31 '14 at 21:31
6

Thought I would provide my Julia implementation of the algorithm for others. The gist can be found here

using Statistics
using Plots
function SmoothedZscoreAlgo(y, lag, threshold, influence)
    # Julia implimentation of http://stackoverflow.com/a/22640362/6029703
    n = length(y)
    signals = zeros(n) # init signal results
    filteredY = copy(y) # init filtered series
    avgFilter = zeros(n) # init average filter
    stdFilter = zeros(n) # init std filter
    avgFilter[lag - 1] = mean(y[1:lag]) # init first value
    stdFilter[lag - 1] = std(y[1:lag]) # init first value

    for i in range(lag, stop=n-1)
        if abs(y[i] - avgFilter[i-1]) > threshold*stdFilter[i-1]
            if y[i] > avgFilter[i-1]
                signals[i] += 1 # postive signal
            else
                signals[i] += -1 # negative signal
            end
            # Make influence lower
            filteredY[i] = influence*y[i] + (1-influence)*filteredY[i-1]
        else
            signals[i] = 0
            filteredY[i] = y[i]
        end
        avgFilter[i] = mean(filteredY[i-lag+1:i])
        stdFilter[i] = std(filteredY[i-lag+1:i])
    end
    return (signals = signals, avgFilter = avgFilter, stdFilter = stdFilter)
end


# Data
y = [1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1]

# Settings: lag = 30, threshold = 5, influence = 0
lag = 30
threshold = 5
influence = 0

results = SmoothedZscoreAlgo(y, lag, threshold, influence)
upper_bound = results[:avgFilter] + threshold * results[:stdFilter]
lower_bound = results[:avgFilter] - threshold * results[:stdFilter]
x = 1:length(y)

yplot = plot(x,y,color="blue", label="Y",legend=:topleft)
yplot = plot!(x,upper_bound, color="green", label="Upper Bound",legend=:topleft)
yplot = plot!(x,results[:avgFilter], color="cyan", label="Average Filter",legend=:topleft)
yplot = plot!(x,lower_bound, color="green", label="Lower Bound",legend=:topleft)
signalplot = plot(x,results[:signals],color="red",label="Signals",legend=:topleft)
plot(yplot,signalplot,layout=(2,1),legend=:topleft)

Results

Matt Camp
  • 1,448
  • 3
  • 17
  • 38
5

Here is my attempt at creating a Ruby solution for the "Smoothed z-score algo" from the accepted answer:

module ThresholdingAlgoMixin
  def mean(array)
    array.reduce(&:+) / array.size.to_f
  end

  def stddev(array)
    array_mean = mean(array)
    Math.sqrt(array.reduce(0.0) { |a, b| a.to_f + ((b.to_f - array_mean) ** 2) } / array.size.to_f)
  end

  def thresholding_algo(lag: 5, threshold: 3.5, influence: 0.5)
    return nil if size < lag * 2
    Array.new(size, 0).tap do |signals|
      filtered = Array.new(self)

      initial_slice = take(lag)
      avg_filter = Array.new(lag - 1, 0.0) + [mean(initial_slice)]
      std_filter = Array.new(lag - 1, 0.0) + [stddev(initial_slice)]
      (lag..size-1).each do |idx|
        prev = idx - 1
        if (fetch(idx) - avg_filter[prev]).abs > threshold * std_filter[prev]
          signals[idx] = fetch(idx) > avg_filter[prev] ? 1 : -1
          filtered[idx] = (influence * fetch(idx)) + ((1-influence) * filtered[prev])
        end

        filtered_slice = filtered[idx-lag..prev]
        avg_filter[idx] = mean(filtered_slice)
        std_filter[idx] = stddev(filtered_slice)
      end
    end
  end
end

And example usage:

test_data = [
  1, 1, 1.1, 1, 0.9, 1, 1, 1.1, 1, 0.9, 1, 1.1, 1, 1, 0.9, 1,
  1, 1.1, 1, 1, 1, 1, 1.1, 0.9, 1, 1.1, 1, 1, 0.9, 1, 1.1, 1,
  1, 1.1, 1, 0.8, 0.9, 1, 1.2, 0.9, 1, 1, 1.1, 1.2, 1, 1.5,
  1, 3, 2, 5, 3, 2, 1, 1, 1, 0.9, 1, 1, 3, 2.6, 4, 3, 3.2, 2,
  1, 1, 0.8, 4, 4, 2, 2.5, 1, 1, 1
].extend(ThresholdingAlgoMixin)

puts test_data.thresholding_algo.inspect

# Output: [
#   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0,
#   0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
#   1, 1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0
# ]
Kimmo Lehto
  • 5,910
  • 1
  • 23
  • 32
4

Here is an altered Fortran version of the z-score algorithm. It is altered specifically for peak (resonance) detection in transfer functions in frequency space (Each change has a small comment in code).

The first modification gives a warning to the user if there is a resonance near the lower bound of the input vector, indicated by a standard deviation higher than a certain threshold (10% in this case). This simply means the signal is not flat enough for the detection initializing the filters properly.

The second modification is that only the highest value of a peak is added to the found peaks. This is reached by comparing each found peak value to the magnitude of its (lag) predecessors and its (lag) successors.

The third change is to respect that resonance peaks usually show some form of symmetry around the resonance frequency. So it is natural to calculate the mean and std symmetrically around the current data point (rather than just for the predecessors). This results in a better peak detection behavior.

The modifications have the effect that the whole signal has to be known to the function beforehand which is the usual case for resonance detection (something like the Matlab Example of Jean-Paul where the data points are generated on the fly won't work).

function PeakDetect(y,lag,threshold, influence)
    implicit none
    ! Declaring part
    real, dimension(:), intent(in) :: y
    integer, dimension(size(y)) :: PeakDetect
    real, dimension(size(y)) :: filteredY, avgFilter, stdFilter
    integer :: lag, ii
    real :: threshold, influence

    ! Executing part
    PeakDetect = 0
    filteredY = 0.0
    filteredY(1:lag+1) = y(1:lag+1)
    avgFilter = 0.0
    avgFilter(lag+1) = mean(y(1:2*lag+1))
    stdFilter = 0.0
    stdFilter(lag+1) = std(y(1:2*lag+1))

    if (stdFilter(lag+1)/avgFilter(lag+1)>0.1) then ! If the coefficient of variation exceeds 10%, the signal is too uneven at the start, possibly because of a peak.
        write(unit=*,fmt=1001)
1001        format(1X,'Warning: Peak detection might have failed, as there may be a peak at the edge of the frequency range.',/)
    end if
    do ii = lag+2, size(y)
        if (abs(y(ii) - avgFilter(ii-1)) > threshold * stdFilter(ii-1)) then
            ! Find only the largest outstanding value which is only the one greater than its predecessor and its successor
            if (y(ii) > avgFilter(ii-1) .AND. y(ii) > y(ii-1) .AND. y(ii) > y(ii+1)) then
                PeakDetect(ii) = 1
            end if
            filteredY(ii) = influence * y(ii) + (1 - influence) * filteredY(ii-1)
        else
            filteredY(ii) = y(ii)
        end if
        ! Modified with respect to the original code. Mean and standard deviation are calculted symmetrically around the current point
        avgFilter(ii) = mean(filteredY(ii-lag:ii+lag))
        stdFilter(ii) = std(filteredY(ii-lag:ii+lag))
    end do
end function PeakDetect

real function mean(y)
    !> @brief Calculates the mean of vector y
    implicit none
    ! Declaring part
    real, dimension(:), intent(in) :: y
    integer :: N
    ! Executing part
    N = max(1,size(y))
    mean = sum(y)/N
end function mean

real function std(y)
    !> @brief Calculates the standard deviation of vector y
    implicit none
    ! Declaring part
    real, dimension(:), intent(in) :: y
    integer :: N
    ! Executing part
    N = max(1,size(y))
    std = sqrt((N*dot_product(y,y) - sum(y)**2) / (N*(N-1)))
end function std

For my application the algorithm works like a charm! enter image description here

THo
  • 101
  • 8
4

An iterative version in python/numpy for answer https://stackoverflow.com/a/22640362/6029703 is here. This code is faster than computing average and standard deviation every lag for large data (100000+).

def peak_detection_smoothed_zscore_v2(x, lag, threshold, influence):
    '''
    iterative smoothed z-score algorithm
    Implementation of algorithm from https://stackoverflow.com/a/22640362/6029703
    '''
    import numpy as np
    labels = np.zeros(len(x))
    filtered_y = np.array(x)
    avg_filter = np.zeros(len(x))
    std_filter = np.zeros(len(x))
    var_filter = np.zeros(len(x))

    avg_filter[lag - 1] = np.mean(x[0:lag])
    std_filter[lag - 1] = np.std(x[0:lag])
    var_filter[lag - 1] = np.var(x[0:lag])
    for i in range(lag, len(x)):
        if abs(x[i] - avg_filter[i - 1]) > threshold * std_filter[i - 1]:
            if x[i] > avg_filter[i - 1]:
                labels[i] = 1
            else:
                labels[i] = -1
            filtered_y[i] = influence * x[i] + (1 - influence) * filtered_y[i - 1]
        else:
            labels[i] = 0
            filtered_y[i] = x[i]
        # update avg, var, std
        avg_filter[i] = avg_filter[i - 1] + 1. / lag * (filtered_y[i] - filtered_y[i - lag])
        var_filter[i] = var_filter[i - 1] + 1. / lag * ((filtered_y[i] - avg_filter[i - 1]) ** 2 - (
            filtered_y[i - lag] - avg_filter[i - 1]) ** 2 - (filtered_y[i] - filtered_y[i - lag]) ** 2 / lag)
        std_filter[i] = np.sqrt(var_filter[i])

    return dict(signals=labels,
                avgFilter=avg_filter,
                stdFilter=std_filter)
Tranfer Will
  • 118
  • 1
  • 5
4

I allowed myself to create a javascript version of it. Might it be helpful. The javascript should be direct transcription of the Pseudocode given above. Available as npm package and github repo:

Javascript translation:

// javascript port of: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/48895639#48895639

function sum(a) {
    return a.reduce((acc, val) => acc + val)
}

function mean(a) {
    return sum(a) / a.length
}

function stddev(arr) {
    const arr_mean = mean(arr)
    const r = function(acc, val) {
        return acc + ((val - arr_mean) * (val - arr_mean))
    }
    return Math.sqrt(arr.reduce(r, 0.0) / arr.length)
}

function smoothed_z_score(y, params) {
    var p = params || {}
    // init cooefficients
    const lag = p.lag || 5
    const threshold = p.threshold || 3.5
    const influence = p.influece || 0.5

    if (y === undefined || y.length < lag + 2) {
        throw ` ## y data array to short(${y.length}) for given lag of ${lag}`
    }
    //console.log(`lag, threshold, influence: ${lag}, ${threshold}, ${influence}`)

    // init variables
    var signals = Array(y.length).fill(0)
    var filteredY = y.slice(0)
    const lead_in = y.slice(0, lag)
    //console.log("1: " + lead_in.toString())

    var avgFilter = []
    avgFilter[lag - 1] = mean(lead_in)
    var stdFilter = []
    stdFilter[lag - 1] = stddev(lead_in)
    //console.log("2: " + stdFilter.toString())

    for (var i = lag; i < y.length; i++) {
        //console.log(`${y[i]}, ${avgFilter[i-1]}, ${threshold}, ${stdFilter[i-1]}`)
        if (Math.abs(y[i] - avgFilter[i - 1]) > (threshold * stdFilter[i - 1])) {
            if (y[i] > avgFilter[i - 1]) {
                signals[i] = +1 // positive signal
            } else {
                signals[i] = -1 // negative signal
            }
            // make influence lower
            filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i - 1]
        } else {
            signals[i] = 0 // no signal
            filteredY[i] = y[i]
        }

        // adjust the filters
        const y_lag = filteredY.slice(i - lag, i)
        avgFilter[i] = mean(y_lag)
        stdFilter[i] = stddev(y_lag)
    }

    return signals
}

module.exports = smoothed_z_score
Community
  • 1
  • 1
  • By now, I have ported some other algorithm to javascript. This time from numercial pyhon, which give me more control and works better for me. Also packaged in npm and you can find more info on the algo from washington state university on their jupyter page the have for it. https://www.npmjs.com/package/@joe_six/duarte-watanabe-peak-detection – Dirk Lüsebrink Oct 24 '19 at 09:53
  • 2
    I believe this code is actually incorrect. `filteredY.slice(i - lag, i)` should be `filteredY.slice(i - lag + 1, i + 1)`, otherwise we will be lagging by 1 period. – Maarten Apr 10 '23 at 14:55
  • umpf, that's bad, but I can check for it, or update that. I am not anywhere even near that code, and would take quite a while reproduce, check, update it. nothing I can do about it at the moment, – Dirk Lüsebrink Apr 11 '23 at 15:18
3

If the boundary value or other criteria depends on future values, then the only solution (without a time-machine, or other knowledge of future values) is to delay any decision until one has sufficient future values. If you want a level above a mean that spans, for example, 20 points, then you have to wait until you have at least 19 points ahead of any peak decision, or else the next new point could completely throw off your threshold 19 points ago.

Added: If the statistical distribution of the peak heights could be heavy tailed, instead of Uniform or Gaussian, then you may need to wait until you see several thousand peaks before it starts to become unlikely that a hidden Pareto distribution won't produce a peak many times larger than any you currently have seen before or have in your current plot. Unless you somehow know in advance that the very next point can't be 1e20, it could appear, which after rescaling your plot's Y dimension, would be flat up until that point.

hotpaw2
  • 70,107
  • 14
  • 90
  • 153
  • Like I said before, we can assume that IF a peak occurs, it is as large as the peaks in the picture and deviates significantly from the 'normal' values. – Jean-Paul Mar 24 '14 at 09:53
  • If you know how large the peaks will be in advance, then pre-set your mean and/or threshold to just under that value. – hotpaw2 Mar 24 '14 at 15:00
  • 1
    And that's exactly what I don't know in advance. – Jean-Paul Mar 24 '14 at 15:25
  • So we can assume that peaks occur above a certain threshold and that they deviate significantly from the 'other values'. Question asks about an algorithm which can identify large deviating peaks from the normal values. – Jean-Paul Mar 24 '14 at 15:27
  • 1
    You just contradicted yourself and wrote that the peaks are known to be the size in the picture. Either you know that or you don't. – hotpaw2 Mar 24 '14 at 15:41
  • 3
    I'm trying to explain it to you. You get the idea now right? 'How to identify significantly large peaks'. You can approach the problem either statistically or with a smart algorithm. With `.. As large as in the picture` I meant: for similar situations where there are significant peaks and basic noise. – Jean-Paul Mar 24 '14 at 15:59
3

Here is a Groovy (Java) implementation of the smoothed z-score algorithm (see answer above).

/**
 * "Smoothed zero-score alogrithm" shamelessly copied from https://stackoverflow.com/a/22640362/6029703
 *  Uses a rolling mean and a rolling deviation (separate) to identify peaks in a vector
 *
 * @param y - The input vector to analyze
 * @param lag - The lag of the moving window (i.e. how big the window is)
 * @param threshold - The z-score at which the algorithm signals (i.e. how many standard deviations away from the moving mean a peak (or signal) is)
 * @param influence - The influence (between 0 and 1) of new signals on the mean and standard deviation (how much a peak (or signal) should affect other values near it)
 * @return - The calculated averages (avgFilter) and deviations (stdFilter), and the signals (signals)
 */

public HashMap<String, List<Object>> thresholdingAlgo(List<Double> y, Long lag, Double threshold, Double influence) {
    //init stats instance
    SummaryStatistics stats = new SummaryStatistics()

    //the results (peaks, 1 or -1) of our algorithm
    List<Integer> signals = new ArrayList<Integer>(Collections.nCopies(y.size(), 0))
    //filter out the signals (peaks) from our original list (using influence arg)
    List<Double> filteredY = new ArrayList<Double>(y)
    //the current average of the rolling window
    List<Double> avgFilter = new ArrayList<Double>(Collections.nCopies(y.size(), 0.0d))
    //the current standard deviation of the rolling window
    List<Double> stdFilter = new ArrayList<Double>(Collections.nCopies(y.size(), 0.0d))
    //init avgFilter and stdFilter
    (0..lag-1).each { stats.addValue(y[it as int]) }
    avgFilter[lag - 1 as int] = stats.getMean()
    stdFilter[lag - 1 as int] = Math.sqrt(stats.getPopulationVariance()) //getStandardDeviation() uses sample variance (not what we want)
    stats.clear()
    //loop input starting at end of rolling window
    (lag..y.size()-1).each { i ->
        //if the distance between the current value and average is enough standard deviations (threshold) away
        if (Math.abs((y[i as int] - avgFilter[i - 1 as int]) as Double) > threshold * stdFilter[i - 1 as int]) {
            //this is a signal (i.e. peak), determine if it is a positive or negative signal
            signals[i as int] = (y[i as int] > avgFilter[i - 1 as int]) ? 1 : -1
            //filter this signal out using influence
            filteredY[i as int] = (influence * y[i as int]) + ((1-influence) * filteredY[i - 1 as int])
        } else {
            //ensure this signal remains a zero
            signals[i as int] = 0
            //ensure this value is not filtered
            filteredY[i as int] = y[i as int]
        }
        //update rolling average and deviation
        (i - lag..i-1).each { stats.addValue(filteredY[it as int] as Double) }
        avgFilter[i as int] = stats.getMean()
        stdFilter[i as int] = Math.sqrt(stats.getPopulationVariance()) //getStandardDeviation() uses sample variance (not what we want)
        stats.clear()
    }

    return [
        signals  : signals,
        avgFilter: avgFilter,
        stdFilter: stdFilter
    ]
}

Below is a test on the same dataset that yields the same results as the above Python / numpy implementation.

    // Data
    def y = [1d, 1d, 1.1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 0.9d, 1d, 1.1d, 1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 1d,
         1d, 1d, 1.1d, 0.9d, 1d, 1.1d, 1d, 1d, 0.9d, 1d, 1.1d, 1d, 1d, 1.1d, 1d, 0.8d, 0.9d, 1d, 1.2d, 0.9d, 1d,
         1d, 1.1d, 1.2d, 1d, 1.5d, 1d, 3d, 2d, 5d, 3d, 2d, 1d, 1d, 1d, 0.9d, 1d,
         1d, 3d, 2.6d, 4d, 3d, 3.2d, 2d, 1d, 1d, 0.8d, 4d, 4d, 2d, 2.5d, 1d, 1d, 1d]

    // Settings
    def lag = 30
    def threshold = 5
    def influence = 0


    def thresholdingResults = thresholdingAlgo((List<Double>) y, (Long) lag, (Double) threshold, (Double) influence)

    println y.size()
    println thresholdingResults.signals.size()
    println thresholdingResults.signals

    thresholdingResults.signals.eachWithIndex { x, idx ->
        if (x) {
            println y[idx]
        }
    }
3

Here is a (non-idiomatic) Scala version of the smoothed z-score algorithm:

/**
  * Smoothed zero-score alogrithm shamelessly copied from https://stackoverflow.com/a/22640362/6029703
  * Uses a rolling mean and a rolling deviation (separate) to identify peaks in a vector
  *
  * @param y - The input vector to analyze
  * @param lag - The lag of the moving window (i.e. how big the window is)
  * @param threshold - The z-score at which the algorithm signals (i.e. how many standard deviations away from the moving mean a peak (or signal) is)
  * @param influence - The influence (between 0 and 1) of new signals on the mean and standard deviation (how much a peak (or signal) should affect other values near it)
  * @return - The calculated averages (avgFilter) and deviations (stdFilter), and the signals (signals)
  */
private def smoothedZScore(y: Seq[Double], lag: Int, threshold: Double, influence: Double): Seq[Int] = {
  val stats = new SummaryStatistics()

  // the results (peaks, 1 or -1) of our algorithm
  val signals = mutable.ArrayBuffer.fill(y.length)(0)

  // filter out the signals (peaks) from our original list (using influence arg)
  val filteredY = y.to[mutable.ArrayBuffer]

  // the current average of the rolling window
  val avgFilter = mutable.ArrayBuffer.fill(y.length)(0d)

  // the current standard deviation of the rolling window
  val stdFilter = mutable.ArrayBuffer.fill(y.length)(0d)

  // init avgFilter and stdFilter
  y.take(lag).foreach(s => stats.addValue(s))

  avgFilter(lag - 1) = stats.getMean
  stdFilter(lag - 1) = Math.sqrt(stats.getPopulationVariance) // getStandardDeviation() uses sample variance (not what we want)

  // loop input starting at end of rolling window
  y.zipWithIndex.slice(lag, y.length - 1).foreach {
    case (s: Double, i: Int) =>
      // if the distance between the current value and average is enough standard deviations (threshold) away
      if (Math.abs(s - avgFilter(i - 1)) > threshold * stdFilter(i - 1)) {
        // this is a signal (i.e. peak), determine if it is a positive or negative signal
        signals(i) = if (s > avgFilter(i - 1)) 1 else -1
        // filter this signal out using influence
        filteredY(i) = (influence * s) + ((1 - influence) * filteredY(i - 1))
      } else {
        // ensure this signal remains a zero
        signals(i) = 0
        // ensure this value is not filtered
        filteredY(i) = s
      }

      // update rolling average and deviation
      stats.clear()
      filteredY.slice(i - lag, i).foreach(s => stats.addValue(s))
      avgFilter(i) = stats.getMean
      stdFilter(i) = Math.sqrt(stats.getPopulationVariance) // getStandardDeviation() uses sample variance (not what we want)
  }

  println(y.length)
  println(signals.length)
  println(signals)

  signals.zipWithIndex.foreach {
    case(x: Int, idx: Int) =>
      if (x == 1) {
        println(idx + " " + y(idx))
      }
  }

  val data =
    y.zipWithIndex.map { case (s: Double, i: Int) => Map("x" -> i, "y" -> s, "name" -> "y", "row" -> "data") } ++
    avgFilter.zipWithIndex.map { case (s: Double, i: Int) => Map("x" -> i, "y" -> s, "name" -> "avgFilter", "row" -> "data") } ++
    avgFilter.zipWithIndex.map { case (s: Double, i: Int) => Map("x" -> i, "y" -> (s - threshold * stdFilter(i)), "name" -> "lower", "row" -> "data") } ++
    avgFilter.zipWithIndex.map { case (s: Double, i: Int) => Map("x" -> i, "y" -> (s + threshold * stdFilter(i)), "name" -> "upper", "row" -> "data") } ++
    signals.zipWithIndex.map { case (s: Int, i: Int) => Map("x" -> i, "y" -> s, "name" -> "signal", "row" -> "signal") }

  Vegas("Smoothed Z")
    .withData(data)
    .mark(Line)
    .encodeX("x", Quant)
    .encodeY("y", Quant)
    .encodeColor(
      field="name",
      dataType=Nominal
    )
    .encodeRow("row", Ordinal)
    .show

  return signals
}

Here's a test that returns the same results as the Python and Groovy versions:

val y = List(1d, 1d, 1.1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 0.9d, 1d, 1.1d, 1d, 1d, 0.9d, 1d, 1d, 1.1d, 1d, 1d,
  1d, 1d, 1.1d, 0.9d, 1d, 1.1d, 1d, 1d, 0.9d, 1d, 1.1d, 1d, 1d, 1.1d, 1d, 0.8d, 0.9d, 1d, 1.2d, 0.9d, 1d,
  1d, 1.1d, 1.2d, 1d, 1.5d, 1d, 3d, 2d, 5d, 3d, 2d, 1d, 1d, 1d, 0.9d, 1d,
  1d, 3d, 2.6d, 4d, 3d, 3.2d, 2d, 1d, 1d, 0.8d, 4d, 4d, 2d, 2.5d, 1d, 1d, 1d)

val lag = 30
val threshold = 5d
val influence = 0d

smoothedZScore(y, lag, threshold, influence)

vegas chart of result

Gist here

Mike Roberts
  • 541
  • 5
  • 13
  • 2
    hello! Thanks for writing the scala version of this! I found a small bug though. It seems you don't need `y.length-1` in the slice() function. It causes the last element to be skipped. https://gist.github.com/ecopoesis/587602a39fddd7001d1ca54a16884417#file-smoothedzscore-scala-L52 . I discovered this by sprinkling log statements everywhere and noticed it. – Matthew Aug 24 '20 at 17:55
  • Thanks for providing this solution @MikeRoberts. Please update to mention that you need to import org.apache.commons.math3.stat.descriptive.SummaryStatistics as an external dependency. – Lars Skaug Jan 06 '21 at 14:19
3

I needed something like this in my android project. Thought I might give back Kotlin implementation.

import org.apache.commons.math3.stat.descriptive.SummaryStatistics
/**
* Smoothed zero-score alogrithm shamelessly copied from https://stackoverflow.com/a/22640362/6029703
* Uses a rolling mean and a rolling deviation (separate) to identify peaks in a vector
*
* @param y - The input vector to analyze
* @param lag - The lag of the moving window (i.e. how big the window is)
* @param threshold - The z-score at which the algorithm signals (i.e. how many standard deviations away from the moving mean a peak (or signal) is)
* @param influence - The influence (between 0 and 1) of new signals on the mean and standard deviation (how much a peak (or signal) should affect other values near it)
* @return - The calculated averages (avgFilter) and deviations (stdFilter), and the signals (signals)
*/
fun smoothedZScore(y: List<Double>, lag: Int, threshold: Double, influence: Double): Triple<List<Int>, List<Double>, List<Double>> {
    val stats = SummaryStatistics()
    // the results (peaks, 1 or -1) of our algorithm
    val signals = MutableList<Int>(y.size, { 0 })
    // filter out the signals (peaks) from our original list (using influence arg)
    val filteredY = ArrayList<Double>(y)
    // the current average of the rolling window
    val avgFilter = MutableList<Double>(y.size, { 0.0 })
    // the current standard deviation of the rolling window
    val stdFilter = MutableList<Double>(y.size, { 0.0 })
    // init avgFilter and stdFilter
    y.take(lag).forEach { s -> stats.addValue(s) }
    avgFilter[lag - 1] = stats.mean
    stdFilter[lag - 1] = Math.sqrt(stats.populationVariance) // getStandardDeviation() uses sample variance (not what we want)
    stats.clear()
    //loop input starting at end of rolling window
    (lag..y.size - 1).forEach { i ->
        //if the distance between the current value and average is enough standard deviations (threshold) away
        if (Math.abs(y[i] - avgFilter[i - 1]) > threshold * stdFilter[i - 1]) {
            //this is a signal (i.e. peak), determine if it is a positive or negative signal
            signals[i] = if (y[i] > avgFilter[i - 1]) 1 else -1
            //filter this signal out using influence
            filteredY[i] = (influence * y[i]) + ((1 - influence) * filteredY[i - 1])
        } else {
            //ensure this signal remains a zero
            signals[i] = 0
            //ensure this value is not filtered
            filteredY[i] = y[i]
        }
        //update rolling average and deviation
        (i - lag..i - 1).forEach { stats.addValue(filteredY[it]) }
        avgFilter[i] = stats.getMean()
        stdFilter[i] = Math.sqrt(stats.getPopulationVariance()) //getStandardDeviation() uses sample variance (not what we want)
        stats.clear()
    }
    return Triple(signals, avgFilter, stdFilter)
}

sample project with verification graphs can be found at github.

enter image description here

Stvad
  • 360
  • 2
  • 17
leonardkraemer
  • 6,573
  • 1
  • 31
  • 54
3

If you have got your data in a database table, here is a SQL version of a simple z-score algorithm:

with data_with_zscore as (
    select
        date_time,
        value,
        value / (avg(value) over ()) as pct_of_mean,
        (value - avg(value) over ()) / (stdev(value) over ()) as z_score
    from {{tablename}}  where datetime > '2018-11-26' and datetime < '2018-12-03'
)


-- select all
select * from data_with_zscore 

-- select only points greater than a certain threshold
select * from data_with_zscore where z_score > abs(2)
Ocean Airdrop
  • 2,793
  • 1
  • 26
  • 31
  • 1
    Your code does something else than the algorithm I have proposed. Your query simply calculates z-scores ([data point - mean]/ std), but doesn’t incorporate the logic of my algorithm that ignores past signals when calculating new signal thresholds. You also ignore the three parameters (lag, influence, threshold). Could you revise your answer to incorporate the actual logic? – Jean-Paul Dec 03 '18 at 15:31
  • 2
    Yes, your right. At first I thought I could get away with the above simplified version.. I have since taken your full solution and ported it to C#. See my answer below. When I have more time I will re-visit this SQL version and incorporate your algorithm. By the way, thank you for such a great answer & visual explanation. – Ocean Airdrop Dec 04 '18 at 13:55
3

The function scipy.signal.find_peaks, as its name suggests, is useful for this. But it's important to understand well its parameters width, threshold, distance and above all prominence to get a good peak extraction.

According to my tests and the documentation, the concept of prominence is "the useful concept" to keep the good peaks, and discard the noisy peaks.

What is (topographic) prominence? It is "the minimum height necessary to descend to get from the summit to any higher terrain", as it can be seen here:

The idea is:

The higher the prominence, the more "important" the peak is.

mrk
  • 8,059
  • 3
  • 56
  • 78
3

And here comes the PHP implementation of the ZSCORE algo:

<?php
$y = array(1,7,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,10,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1);

function mean($data, $start, $len) {
    $avg = 0;
    for ($i = $start; $i < $start+ $len; $i ++)
        $avg += $data[$i];
    return $avg / $len;
}
    
function stddev($data, $start,$len) {
    $mean = mean($data,$start,$len);
    $dev = 0;
    for ($i = $start; $i < $start+$len; $i++) 
        $dev += (($data[$i] - $mean) * ($data[$i] - $mean));
    return sqrt($dev / $len);
}

function zscore($data, $len, $lag= 20, $threshold = 1, $influence = 1) {

    $signals = array();
    $avgFilter = array();
    $stdFilter = array();
    $filteredY = array();
    $avgFilter[$lag - 1] = mean($data, 0, $lag);
    $stdFilter[$lag - 1] = stddev($data, 0, $lag);
    
    for ($i = 0; $i < $len; $i++) {
        $filteredY[$i] = $data[$i];
        $signals[$i] = 0;
    }


    for ($i=$lag; $i < $len; $i++) {
        if (abs($data[$i] - $avgFilter[$i-1]) > $threshold * $stdFilter[$lag - 1]) {
            if ($data[$i] > $avgFilter[$i-1]) {
                $signals[$i] = 1;
            }
            else {
                $signals[$i] = -1;
            }
            $filteredY[$i] = $influence * $data[$i] + (1 - $influence) * $filteredY[$i-1];
        } 
        else {
            $signals[$i] = 0;
            $filteredY[$i] = $data[$i];
        }
        
        $avgFilter[$i] = mean($filteredY, $i - $lag, $lag);
        $stdFilter[$i] = stddev($filteredY, $i - $lag, $lag);
    }
    return $signals;
}

$sig = zscore($y, count($y));

print_r($y); echo "<br><br>";
print_r($sig); echo "<br><br>";

for ($i = 0; $i < count($y); $i++) echo $i. " " . $y[$i]. " ". $sig[$i]."<br>";
Dharman
  • 30,962
  • 25
  • 85
  • 135
radhoo
  • 2,877
  • 26
  • 27
  • 1
    One comment: given that this algorithm will mostly be used on sampled data, I suggest you implement the [sample standard deviation](https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation) by dividing by `($len - 1)` instead of `$len` in `stddev()` – Jean-Paul Jan 10 '20 at 20:56
3

I think that delica's Python anwser has a bug in it. I can't comment on his post since I do not have the rep to do it and the edit queue is full so I am probably not the first one to notice it.

avgFilter[lag - 1] and stdFilter[lag - 1] are set in the init and then are being set again when lag == i instead of changing the [lag] value. This result to the first signal to always be 1.

Here is the code with the minor correction :

import numpy as np

class real_time_peak_detection():
    def __init__(self, array, lag, threshold, influence):
        self.y = list(array)
        self.length = len(self.y)
        self.lag = lag
        self.threshold = threshold
        self.influence = influence
        self.signals = [0] * len(self.y)
        self.filteredY = np.array(self.y).tolist()
        self.avgFilter = [0] * len(self.y)
        self.stdFilter = [0] * len(self.y)
        self.avgFilter[self.lag - 1] = np.mean(self.y[0:self.lag]).tolist()
        self.stdFilter[self.lag - 1] = np.std(self.y[0:self.lag]).tolist()

    def thresholding_algo(self, new_value):
        self.y.append(new_value)
        i = len(self.y) - 1
        self.length = len(self.y)
        if i < self.lag:
            return 0
        elif i == self.lag:
            self.signals = [0] * len(self.y)
            self.filteredY = np.array(self.y).tolist()
            self.avgFilter = [0] * len(self.y)
            self.stdFilter = [0] * len(self.y)
            self.avgFilter[self.lag] = np.mean(self.y[0:self.lag]).tolist()
            self.stdFilter[self.lag] = np.std(self.y[0:self.lag]).tolist()
            return 0

        self.signals += [0]
        self.filteredY += [0]
        self.avgFilter += [0]
        self.stdFilter += [0]

        if abs(self.y[i] - self.avgFilter[i - 1]) > self.threshold * self.stdFilter[i - 1]:
            if self.y[i] > self.avgFilter[i - 1]:
                self.signals[i] = 1
            else:
                self.signals[i] = -1

            self.filteredY[i] = self.influence * self.y[i] + (1 - self.influence) * self.filteredY[i - 1]
            self.avgFilter[i] = np.mean(self.filteredY[(i - self.lag):i])
            self.stdFilter[i] = np.std(self.filteredY[(i - self.lag):i])
        else:
            self.signals[i] = 0
            self.filteredY[i] = self.y[i]
            self.avgFilter[i] = np.mean(self.filteredY[(i - self.lag):i])
            self.stdFilter[i] = np.std(self.filteredY[(i - self.lag):i])

        return self.signals[i]
Marc
  • 64
  • 4
3

C++ (Qt) Demo Port, Interactive Parameters

I've ported the demo application for this algorithm to C++ (Qt).

The code can be found on GitHub here. A Windows (64-bit) build with an installer is on the releases page. Eventually I will add some documentation and other release builds.

You can't draw points, but you can import them from text files (whitespace separated points -- newlines count as whitespace). You can also adjust the algorithm parameters and see the effects in real time. It's very useful for tuning the algorithm to a particular dataset and for exploring how the parameters affect the results.

enter image description here


The above screenshot is slightly dated; I've since added two experimental options not in the original algorithm:

  • Option to process the data set in reverse (seems to improve results for power spectra at least).
  • Option to set a hard minimum threshold for a peak.

I also added a kludgy zoom / pan bar in the middle of the window, just drag it around with the mouse to zoom and pan.

Vague build instructions:

There's a Windows installer (64-bit) on the releases page but if you want to build it from source, the gist is:

  • Either install Qt's build tools then qmake && make in the same directory as the .pro file, or
  • Install Qt Creator, open the .pro file, choose any default build configuration, and press the build and / or run button (bottom left of Creator).

I've only tested with Qt5. I'm 91% sure the Qt Creator installer will let you install Qt5 if you configure components manually (if you do, you'll also want to confirm that Qt Charts is installed). Qt6 may or may not be a smooth build. Some day, I'll test Qt4 and Qt6 and make these docs better. Maybe.

Jason C
  • 38,729
  • 14
  • 126
  • 182
  • 1
    In any case; if you're on Windows, I just built and uploaded an installer at https://github.com/JC3/ZScorePeakDetection/releases -- lmk for other OSes. No docs yet. – Jason C Jan 04 '23 at 23:20
2

This z-scores method is quite effective at peak detection, which is also helpful for outlier removal. Outlier conversations frequently debate statistical value of each point and ethics of changing data.

But in the case of repeated, erroneous sensor values from error-prone serial communications or an error-prone sensor, there is no statistical value in errors, or spurious readings. They need to be identified and removed.

Visually the errors are obvious. The straight lines across the plot below shows what needs removing. But identifying and removing errors with an algorithm is quite challenging. Z-scores work well.

The figure below has values acquired from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points.

The z-score peak detector was able to signal on spurious data points and generated a clean resulting data set while preserving the features of the correct data:

enter image description here

Marc Compere
  • 290
  • 4
  • 17
  • 1
    Very nice application! Thanks for sharing! Did you transform the data before inputting it to the algo? If so, what transformation did you use exactly? Feel free to share a link to your paper or research document if (or when) publicly available; I’ll then add a link to your research to my list of references. Happy coding! :) – Jean-Paul Aug 02 '20 at 16:02
  • 1
    there was no transformation. the top subplot is the original data set from the data acquisition setup. The additional Matlab code was about 2 lines to extract the data set that did not trigger the signal. find indices of untouched data points: `idx_zero=find(signals==0);` then the data is extracted with `y_filtered = y(idx_zero)` – Marc Compere Aug 02 '20 at 23:52
  • 2
    I've spent hours with manually filtering spurious data points from data acquisition systems and have never found a satisfactory general algorithm until discovering this. the separate states to filter new points without changing the average with spurious data points is the key here. Z-scores for sure, but the independent filter state is critical – Marc Compere Aug 02 '20 at 23:57
  • Glad to hear that! Indeed, the separate state for the signaling threshold is they key to making this algo very robust :) Interesting to read that you didn’t even need to transform the data, I expected you would need to apply a first-differencing filter before applying the algo but apparently that is not even needed! Very cool :) – Jean-Paul Aug 03 '20 at 06:48
  • that type of tinkering is what is typical but tedious and custom every time. avoiding that illustrates the value of this algorithm. there is not much discussion in this thread about outlier removal, but this is how I've found it's best utility. – Marc Compere Aug 03 '20 at 12:37
  • the high transients mean the average needs to move quickly with the new good data. that drives `lag` window size of 4 and also `influence=0.1` or 0.2. that allows the moving average following well. I used a modified version of the original matlab script to show the upper and lower bounds (avg+/-N*std) to help tune the parameters and also illustrate which data points caused `signal` and thus removal. final filter values are in the image – Marc Compere Aug 03 '20 at 12:39
2

Dart version of @Jean-Paul Smoothed Z Score algorithm:

class SmoothedZScore {
  int lag = 5;
  num threshold = 10;
  num influence = 0.5;

  num sum(List<num> a) {
    num s = 0;
    for (int i = 0; i < a.length; i++) s += a[i];
    return s;
  }

  num mean(List<num> a) {
    return sum(a) / a.length;
  }

  num stddev(List<num> arr) {
    num arrMean = mean(arr);
    num dev = 0;
    for (int i = 0; i < arr.length; i++) dev += (arr[i] - arrMean) * (arr[i] - arrMean);
    return sqrt(dev / arr.length);
  }

  List<int> smoothedZScore(List<num> y) {
    if (y.length < lag + 2) {
      throw 'y data array too short($y.length) for given lag of $lag';
    }

    // init variables
    List<int> signals = List.filled(y.length, 0);
    List<num> filteredY = List<num>.from(y);
    List<num> leadIn = y.sublist(0, lag);

    var avgFilter = List<num>.filled(y.length, 0);
    var stdFilter = List<num>.filled(y.length, 0);
    avgFilter[lag - 1] = mean(leadIn);
    stdFilter[lag - 1] = stddev(leadIn);

    for (var i = lag; i < y.length; i++) {
      if ((y[i] - avgFilter[i - 1]).abs() > (threshold * stdFilter[i - 1])) {
        signals[i] = y[i] > avgFilter[i - 1] ? 1 : -1;
        // make influence lower
        filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i - 1];
      } else {
        signals[i] = 0; // no signal
        filteredY[i] = y[i];
      }

      // adjust the filters
      List<num> yLag = filteredY.sublist(i - lag, i);
      avgFilter[i] = mean(yLag);
      stdFilter[i] = stddev(yLag);
    }

    return signals;
  }
}
Sga
  • 3,608
  • 2
  • 36
  • 47
2

I've written a Go package for the most popular answer from Jean-Paul. It assumes the y-values are of the type float64.

github.com/MicahParks/peakdetect

The below example uses this package and is based off of the R example from the popular answer mentioned above. It compiles without any dependencies, attempts to keep a low memory footprint, and does not reprocess past points when a new data point comes in. The project has 100% test coverage, mostly from the inputs and outputs of the mentioned R example. But, if anyone spots any errors, please open a GitHub Issue.

Edit: I've made a performance improvement with v0.0.5 that appears to be faster by a factor of 10! It uses Welford's method on initialization and a similar method for computing the mean and population standard deviation over the lag period (sliding window). Special thanks to this answer on another posting: https://stackoverflow.com/a/14638138/14797322

Here is the Golang example based off of the R example:

package main

import (
    "fmt"
    "log"

    "github.com/MicahParks/peakdetect"
)

// This example is the equivalent of the R example from the algorithm's author.
// https://stackoverflow.com/a/54507329/14797322
func main() {
    data := []float64{1, 1, 1.1, 1, 0.9, 1, 1, 1.1, 1, 0.9, 1, 1.1, 1, 1, 0.9, 1, 1, 1.1, 1, 1, 1, 1, 1.1, 0.9, 1, 1.1, 1, 1, 0.9, 1, 1.1, 1, 1, 1.1, 1, 0.8, 0.9, 1, 1.2, 0.9, 1, 1, 1.1, 1.2, 1, 1.5, 1, 3, 2, 5, 3, 2, 1, 1, 1, 0.9, 1, 1, 3, 2.6, 4, 3, 3.2, 2, 1, 1, 0.8, 4, 4, 2, 2.5, 1, 1, 1}

    // Algorithm configuration from example.
    const (
        lag       = 30
        threshold = 5
        influence = 0
    )

    // Create then initialize the peak detector.
    detector := peakdetect.NewPeakDetector()
    err := detector.Initialize(influence, threshold, data[:lag]) // The length of the initial values is the lag.
    if err != nil {
        log.Fatalf("Failed to initialize peak detector.\nError: %s", err)
    }

    // Start processing new data points and determine what signal, if any they produce.
    //
    // This method, .Next(), is best for when data is being processed in a stream, but this simply iterates over a slice.
    nextDataPoints := data[lag:]
    for i, newPoint := range nextDataPoints {
        signal := detector.Next(newPoint)
        var signalType string
        switch signal {
        case peakdetect.SignalNegative:
            signalType = "negative"
        case peakdetect.SignalNeutral:
            signalType = "neutral"
        case peakdetect.SignalPositive:
            signalType = "positive"
        }

        println(fmt.Sprintf("Data point at index %d has the signal: %s", i+lag, signalType))
    }

    // This method, .NextBatch(), is a helper function for processing many data points at once. It's returned slice
    // should produce the same signal outputs as the loop above.
    signals := detector.NextBatch(nextDataPoints)
    println(fmt.Sprintf("1:1 ratio of batch inputs to signal outputs: %t", len(signals) == len(nextDataPoints)))
}
Micah Parks
  • 1,504
  • 1
  • 10
  • 22
1

Instead of comparing a maxima to the mean, one can also compare the maxima to adjacent minima where the minima are only defined above a noise threshold. If the local maximum is > 3 times (or other confidence factor) either adjacent minima, then that maxima is a peak. The peak determination is more accurate with wider moving windows. The above uses a calculation centered on the middle of the window, by the way, rather than a calculation at the end of the window (== lag).

Note that a maxima has to be seen as an increase in signal before and a decrease after.

nichole
  • 121
  • 1
  • 6
1

Object-oriented version of z-score algorithm using mordern C+++

template<typename T>
class FindPeaks{
private:
    std::vector<T> m_input_signal;                      // stores input vector
    std::vector<T> m_array_peak_positive;               
    std::vector<T> m_array_peak_negative;               
    
public:
    FindPeaks(const std::vector<T>& t_input_signal): m_input_signal{t_input_signal}{ }
    
    void estimate(){
        int lag{5};
        T threshold{ 5 };                                                                                       // set a threshold
        T influence{ 0.5 };                                                                                    // value between 0 to 1, 1 is normal influence and 0.5 is half the influence

        std::vector<T> filtered_signal(m_input_signal.size(), 0.0);                                             // placeholdered for smooth signal, initialie with all zeros
        std::vector<int> signal(m_input_signal.size(), 0);                                                          // vector that stores where the negative and positive located
        std::vector<T> avg_filtered(m_input_signal.size(), 0.0);                                                // moving averages
        std::vector<T> std_filtered(m_input_signal.size(), 0.0);                                                // moving standard deviation

        avg_filtered[lag] = findMean(m_input_signal.begin(), m_input_signal.begin() + lag);                         // pass the iteartor to vector
        std_filtered[lag] = findStandardDeviation(m_input_signal.begin(), m_input_signal.begin() + lag);

        for (size_t iLag = lag + 1; iLag < m_input_signal.size(); ++iLag) {                                         // start index frm 
            if (std::abs(m_input_signal[iLag] - avg_filtered[iLag - 1]) > threshold * std_filtered[iLag - 1]) {     // check if value is above threhold             
                if ((m_input_signal[iLag]) > avg_filtered[iLag - 1]) {
                    signal[iLag] = 1;                                                                               // assign positive signal
                }
                else {
                    signal[iLag] = -1;                                                                                  // assign negative signal
                }
                filtered_signal[iLag] = influence * m_input_signal[iLag] + (1 - influence) * filtered_signal[iLag - 1];        // exponential smoothing
            }
            else {
                signal[iLag] = 0;                                                                                         // no signal
                filtered_signal[iLag] = m_input_signal[iLag];
            }

            avg_filtered[iLag] = findMean(filtered_signal.begin() + (iLag - lag), filtered_signal.begin() + iLag);
            std_filtered[iLag] = findStandardDeviation(filtered_signal.begin() + (iLag - lag), filtered_signal.begin() + iLag);

        }

        for (size_t iSignal = 0; iSignal < m_input_signal.size(); ++iSignal) {
            if (signal[iSignal] == 1) {
                m_array_peak_positive.emplace_back(m_input_signal[iSignal]);                                        // store the positive peaks
            }
            else if (signal[iSignal] == -1) {
                m_array_peak_negative.emplace_back(m_input_signal[iSignal]);                                         // store the negative peaks
            }
        }
        printVoltagePeaks(signal, m_input_signal);
        
    }

    std::pair< std::vector<T>, std::vector<T> > get_peaks()
    {
        return std::make_pair(m_array_peak_positive, m_array_peak_negative);
    }

};


template<typename T1, typename T2 >
void printVoltagePeaks(std::vector<T1>& m_signal, std::vector<T2>& m_input_signal) {
    std::ofstream output_file("./voltage_peak.csv");
    std::ostream_iterator<T2> output_iterator_voltage(output_file, ",");
    std::ostream_iterator<T1> output_iterator_signal(output_file, ",");
    std::copy(m_input_signal.begin(), m_input_signal.end(), output_iterator_voltage);
    output_file << "\n";
    std::copy(m_signal.begin(), m_signal.end(), output_iterator_signal);
}

template<typename iterator_type>
typename std::iterator_traits<iterator_type>::value_type findMean(iterator_type it, iterator_type end)
{
    /* function that receives iterator to*/
    typename std::iterator_traits<iterator_type>::value_type sum{ 0.0 };
    int counter = 0;
    while (it != end) {
        sum += *(it++);
        counter++;
    }
    return sum / counter;
}

template<typename iterator_type>
typename std::iterator_traits<iterator_type>::value_type findStandardDeviation(iterator_type it, iterator_type end)
{
    auto mean = findMean(it, end);
    typename std::iterator_traits<iterator_type>::value_type sum_squared_error{ 0.0 };
    int counter{ 0 };
    while (it != end) {
        sum_squared_error += std::pow((*(it++) - mean), 2);
        counter++;
    }
    auto standard_deviation = std::sqrt(sum_squared_error / (counter - 1));
    return standard_deviation;
}
Community
  • 1
  • 1
Spandyie
  • 914
  • 2
  • 11
  • 23
  • 2
    Nice translation. It would be slightly nicer if the object also saves the `filtered_signal`, `signal`, `avg_filtered` and `std_filtered` as private variables and only updates those arrays _once_ when a new datapoint arrives (now the code loops over all datapoints everytime it is called). That would improve the performance of your code and suits the OOP structure even better. – Jean-Paul Jul 12 '19 at 00:05
1

Perl implementation of @Jean-Paul's algorithm.

#!/usr/bin/perl

use strict;
use Data::Dumper;

sub mean {
    my $data = shift;
    my $sum = 0;
    my $mean_val = 0;
    for my $item (@$data) {
        $sum += $item;
    }
    $mean_val = $sum / (scalar @$data) if @$data;
    return $mean_val;
}

sub variance {
    my $data = shift;
    my $variance_val = 0;
    my $mean_val = mean($data);
    my $sum = 0;
    for my $item (@$data) {
        $sum += ($item - $mean_val)**2;
    }
    $variance_val = $sum / (scalar @$data) if @$data;
    return $variance_val;
}

sub std {
    my $data = shift;
    my $variance_val = variance($data);
    return sqrt($variance_val);
}

# @param y - The input vector to analyze
# @parameter lag - The lag of the moving window
# @parameter threshold - The z-score at which the algorithm signals
# @parameter influence - The influence (between 0 and 1) of new signals on the mean and standard deviation
sub thresholding_algo {
    my ($y, $lag, $threshold, $influence) = @_;

    my @signals = (0) x @$y;
    my @filteredY = @$y;
    my @avgFilter = (0) x @$y;
    my @stdFilter = (0) x @$y;

    $avgFilter[$lag - 1] = mean([@$y[0..$lag-1]]);
    $stdFilter[$lag - 1] = std([@$y[0..$lag-1]]);

    for (my $i=$lag; $i <= @$y - 1; $i++) {
        if (abs($y->[$i] - $avgFilter[$i-1]) > $threshold * $stdFilter[$i-1]) {
            if ($y->[$i] > $avgFilter[$i-1]) {
                $signals[$i] = 1;
            } else {
                $signals[$i] = -1;
            }

            $filteredY[$i] = $influence * $y->[$i] + (1 - $influence) * $filteredY[$i-1];
            $avgFilter[$i] = mean([@filteredY[($i-$lag)..($i-1)]]);
            $stdFilter[$i] = std([@filteredY[($i-$lag)..($i-1)]]);
        }
        else {
            $signals[$i] = 0;
            $filteredY[$i] = $y->[$i];
            $avgFilter[$i] = mean([@filteredY[($i-$lag)..($i-1)]]);
            $stdFilter[$i] = std([@filteredY[($i-$lag)..($i-1)]]);
        }
    }

    return {
        signals => \@signals,
        avgFilter => \@avgFilter,
        stdFilter => \@stdFilter
    };
}

my $y = [1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1];

my $lag = 30;
my $threshold = 5;
my $influence = 0;

my $result = thresholding_algo($y, $lag, $threshold, $influence);

print Dumper $result;
Alen
  • 49
  • 5
1

This is a Python implementation of the Robust peak detection algorithm algorithm.

The initialization and the computational part are split up, only the filtered_y array has been kept, that has a maximum size equals lag, so there are no memory increases. (Result are the same of above answers). In order to plot the graph, also the labels array is kept.

I made a github gist.

import numpy as np
import pylab

def init(x, lag, threshold, influence):
    '''
    Smoothed z-score algorithm
    Implementation of algorithm from https://stackoverflow.com/a/22640362/6029703
    '''

    labels = np.zeros(lag)
    filtered_y = np.array(x[0:lag])
    avg_filter = np.zeros(lag)
    std_filter = np.zeros(lag)
    var_filter = np.zeros(lag)

    avg_filter[lag - 1] = np.mean(x[0:lag])
    std_filter[lag - 1] = np.std(x[0:lag])
    var_filter[lag - 1] = np.var(x[0:lag])

    return dict(avg=avg_filter[lag - 1], var=var_filter[lag - 1],
                std=std_filter[lag - 1], filtered_y=filtered_y,
                labels=labels)


def add(result, single_value, lag, threshold, influence):
    previous_avg = result['avg']
    previous_var = result['var']
    previous_std = result['std']
    filtered_y = result['filtered_y']
    labels = result['labels']

    if abs(single_value - previous_avg) > threshold * previous_std:
        if single_value > previous_avg:
            labels = np.append(labels, 1)
        else:
            labels = np.append(labels, -1)

        # calculate the new filtered element using the influence factor
        filtered_y = np.append(filtered_y, influence * single_value
                               + (1 - influence) * filtered_y[-1])
    else:
        labels = np.append(labels, 0)
        filtered_y = np.append(filtered_y, single_value)

    # update avg as sum of the previuos avg + the lag * (the new calculated item - calculated item at position (i - lag))
    current_avg_filter = previous_avg + 1. / lag * (filtered_y[-1]
            - filtered_y[len(filtered_y) - lag - 1])

    # update variance as the previuos element variance + 1 / lag * new recalculated item - the previous avg -
    current_var_filter = previous_var + 1. / lag * ((filtered_y[-1]
            - previous_avg) ** 2 - (filtered_y[len(filtered_y) - 1
            - lag] - previous_avg) ** 2 - (filtered_y[-1]
            - filtered_y[len(filtered_y) - 1 - lag]) ** 2 / lag)  # the recalculated element at pos (lag) - avg of the previuos - new recalculated element - recalculated element at lag pos ....

    # calculate standard deviation for current element as sqrt (current variance)
    current_std_filter = np.sqrt(current_var_filter)

    return dict(avg=current_avg_filter, var=current_var_filter,
                std=current_std_filter, filtered_y=filtered_y[1:],
                labels=labels)

lag = 30
threshold = 5
influence = 0

y = np.array([1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,
       1,1.1,1,1,1.1,1,0.8,0.9,1,1.2,0.9,1,1,1.1,1.2,1,1.5,1,3,2,5,3,2,1,1,1,0.9,1,1,3,
       2.6,4,3,3.2,2,1,1,0.8,4,4,2,2.5,1,1,1])

# Run algo with settings from above
result = init(y[:lag], lag=lag, threshold=threshold, influence=influence)

i = open('quartz2', 'r')
for i in y[lag:]:
    result = add(result, i, lag, threshold, influence)

# Plot result
pylab.subplot(211)
pylab.plot(np.arange(1, len(y) + 1), y)
pylab.subplot(212)
pylab.step(np.arange(1, len(y) + 1), result['labels'], color='red',
           lw=2)
pylab.ylim(-1.5, 1.5)
pylab.show()
MatteoB
  • 53
  • 7
1

Fsharp implementation of Brakel, J.P.G. van solution

#r "nuget: FSharp.Stats"

open FSharp.Stats

/// <summary>
/// <para>
/// Brakel, J.P.G. van (2014). "Robust peak detection algorithm using z-scores".
/// Stack Overflow. Available at:
/// <see cref="https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362">this link</see>
/// (version: 2020-11-08).
/// </para>
/// </summary>
module zScorePeakDetection =
    /// <summary>
    /// Calculate the average of a list
    /// </summary>
    let Mean (values: list<double>) = values |> List.average

    /// <summary>
    /// Standard deviation using FSharp.Stats NuGet package library.
    /// </summary>
    let StdDev (values: list<double>) = Seq.stDevPopulation values

    /// <summary>
    /// Peak detection algorithm
    /// </summary>
    /// <param name="input">This is the raw input from which peak detection is desired</param>
    /// <param name="lag">The sample lag to base observations on. A value of 5 will mean a moving window of 5 samples to base the average and standard deviation on</param>
    /// <param name="threshold">Signal to go high or low when this many standard deviations away from average</param>
    /// <param name="influence">Between 0 (no influence) and 1 (full influence)</param>
    /// <returns>Tuple of the averageFilter, signals, and stdFilter</returns>
    let StartAlgo (input: list<double>) (lag: int) (threshold: double) (influence: double) =
        let signals = Array.init input.Length (fun _ -> 0)
        let filteredY = input |> List.toArray
        let (averageFilter: array<double>) = Array.init input.Length (fun _ -> 0.0)
        let (stdFilter: array<double>) = Array.init input.Length (fun _ -> 0.0)

        let initialWindow = filteredY |> Array.take lag |> Array.toList

        averageFilter[lag - 1] <- Mean initialWindow
        stdFilter[lag - 1] <- StdDev initialWindow

        for i = lag to (input.Length - 1) do
            if (abs (input[i] - averageFilter[i - 1]) > threshold * stdFilter[i - 1]) then
                signals[i] <- if (input[i] > averageFilter[i - 1]) then 1 else -1
                filteredY[i] <- influence * input[i] + (1.0 - influence) * filteredY[i - 1]
            else
                signals[i] <- 0
                filteredY[i] <- input[i]

            let slidingWindow =
                filteredY |> Array.skip (i - lag) |> Array.take (lag + 1) |> Array.toList

            averageFilter[i] <- Mean slidingWindow
            stdFilter[i] <- StdDev slidingWindow

        // Return final output
        averageFilter, signals, stdFilter
DieOde
  • 61
  • 4
0

Assuming your data is streaming from a sensor (and so the algorithm can't know anything about the future),

I made this algo which works pretty weel with the data I get in my own project.

The algo has 2 parameters: sensitivity and window.

In the end a single line of code can get your result:

detected=data.map((a, b, c) => (a > 0) ? c[b] ** 4 * c[b - 1] ** 3 : -0).map((a, b, c) => a > Math.max(...c.slice(2)) / sensitivity).map((a, b, c) => (b > dwindow) && c.slice(b - dwindow, b).indexOf(a) == -1);

Since I am programmer and not a mathematician, I can't explaint it better than this. But I am sure someone can.

sensitivity = 20;
dwindow = 4;

data = [1., 1., 1., 1., 1., 1., 1., 1.1, 1., 0.8, 0.9,
  1., 1.2, 0.9, 1., 1., 1.1, 1.2, 1., 1.5, 1., 3.,
  2., 5., 3., 2., 1., 1., 1., 0.9, 1., 1., 3.,
  2.6, 4., 3., 3.2, 2., 1., 1., 1., 1., 1.
];
//data = data.concat(data);
//data = data.concat(data);
var data1 = [{
  name: 'original source',
  y: data
}];
Plotly.newPlot('stage1', data1, {
  title: 'Sensor data',
  yaxis: {
    title: 'signal'
  }
});

filtered = data.map((a, b, c) => (a > 0) ? c[b] ** 4 * c[b - 1] ** 3 : -0);

var data2 = [{
  name: 'filtered source',
  y: filtered
}];
Plotly.newPlot('stage2', data2, {
  title: 'Filtered  data<br>aₙ = aₙ⁴ * aₙ₋₁³',
  yaxis: {
    title: 'signal'
  }
});

dwindow = 6;
k = dwindow;
detected = filtered.map((a, b, c) => a > Math.max(...c.slice(2)) / sensitivity).map((a, b, c) => (b > k) && c.slice(b - k, b).indexOf(a) == -1)

var data3 = [{
  name: 'detected peaks',
  y: detected
}];
Plotly.newPlot('stage3', data3, {
  title: 'Window 6',
  yaxis: {
    title: 'signal'
  }
});

dwindow = 10;
k = dwindow;
detected = filtered.map((a, b, c) => a > Math.max(...c.slice(2)) / 20).map((a, b, c) => (b > k) && c.slice(b - k, b).indexOf(a) == -1)
var data4 = [{
  name: 'detected peaks',
  y: detected
}];
Plotly.newPlot('stage4', data4, {
  title: 'Window 10',
  yaxis: {
    title: 'signal'
  }
});
<script src="https://cdn.jsdelivr.net/npm/plotly.js@2.16.5/dist/plotly.min.js"></script>
<div id="stage1"></div>
<div id="stage2"></div>
<div id="stage3"></div>
<div id="stage4"></div>
Zibri
  • 9,096
  • 3
  • 52
  • 44
0

Alternatively also this algorithm works very well for me...

sensitivity = 4;
dwindow = 4;
k = dwindow;

data = [1., 1., 1., 1., 1., 1., 1., 1.1, 1., 0.8, 0.9,
  1., 1.2, 0.9, 1., 1., 1.1, 1.2, 1., 1.5, 1., 3.,
  2., 5., 3., 2., 1., 1., 1., 0.9, 1., 1., 3.,
  2.6, 4., 3., 3.2, 2., 1., 1., 1., 1., 1.
];
//data = data.concat(data);
//data = data.concat(data);
var data1 = [{
  name: 'original source',
  y: data
}];
Plotly.newPlot('stage1', data1, {
  title: 'Sensor data',
  yaxis: {
    title: 'signal'
  }
});

filtered = data.map((a,b,c)=>a>=Math.max(...c.slice(b-k,b))?a**3:0);

var data2 = [{
  name: 'filtered source',
  y: filtered
}];
Plotly.newPlot('stage2', data2, {
  title: 'Filtered  data<br>aₙ = aₙ³',
  yaxis: {
    title: 'signal'
  }
});

dwindow = 6;
k = dwindow;
detected = filtered.map((a,b,c)=>a>Math.max(...c.slice(2))/sensitivity).map((a,b,c)=>(b>k) && c.slice(b-k,b).indexOf(a)==-1 );

var data3 = [{
  name: 'detected peaks',
  y: detected
}];
Plotly.newPlot('stage3', data3, {
  title: 'Maximum in a window of 6',
  yaxis: {
    title: 'signal'
  }
});

dwindow = 10;
k = dwindow;
detected = filtered.map((a, b, c) => a > Math.max(...c.slice(2)) / 20).map((a, b, c) => (b > k) && c.slice(b - k, b).indexOf(a) == -1)
var data4 = [{
  name: 'detected peaks',
  y: detected
}];
Plotly.newPlot('stage4', data4, {
  title: 'Maximum in a window of 10',
  yaxis: {
    title: 'signal'
  }
});
<script src="https://cdn.jsdelivr.net/npm/plotly.js@2.16.5/dist/plotly.min.js"></script>
<div id="stage1"></div>
<div id="stage2"></div>
<div id="stage3"></div>
<div id="stage4"></div>
Zibri
  • 9,096
  • 3
  • 52
  • 44