Conditional peak & valley signal detection in realtime timeseries data [R]

Question

I have a timeseries data which contain some peaks and valleys which significantly differ from the threshold level (an example vector is provided below).

The peak/valley height/width may vary as well as the noise level.

I am interested in finding & reporting both peaks and valleys.

Currently I am using a function based on this thread: Peak signal detection in realtime timeseries data

However, I would like to improve it according to my needs: I would like to introduce an additional parameter: I would like to ignore the peaks/valleys if they are not significant enough (below certain height - peaks, not enough downward). On the attached picture, I put red circles on peaks I would like to be reported; unmarked ones should be ignored).

Here is an example vector:

signal <- c(659, 613, 586, 642, 685, 695, 691, 733, 638, 708, 706, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 651, 712, 722, 734, 705, 714, 691, 686, 676, 690, 693, 702, 694, 682, 693, 724, 693, 707, 684, 687, 705, 680, 680, 705, 680, 693, 700, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 678, 699, 677, 680, 682, 676, 690, 658, 675, 663, 667, 682, 673, 675, 656, 652, 563, 544, 542, 540, 532, 538, 505, 526, 565, 629, 720, 713, 720, 720, 773, 732, 740, 695, 689, 723, 685, 726, 710, 684, 693, 715, 692, 683, 712, 707, 693, 699, 717, 703, 687, 682, 690, 716, 708, 713, 700, 676, 708, 691, 717, 711, 722, 688, 695, 641, 666, 638, 639, 600, 635, 609, 653, 671, 649, 716, 708, 713, 700, 676, 708, 691, 717, 711, 722, 700, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 769, 767, 740, 752, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 665, 630, 808, 686, 787, 781, 796, 815, 786, 793, 664, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711)

And here is a depiction of what I want (red circles - peaks to be reported):

Allan Cameron · Answer 1 · 2022-06-29T11:12:16.500

Let's start by plotting your data, along with its mean:

plot(signal, type = 'l')
abline(h = mean(signal), col = 'blue3', lty = 2)

Since you want the upwards peak near the end of the series to be included but not the downwards spike at the very start, we need to find the number of standard deviations that would identify this peak but not the one at the start. This turns out to be about 3.5 standard deviations, as we can see if we plot the 3.5 standard deviation lines:

abline(h = mean(signal) + c(3.5, -3.5) * sd(signal), lty = 2, col = 'red')

Now comes the tricky part. We use run length encoding to identify which contiguous parts of the sequence are outside of the 3.5 standard deviations, and find the point of largest absolute deviation for each:

big <- abs(signal - mean(signal)) > 3.5 * sd(signal)
exceed <- split(seq_along(big), data.table::rleid(big))[rle(big)$value]
peaks <- sapply(exceed, function(x) x[which.max(abs(signal[x] - mean(signal)))])

Now the vector peaks will contain only the maximum or minimum point within each spike. Note although you only have two spikes in your sample data, this will work for however many spikes you have.

To demonstrate, let us plot the points:

points(peaks, signal[peaks], col = 'red', pch = 16)

Could you define the "big" variable? Would be grateful. – ramen Jun 29 '22 at 11:08 — ramen, Jun 29 '22 at 11:08
@ramen sorry - missed that. Edited to include `big` – Allan Cameron Jun 29 '22 at 11:12 — Allan Cameron, Jun 29 '22 at 11:12

score 2 · Accepted Answer · answered Jul 26 '22 at 10:04

I am the author of the original algorithm you were referring to.

To answer your question, let's first discuss the characteristics of your data:

The timeseries is stationary: the average value seems to be constant around 700
There are infrequent peaks, both up and down
The peaks do not change the distribution or the trend of the data
The peaks are clustered: if there is a peak (signal), the peak persists for approximately 20-30 data points

Because your timeseries is stationary, you should put the influence parameter close to zero. Because there are infrequent peaks, you should put the threshold high (above 3 std. devs). Because the peaks are clustered, you should put the lag parameter as high as you can.

For example, using:

lag <- 130
threshold <- 4
influence <- 0

Yields the following:

Here is the accompanying code:

ThresholdingAlgo <- function(y,lag,threshold,influence) {
  signals <- rep(0,length(y))
  filteredY <- y[0:lag]
  avgFilter <- NULL
  stdFilter <- NULL
  avgFilter[lag] <- mean(y[0:lag], na.rm=TRUE)
  stdFilter[lag] <- sd(y[0:lag], na.rm=TRUE)
  for (i in (lag+1):length(y)){
    if (abs(y[i]-avgFilter[i-1]) > threshold*stdFilter[i-1]) {
      if (y[i] > avgFilter[i-1]) {
        signals[i] <- 1;
      } else {
        signals[i] <- -1;
      }
      filteredY[i] <- influence*y[i]+(1-influence)*filteredY[i-1]
    } else {
      signals[i] <- 0
      filteredY[i] <- y[i]
    }
    avgFilter[i] <- mean(filteredY[(i-lag):i], na.rm=TRUE)
    stdFilter[i] <- sd(filteredY[(i-lag):i], na.rm=TRUE)
  }
  return(list("signals"=signals,"avgFilter"=avgFilter,"stdFilter"=stdFilter))
}


y <- c(659, 613, 586, 642, 685, 695, 691, 733, 638, 708, 706, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 651, 712, 722, 734, 705, 714, 691, 686, 676, 690, 693, 702, 694, 682, 693, 724, 693, 707, 684, 687, 705, 680, 680, 705, 680, 693, 700, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 678, 699, 677, 680, 682, 676, 690, 658, 675, 663, 667, 682, 673, 675, 656, 652, 563, 544, 542, 540, 532, 538, 505, 526, 565, 629, 720, 713, 720, 720, 773, 732, 740, 695, 689, 723, 685, 726, 710, 684, 693, 715, 692, 683, 712, 707, 693, 699, 717, 703, 687, 682, 690, 716, 708, 713, 700, 676, 708, 691, 717, 711, 722, 688, 695, 641, 666, 638, 639, 600, 635, 609, 653, 671, 649, 716, 708, 713, 700, 676, 708, 691, 717, 711, 722, 700, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 769, 767, 740, 752, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 715, 691, 670, 684, 689, 711, 673, 688, 699, 677, 701, 680, 692, 686, 710, 702, 699, 665, 630, 808, 686, 787, 781, 796, 815, 786, 793, 664, 717, 691, 703, 712, 715, 700, 693, 682, 717, 711, 722, 700, 704, 704, 715, 691, 670, 684, 689, 711, 680, 692, 686, 710, 702, 699, 702, 706, 651, 712, 722, 734, 705, 714, 691, 704, 704, 669, 712, 715, 689, 715, 691, 670, 684, 689, 711)

lag       <- 130
threshold <- 4
influence <- 0

# Run algo with lag = 30, threshold = 5, influence = 0
result <- ThresholdingAlgo(y,lag,threshold,influence)

# Plot result
par(mfrow = c(2,1),oma = c(2,2,0,0) + 0.1,mar = c(0,0,2,1) + 0.2)
plot(1:length(y),y,type="l",ylab="",xlab="") 
lines(1:length(y),result$avgFilter,type="l",col="cyan",lwd=2)
lines(1:length(y),result$avgFilter+threshold*result$stdFilter,type="l",col="green",lwd=2)
lines(1:length(y),result$avgFilter-threshold*result$stdFilter,type="l",col="green",lwd=2)
plot(result$signals,type="S",col="red",ylab="",xlab="",ylim=c(-1.5,1.5),lwd=2)
}

Now if you also wish to find the minimum or maximum value within each signal, you can simply overlay the result array on your y (data) array and identify where it achieves the highest or lowest value.

For example (using explicit loops instead of a quick one-liner):

cluster$flag <- FALSE
cluster$sign <- FALSE
cluster$data <- NULL
cluster$index <- NULL
for(i in 1:length(result$signals)) {
  if (abs(result$signals[i]) == 1.0) {
    cluster$flag <- TRUE
    cluster$sign <- result$signals[i]
    cluster$data <- append(cluster$data, y[i])
    cluster$index <- append(cluster$index, i)
  } else {
    if (cluster$flag) {
      if (cluster$sign == 1) {
        print(cat("The positive cluster starting at", cluster$index[1], "has maximum", max(cluster$data), "\n"), sep='')
      } else {
        print(cat("The negative cluster starting at", cluster$index[1], "has minimum", min(cluster$data), "\n"), sep='')
      }
    }
    cluster$flag <- FALSE
    cluster$sign <- NULL
    cluster$data <- NULL
    cluster$index <- NULL
  }
}

This gives the following output:

The negative cluster starting at 158 has minimum 505 
The positive cluster starting at 410 has maximum 808 
The positive cluster starting at 412 has maximum 815

Hope that helps!

Hi, pleasure to meet. I modified your code so at the beginning of the original vector I impute the data sampled from the original signal, so the algo uses them instead of the first n datapoints to adjust the filters. According to my empyrical observations, it works best when lag is set to 100 and threshold to 3.5. At the end I trim the output accordingly. Also I noticed it works slightly better when mean is replaced by mad. So I hope you do not mind those changes. Also I cited you accordingly (as suggested in the original thread). — ramen, Jul 26 '22 at 12:45
@ramen Of course, I'm glad you found the algorithm helpful and it is definitely advisable to tweak the algorithm to your needs! — Jean-Paul, Aug 16 '22 at 20:57

score 1 · Answer 3 · answered Jun 28 '22 at 11:19

1

Maybe you want something like this:

max.index <- which.max(signal)
max <- max(signal)
min.index <- which.min(signal)
min <- min(signal)
plot(signal, type = "l")
points(max.index, max, col = "red", pch = 16)
points(min.index, min, col = "red", pch = 16)

Output:

answered Jun 28 '22 at 11:19

Quinten

35,235
5
20
53

1

I think this would only work if there was exactly one upwards and one downwards significant spike Quinten - is that right? – Allan Cameron Jun 28 '22 at 11:26
@AllanCameron, yes you are right, your answer is better. I like the way you fix the tricky part! +1 – Quinten Jun 28 '22 at 11:35

Conditional peak & valley signal detection in realtime timeseries data [R]

3 Answers3