How to find changing points in a dataset

Question

I need to find the points at which an increasing or decreasing trend starts and ends. In this data, differences of up to ~10 between consecutive values are considered noise (i.e. not an increase or decrease). In the sample data below, the first increasing trend starts at 317 and ends at 432, and the next starts at 441 and ends at 983. Each of these points is to be recorded in a separate vector.

sample <- c(312,317,380,432,438,441,509,641,779,919,
           983,980,978,983,986,885,767,758,755)

Below is an image of the main change points. Can anyone suggest an R method for this?

[Image: the main change points in the sample data]

You do not give enough information for anyone to be able to answer. Please read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4752675). They will help you formulate a question that someone can answer. — G5W, Sep 04 '17 at 01:09
Your question is too broad. There are many possible types of pattern in a dataset, & it would be a stretch to say there's one true way to deal with all of them. Is increasing numerical trend your *actual* use case? Or perhaps matching part of some text string in a character variable? Maybe fuzzy instead of exact match? Describe your actual problem, & you are more likely to get solutions that fit your particular scenario. — Z.Lin, Sep 04 '17 at 01:31
`x <- c(10, 11, 11, 12, 15, 18, 25, 30, 31, 31, 31); runs <- rle(diff(x) > 3); c(x[cumsum(runs$lengths)[which(runs$values) - 1]], x[cumsum(runs$lengths)[runs$values] + 1])` — alistaire, Sep 04 '17 at 01:31
Hi, could you explain the logic between your input `x` and the expected outcome, please? `diff(x)` indicates several other changes in trend. (P.S. I wonder, more generally, whether you are looking for change-point analysis?) — user20650, Sep 04 '17 at 01:32
I'd search here for `[r] peaks` - this has been discussed in depth a number of times on stackexchange. https://stats.stackexchange.com/questions/22974/how-to-find-local-peaks-valleys-in-a-series-of-data https://stackoverflow.com/questions/16341717/detecting-cycle-maxima-peaks-in-noisy-time-series-in-r https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima https://stackoverflow.com/questions/14319826/finding-local-maxima-minima-in-r https://stats.stackexchange.com/questions/30750/finding-local-extrema-of-a-density-function-using-splines etc etc — thelatemail, Sep 04 '17 at 01:36
@user20650: Please refer to the image I have attached with my question. — veggie crunch burger, Sep 04 '17 at 01:55
@Z.Lin: Thanks. I have attached an image of the exact points I am looking for. Please refer to it. — veggie crunch burger, Sep 04 '17 at 01:55
@G5W: Hi, I am sorry. This is my first question on Stack Overflow. :) I have attached an image that shows the points I am looking for in a dataset. Please refer to it. — veggie crunch burger, Sep 04 '17 at 02:22
@raghavkalyan The definition of change point keeps changing in this question. Would you like to find the statistically significant change points? If so, that's easy to answer. If not, how are you personally defining a change point? — www, Sep 04 '17 at 05:05
@RyanRunge: Thanks for your help! A change point here means the following. From the sample data given, the increasing trend starts at 317 and ends at 432, then starts again at 441 and ends at 919. This goes on until the end of the dataset. Each of these points is to be recorded in a separate vector. In terms of value, a difference of ~10 is considered noise and not an increase or decrease in trend. The attached image is the final one, and the points shown in it are the ones I need in an array. — veggie crunch burger, Sep 04 '17 at 05:41
@ycw and others trying to help me: Apologies, as I am new to this forum as well as to R. I do not have a reproducible example for this apart from the sample data. I tried to implement my logic with multiple while loops, which was not very successful. The attached image is the final one, and the points shown in it are the ones I need in a separate array. Thanks for your patience! — veggie crunch burger, Sep 04 '17 at 05:44

www · Accepted Answer · 2017-09-05T05:01:36.977

Here's how to make the change point vector:

vec <- c(100312,100317,100380,100432,100438,100441,100509,100641,100779,100919,
         100983,100980,100978,100983,100986,100885,100767,100758,100755,100755)

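#this finds your trend start/stops:
#rle() groups consecutive steps into runs of "trending" (|diff| > 10) or "flat";
#cumsum() gives the position in diff(vec) where each run ends, and +1 maps it back to vec
idx <- cumsum(rle(abs(diff(vec)) > 10)$lengths) + 1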

#create new vector of change points:
newVec <- vec[idx]
print(newVec)
[1] 100317 100432 100441 100983 100986 100767 100755

#(opt.) to ignore the first and last observation as a change point:
idx <- idx[idx != 1 & idx != length(vec)]

#update new vector if you want the "opt." restrictions applied:
newVec <- vec[idx]
print(newVec)
[1] 100317 100432 100441 100983 100986 100767

#you can split newVec by start/stop change points like this:
start_changepoints <- newVec[c(TRUE,FALSE)]
print(start_changepoints)
[1] 100317 100441 100986

end_changepoints <- newVec[c(FALSE,TRUE)]
print(end_changepoints)
[1] 100432 100983 100767

#to count the number of events, just measure the length of start_changepoints:
length(start_changepoints)
[1] 3
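
The one-liner that builds `idx` packs several steps together. As a rough sketch, it can be unpacked as below; the intermediate names (`noise`, `steps`, `moving`, `runs`, `run_end`) are introduced here purely for illustration and are not part of the original answer.

```r
# Same data as in the answer above
vec <- c(100312,100317,100380,100432,100438,100441,100509,100641,100779,100919,
         100983,100980,100978,100983,100986,100885,100767,100758,100755,100755)

noise <- 10                      # assumed noise threshold, taken from the question

steps   <- diff(vec)             # change between consecutive values
moving  <- abs(steps) > noise    # TRUE where a step exceeds the noise level
runs    <- rle(moving)           # consecutive runs of TRUE (trend) / FALSE (flat)
run_end <- cumsum(runs$lengths)  # position in `steps` of the last step of each run
idx     <- run_end + 1           # the same boundary as a position in `vec`

idx                              # 2 4 6 11 15 17 20
vec[idx]                         # the change points printed in the answer
```

Because `abs()` discards the sign, an increase that is followed immediately by a decrease, with no flat stretch in between, ends up in a single run.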

If you then want to plot that, you can use this:

library(ggplot2)

#preps data for plot
df <- data.frame(vec,trends=NA,cols=NA)
df$trends[idx] <- idx
df$cols[idx] <- c("green","red")

#plot
ggplot(df, aes(x=1:NROW(df),y=vec)) +
  geom_line() +
  geom_point() +
  geom_vline(aes(xintercept=trends, col=cols), 
             lty=2, lwd=1) +
  scale_color_manual(values=na.omit(df$cols),
                     breaks=na.omit(unique(df$cols)),
                     labels=c("Start","End")) +
  xlab("Index") +
  ylab("Value") +
  guides(col=guide_legend("Trend State"))

Output: [Image: line plot of the values with dashed vertical lines marking trend starts and ends]

This is absolutely fantastic! Exactly what I wanted. Thanks for understanding this even though I wasn't able to put it the right way. I have another query, please. I want to count the number of increasing or decreasing trends. E.g., in the above data the count will be 3 (number of starts with completed ends) within a given dataset. — veggie crunch burger, Sep 05 '17 at 01:37
Glad it was helpful. See the edit above for an answer to your additional question. — www, Sep 05 '17 at 05:02
What if the trend starts on the first element of the vector? Like `vec <- c(1, 100, 200, 100312,100317,100380,100432,100438,100441,100509,100641,100779,100919, 100983,100980,100978,100983,100986,100885,100767,100758,100755,100755)`. Then `c(cumsum(rle(abs(diff(vec))>10)$lengths)+1)` doesn't return the index of the very first element. — Pablo Rod, Jul 31 '19 at 14:25
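
One possible way to cover the case raised in the last comment is sketched below. It assumes that a trend is any run of consecutive steps that exceed the noise threshold in the same direction. `find_trends` and its `noise` argument are hypothetical names, not part of the accepted answer. Rather than alternating start/end by position, the sketch derives each run's start and end from the run lengths, so a trend that begins at the first element is kept, and it records the direction of each trend as well.

```r
# Hypothetical helper (not from the accepted answer): returns one row per trend.
find_trends <- function(x, noise = 10) {
  steps <- diff(x)
  # 1 = increase, -1 = decrease, 0 = change within the noise threshold
  direction <- ifelse(abs(steps) > noise, sign(steps), 0)
  runs <- rle(direction)
  end_idx   <- cumsum(runs$lengths) + 1   # position in x where each run ends
  start_idx <- end_idx - runs$lengths     # position in x where each run starts
  keep <- runs$values != 0                # drop the flat (noise-only) runs
  data.frame(direction = runs$values[keep],
             start     = x[start_idx[keep]],
             end       = x[end_idx[keep]])
}

# Vector from the comment above: the first trend starts at element 1
vec2 <- c(1, 100, 200, 100312,100317,100380,100432,100438,100441,100509,100641,
          100779,100919,100983,100980,100978,100983,100986,100885,100767,100758,
          100755,100755)

trends <- find_trends(vec2)
trends$start               # 1 100317 100441 100986
trends$end                 # 100312 100432 100983 100767
table(trends$direction)    # number of decreasing (-1) and increasing (1) trends
```

Applied to the answer's original `vec`, this sketch returns the same three start/end pairs as `start_changepoints` and `end_changepoints`.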
