2

I have a vector in R:

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)

I want to find the start and end indices of every stretch of consecutive values that is longer than 3 values, i.e.:

start end
3  6  (stretch 6-9)
8 13 (stretch 30-35)

I have no clue how to get there.

Fenrir
  • maybe if you look at `rle()` and the lagged difference. If they are sequential values, the lagged difference will be 1. Look for sequences of 1's in this using `rle()` – cory Mar 16 '16 at 17:07

3 Answers

5

From @eddi's answer to my similar question...

runs = split(seq_along(data), cumsum(c(0, diff(data) > 1)))
lapply(runs[lengths(runs) > 1], range)

# $`2`
# [1] 3 6
# 
# $`4`
# [1]  8 13

How it works:

  • seq_along(data) are the indices of data, from 1..length(data)
  • c(0, diff(data) > 1) has a 1 at each index where data "jumps"
  • cumsum(c(0, diff(data) > 1)) is an identifier for consecutive runs between jumps

So runs is a division of data's indices into runs where data's values are consecutive.
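As a rough sketch, the steps above can be wrapped in a small helper that also applies the "longer than 3" requirement from the question. The name `find_runs` and its `min_len` argument are made up for illustration and are not part of the original answer. The sketch also tests `diff(x) != 1` rather than `diff(x) > 1`, on the assumption that repeated or decreasing values should break a run as well.

```r
# Hypothetical helper (not from the answer): start/end indices of every run
# of consecutive values that contains at least min_len elements.
find_runs <- function(x, min_len = 4) {
  # A new run begins wherever the step from the previous value is not exactly 1
  run_id <- cumsum(c(TRUE, diff(x) != 1))
  runs <- split(seq_along(x), run_id)
  keep <- runs[lengths(runs) >= min_len]
  data.frame(
    start = unname(vapply(keep, min, integer(1))),
    end   = unname(vapply(keep, max, integer(1)))
  )
}

data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)
result <- find_runs(data)
print(result)
#   start end
# 1     3   6
# 2     8  13
```

With this data the answer's `lengths(runs) > 1` filter gives the same two stretches, but it would also keep runs of only two or three values if the vector contained any.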

Frank
  • this is a great answer, if a little subtle without an accompanying explanation – C8H10N4O2 Mar 16 '16 at 18:21
  • To make it look like the OP's desired output, you could do something like: `df <- as.data.frame(do.call(rbind, lapply(runs[lengths(runs) > 1], range))); names(df) <- c("start","end")`, although the desired output is not clearly specified – C8H10N4O2 Mar 16 '16 at 18:50
  • This is exactly what I was looking for. A matlab colleague came up with something similar in one of the matlab fora. http://nl.mathworks.com/matlabcentral/answers/86420-find-a-series-of-consecutive-numbers-in-a-vector?requestedDomain=www.mathworks.com – Fenrir Mar 16 '16 at 19:22
  • @user1712989 Cool. I learned matlab a long time before r, so I guess it might still influence my approach :) – Frank Mar 16 '16 at 19:28
0

First take the diff of data and number the positions within each run of equal differences (rle() plus sequence()). The starting points are the indices just before each 2 in that numbering, and the ending points are where the numbering drops back down. It's hard to explain, so just step through the code and check it out. This does not find sequences of two, like (3,4) in (1, 3, 4, 7, 9). I had to include the remove part for sequences whose values step by two, like (1, 3, 5, 7), because those weren't handled correctly. Anyhow, fun exercise. I hope somebody can do better. This is a bit of a mess...

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)
a <- sequence(rle(diff(data))$lengths)
starts <- which(a==2) - 1
ends <- which(diff(a)<0) + 1
remove <- starts[starts %in% (ends-2)]
starts <- starts[!starts %in% remove]
ends <- ends[!ends %in% (remove+2)]
if(length(ends) < length(starts)) ends <- c(ends, length(data))
starts
# [1] 3 8
ends
# [1]  6 13
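A possibly tidier variant of the same idea, offered only as a sketch: run `rle()` on the logical vector `diff(data) == 1` instead of on the raw differences, so that only steps of exactly 1 count and no separate clean-up is needed. The variable names (`r`, `run_start`, `run_end`, `keep`) are illustrative, and the threshold assumes "longer than 3 values" means at least 3 consecutive steps of 1.

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

# TRUE wherever a value is exactly one more than its predecessor
r <- rle(diff(data) == 1)

# Positions within diff(data) where each rle run ends and starts
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1

# A run of k TRUE steps spans k + 1 values, so more than 3 values means k >= 3
keep <- r$values & r$lengths >= 3

# Step i links data[i] and data[i + 1], hence the + 1 on the end index
starts <- run_start[keep]
ends   <- run_end[keep] + 1

starts
# [1] 3 8
ends
# [1]  6 13
```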
cory
0

Here's a base R solution relying heavily on ?diff:

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)

diff1 <- diff(data[1:(length(data)-1)]) # lag 1 difference (last element dropped so the length matches diff2)
diff2 <- diff(data, 2) # lag 2 difference

# indices of starting consecutive stretches -- these will overlap
start_index <- which(diff1==1 & diff2==2)
end_index <- start_index + 2

# notice that these overlap:
data.frame(start_index, end_index)

# To remove overlap:
# We can remove *subsequent* consecutive start indices
#           and *initial* consecutive end indices

start_index_new <- start_index[which(c(0, diff(start_index))!=1)]
end_index_new <- end_index[which(c(diff(end_index), 0) != 1)]
data.frame(start_index_new, end_index_new)

#   start_index_new end_index_new
# 1               3             6
# 2               8            13

Cory's answer is great. This one might just be a little easier to understand, because you're basically checking for cases where, from position i, position i+1 has a value of 1 more and position i+2 has a value of 2 more. You build three-element ranges off of this and then consolidate the overlapping ranges with another diff call. Note that this keeps any stretch of 3 or more consecutive values.
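If the minimum stretch length needs to be something other than 3, the same lag check can be extended to a window of `w` values. This is only a sketch: `w`, `ok` and `win_start` are illustrative names, and `w <- 4` is an assumption based on the question's "longer than 3".

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)
w <- 4  # minimum stretch length; assumes length(data) >= w

# Candidate window starts
i <- seq_len(length(data) - w + 1)

# A window starting at i is consecutive if data[i + k] == data[i] + k
# for every k in 1..(w - 1)
ok <- Reduce(`&`, lapply(seq_len(w - 1), function(k) data[i + k] - data[i] == k))
win_start <- i[ok]

# Consolidate overlapping windows: keep a start whose predecessor is not
# adjacent, and an end whose successor is not adjacent
starts <- win_start[c(TRUE, diff(win_start) != 1)]
ends   <- (win_start + w - 1)[c(diff(win_start) != 1, TRUE)]

data.frame(start = starts, end = ends)
#   start end
# 1     3   6
# 2     8  13
```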

There are also packages, such as zoo, that can help you compute rolling differences.

C8H10N4O2