Get runs of consecutive integers of certain length and sample from first values

Question

I am trying to create a function that will return the first integer of a subset of a vector such that the values of the subset are discrete, increasing by 1, and of a specified length.

For example, using the input data 'v' and a specified length 'l' of 3:

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

The possible sub-vectors of consecutive values of length 3 would be:

c(3, 4, 5)
c(4, 5, 6)
c(25, 26, 27)

Then I want to randomly choose one of these vectors and return the first/lowest number, i.e. 3, 4, or 25.

This should get you going: [How to split a vector into groups of consecutive sequences?](https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences), which I believe is the canonical post for the very useful idiom to create groups of sequences: `cumsum(c(1L, diff(v) != 1))`. You need to clarify the "_Randomly identify_". — Henrik, Jun 12 '20 at 14:25
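To make the idiom from the comment concrete, here is a minimal sketch on the question's data. The names `grp` and `runs` are only illustrative.

```r
v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)

# A new group starts wherever the step from the previous value is not 1.
grp <- cumsum(c(1L, diff(v) != 1))
grp
# [1] 1 1 1 1 2 2 3 3 3

# Splitting by that group id gives one list element per run.
runs <- split(v, grp)
lengths(runs)
# 1 2 3 
# 4 2 3 
```

Only runs 1 and 3 are long enough to contain a sub-vector of length 3.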

Ian Campbell · Accepted Answer · 2020-06-12T15:21:48.957

Here's an approach with base R:

First, we create all possible sub-vectors of the requested length. Next, we keep only those whose successive differences (diff) all equal 1. The is.na test ensures that the trailing sub-vectors, which run past the end of the input and therefore contain NA, are filtered out as well. Then we bind the remaining vectors into a matrix and sample from its first column.

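SampleSequencialVectors <- function(vec, length){
  # every window of `length` values; windows running past the end contain NA
  all.vecs <- lapply(seq_along(vec), function(x) vec[x:(x + (length - 1))])
  # keep the windows whose successive differences are all exactly 1
  seq.vec <- all.vecs[sapply(all.vecs, function(x) all(diff(x) == 1 & !is.na(diff(x))))]
  starts <- do.call(rbind, seq.vec)[, 1]
  # index via sample.int(): sample(x, 1) on a single number x draws from 1:x
  starts[sample.int(length(starts), 1)]
}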

replicate(10, SampleSequencialVectors(v, 3))
# [1]  3  4  3  3  4  4 25 25  3 25

Or if you prefer a tidyverse type approach:

library(magrittr)  # provides %>%

SampleSequencialVectorsPurrr <- function(vec, length){
  vec %>%
    seq_along %>%
    purrr::map(~vec[.x:(.x+(length-1))]) %>%
    purrr::keep(~ all(diff(.x) == 1 & !is.na(diff(.x)))) %>%
    do.call(rbind, .) %>%   # purrr::invoke() is deprecated as of purrr 1.0.0
    {.[sample.int(nrow(.), 1), 1]}   # random row, first column
}
replicate(10, SampleSequencialVectorsPurrr(v, 3))
# [1]  4 25 25  3 25  4  4  3  4 25
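
One thing to watch with any approach that ends in a call like sample(x, 1): when x is a single number of at least 1, sample() draws from 1:x instead of returning x. That case arises whenever only one sub-vector qualifies. A minimal sketch of the pitfall and an index-based guard follows; `sample_one` is a hypothetical helper name, not part of the answer.

```r
# Hypothetical helper (not from the answer): pick one element of x by position.
sample_one <- function(x) x[sample.int(length(x), 1)]

starts <- 25           # e.g. the only valid first value
sample(starts, 1)      # a draw from 1:25, usually not 25
sample_one(starts)     # always 25
```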

Henrik · Answer 2 · 2020-06-12T20:38:45.127

Split the vector into runs of consecutive values*: split(v, cumsum(c(1L, diff(v) != 1)))
Select runs of length above or equal to the limit: runs[lengths(runs) >= lim]
From each run, select the possible first values (x[1:(length(x) - lim + 1)]).

From all possible first values, sample 1.

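lim = 3  # required run length (l in the question)

runs = split(v, cumsum(c(1L, diff(v) != 1)))

first = lapply(runs[lengths(runs) >= lim], function(x) x[1:(length(x) - lim + 1)])
first = unlist(first, use.names = FALSE)

# index via sample.int(): sample(first, 1) would draw from 1:first if only one candidate is left
first[sample.int(length(first), 1)]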

Here we loop over runs of sufficient length, and not all individual values (see the other answers), thus it may be faster on larger vectors (haven't tested).
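
Since the speed claim is untested, here is a rough sketch of how one might check it. The function names and the test vector `big` are made up for this comparison, and `starts_by_windows` is a simplified stand-in for the per-element approach rather than a copy of any answer. No timings are claimed.

```r
# Hypothetical wrappers, named here only for the comparison.
starts_by_runs <- function(v, lim) {
  runs <- split(v, cumsum(c(1L, diff(v) != 1)))
  runs <- runs[lengths(runs) >= lim]
  unlist(lapply(runs, function(x) x[seq_len(length(x) - lim + 1)]),
         use.names = FALSE)
}

starts_by_windows <- function(v, lim) {
  # test every window of lim values; windows past the end contain NA
  ok <- vapply(seq_along(v), function(i) {
    w <- v[i:(i + lim - 1)]
    !anyNA(w) && all(diff(w) == 1)
  }, logical(1))
  v[ok]
}

# Made-up test data: sorted, unique integers with gaps.
set.seed(42)
big <- sort(sample.int(2e5, 1e5))

# Both should find the same candidate first values.
stopifnot(identical(starts_by_runs(big, 3), starts_by_windows(big, 3)))

system.time(starts_by_runs(big, 3))
system.time(starts_by_windows(big, 3))
```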

Slightly more compact using data.table:

library(data.table)

sample(data.table(v)[ , if(.N >= lim) v[1:(length(v) - lim + 1)],
                      by = .(cumsum(c(1L, diff(v) != 1)))]$V1, 1)

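*Credit to the canonical post: [How to split a vector into groups of consecutive sequences?](https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences)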

hello_friend · Answer 3 · 2020-06-12T16:30:47.733

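Base R in two lines. Note that this solution assumes v is sorted and, because each window is v[i] ± 1, it only works for l = 3.

consec_seq <- sapply(seq_along(v), function(i)split(v, abs(v - v[i]) > 1)[1])
# sample among the qualifying sub-vectors themselves, not among the indices 1..l
sample(consec_seq[lengths(consec_seq) == l], 1)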

As a reusable function (not assuming sorted v):

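conseq_split_sample <- function(vec, n){
  v <- sort(vec)
  # for each value, keep the elements within 1 of it (the "FALSE" group)
  consec_seq <- sapply(seq_along(v), function(i)split(v, abs(v - v[i]) > 1)["FALSE"])
  # sample among the qualifying sub-vectors, however many there are
  sample(consec_seq[lengths(consec_seq) == n], 1)
}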
conseq_split_sample(v, l)

Data:

 l <- 3
 v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)

edited Jun 12 '20 at 16:30

answered Jun 12 '20 at 15:38

@IanCampbell My apologies re-read the question. It requires random selection. Have amended answer above. Thank you for pointing that out (Y) – hello_friend Jun 12 '20 at 16:15

score 0 · Answer 4 · answered Jun 12 '20 at 15:07

Tooting my own horn -- cgwtools::seqle is like rle, but you can specify the desired increment in a run. seqle(x, incr = 0,..) is the same as rle(x).

Then just grab the run lengths and starting values from the result.
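
The answer gives no code, so here is a minimal sketch of that last step. To stay self-contained it uses a base R stand-in that computes the start and length of each increment-1 run instead of calling seqle. The names `run_start`, `run_len`, and `starts` are illustrative. Treating these two vectors as equivalent to the values and lengths seqle would return is an assumption, not something verified here.

```r
v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

# Base R stand-in for a run-length encoding with increment 1:
# the starting value and the length of each run.
grp <- cumsum(c(1L, diff(v) != 1))
run_start <- v[!duplicated(grp)]
run_len <- tabulate(grp)

# A run of length n >= l contributes n - l + 1 possible first values.
keep <- run_len >= l
starts <- unlist(Map(function(s, n) s + seq_len(n - l + 1) - 1,
                     run_start[keep], run_len[keep]))
starts
# [1]  3  4 25

# Pick one by position, which is safe even with a single candidate.
starts[sample.int(length(starts), 1)]
```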

Carl Witthoft
