
I am trying to create a function that will return the first integer of a subset of a vector such that the values of the subset are discrete, increasing by 1, and of a specified length.

For example, using the input data 'v' and a specified length 'l' of 3:

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

The possible sub-vectors of consecutive values of length 3 would be:

c(3, 4, 5)
c(4, 5, 6)
c(25, 26, 27)

Then I want to randomly choose one of these vectors and return the first/lowest number, i.e. 3, 4, or 25.

Henrik
mallard
    This should get you going: [How to split a vector into groups of consecutive sequences?](https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences), which I believe is the canonical post for the very useful idiom to create groups of sequences: `cumsum(c(1L, diff(v) != 1))`. You need to clarify the "_Randomly identify_". – Henrik Jun 12 '20 at 14:25

4 Answers


Here's an approach with base R:

First, we create all possible sub-vectors of the requested length. Next, we keep only those whose successive differences (diff) all equal 1. The is.na test ensures that the trailing sub-vectors, which run past the end of vec and therefore contain NA, are also filtered out. Then we bind the remaining vectors into a matrix and sample from the first column.

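SampleSequencialVectors <- function(vec, length){
  # All windows of `length` elements; windows running past the end contain NA
  all.vecs <- lapply(seq_along(vec), function(x) vec[x:(x + (length - 1))])
  # Keep windows whose successive differences are all exactly 1 (NA-safe)
  seq.vec <- all.vecs[sapply(all.vecs, function(x) all(diff(x) == 1 & !is.na(diff(x))))]
  # First column of the bound matrix holds the candidate start values
  starts <- do.call(rbind, seq.vec)[, 1]
  # Draw by index: sample(x, 1) would sample from 1:x if only one start remained
  starts[sample.int(length(starts), 1)]
}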

replicate(10, SampleSequencialVectors(v, 3))
# [1]  3  4  3  3  4  4 25 25  3 25

Or if you prefer a tidyverse type approach:

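library(magrittr)  # provides %>%
SampleSequencialVectorsPurrr <- function(vec, length){
  vec %>%
    seq_along %>%
    purrr::map(~vec[.x:(.x+(length-1))]) %>%
    purrr::keep(~ all(diff(.x) == 1 & !is.na(diff(.x)))) %>%
    # do.call replaces purrr::invoke, which is deprecated as of purrr 1.0.0
    do.call(rbind, .) %>%
    # draw by index so a single remaining start is not expanded to 1:x
    {.[, 1][sample.int(nrow(.), 1)]}
}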
replicate(10, SampleSequencialVectorsPurrr(v, 3))
# [1]  4 25 25  3 25  4  4  3  4 25
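
Because both functions return a random draw, their output is awkward to check directly. A minimal sketch of a deterministic companion that lists every candidate start instead of sampling one; the name ValidStarts is hypothetical and not part of the answer above, and it treats vec positionally in the same way as SampleSequencialVectors.

```r
# Hypothetical helper (not from the answer above): return every valid start
# value instead of sampling one, so the result can be checked by eye.
ValidStarts <- function(vec, length) {
  # Same windows as in SampleSequencialVectors; trailing windows contain NA
  all.vecs <- lapply(seq_along(vec), function(x) vec[x:(x + (length - 1))])
  # TRUE for windows whose successive differences are all exactly 1
  ok <- vapply(all.vecs,
               function(x) all(diff(x) == 1 & !is.na(diff(x))),
               logical(1))
  # Window i starts at vec[i], so the flags index the start values directly
  vec[ok]
}

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
ValidStarts(v, 3)
# [1]  3  4 25
```

Any value returned by the sampling functions should be a member of this set.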
Ian Campbell
  1. Split the vector into runs of consecutive values*: split(v, cumsum(c(1L, diff(v) != 1)))
  2. Select the runs whose length is at least the required length lim (l in the question): runs[lengths(runs) >= lim]
  3. From each run, select the possible first values: x[1:(length(x) - lim + 1)]
  4. From all possible first values, sample one.

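    lim = l  # required length of the sub-vector (l <- 3 in the question)

    runs = split(v, cumsum(c(1L, diff(v) != 1)))

    first = unlist(lapply(runs[lengths(runs) >= lim], function(x) x[1:(length(x) - lim + 1)]),
                   use.names = FALSE)

    # draw by index: sample(x, 1) would sample from 1:x if only one value remained
    first[sample.int(length(first), 1)]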
    

Here we loop over runs of sufficient length, and not all individual values (see the other answers), thus it may be faster on larger vectors (haven't tested).


Slightly more compact using data.table:

 library(data.table)
 first <- data.table(v)[ , if(.N >= lim) v[1:(.N - lim + 1)],
                         by = .(cumsum(c(1L, diff(v) != 1)))]$V1
 first[sample.int(length(first), 1)]  # index draw avoids sample(x, 1) expanding to 1:x

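*Credit to the nice canonical post: [How to split a vector into groups of consecutive sequences?](https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences)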
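
The base R steps above can be sketched as a single function. The name sample_run_start and the choice to return NA when no run is long enough are assumptions made for illustration, not part of the answer; the sketch also assumes v is sorted and holds whole numbers.

```r
# Hypothetical wrapper around the base R steps; sample_run_start is an
# illustrative name. Assumes v is sorted and holds whole numbers.
sample_run_start <- function(v, lim) {
  # 1. Group id increases by one wherever the step between neighbours is not 1
  grp <- cumsum(c(1L, diff(v) != 1))
  runs <- split(v, grp)
  # 2. Keep only runs long enough to hold a window of lim values
  runs <- runs[lengths(runs) >= lim]
  if (length(runs) == 0L) return(NA)  # assumption: NA signals "no such window"
  # 3. Every value except the last lim - 1 of a run can start a window
  first <- unlist(lapply(runs, function(x) x[seq_len(length(x) - lim + 1)]),
                  use.names = FALSE)
  # 4. Draw by index so a single candidate is returned as-is
  first[sample.int(length(first), 1)]
}

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
cumsum(c(1L, diff(v) != 1))
# [1] 1 1 1 1 2 2 3 3 3
sample_run_start(v, 3)   # one of 3, 4 or 25
```

Drawing by index rather than calling sample() on the values matters when only one candidate is left: sample(25, 1) samples from 1:25 instead of returning 25.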

Henrik

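A compact base R solution. Please note that it assumes v is sorted and contains no duplicates.

# for each v[i], collect the values lying in the window v[i], ..., v[i] + l - 1
consec_seq <- sapply(seq_along(v), function(i) split(v, v < v[i] | v > v[i] + l - 1)["FALSE"])
# a full window (l values) means a consecutive sequence of length l starts at v[i]
starts <- v[lengths(consec_seq) == l]
# draw by index among the candidates, not among 1:l
starts[sample.int(length(starts), 1)]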

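As a reusable function (not assuming sorted v):

conseq_split_sample <- function(vec, n){ 
  v <- sort(unique(vec))
  # values lying in the window v[i], ..., v[i] + n - 1, for each i
  consec_seq <- sapply(seq_along(v), function(i) split(v, v < v[i] | v > v[i] + n - 1)["FALSE"])
  # a full window means a consecutive sequence of length n starts at v[i]
  starts <- v[lengths(consec_seq) == n]
  starts[sample.int(length(starts), 1)]
}
conseq_split_sample(v, l)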

Data:

 l <- 3
 v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
hello_friend
    @IanCampbell My apologies re-read the question. It requires random selection. Have amended answer above. Thank you for pointing that out (Y) – hello_friend Jun 12 '20 at 16:15

Tooting my own horn -- cgwtools::seqle is like rle, but you can specify the desired increment in a run. seqle(x, incr = 0, ...) is the same as rle(x).

Then just grab the run lengths and starting values from the result.
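
For readers without cgwtools installed, a rough base R stand-in for that last step. It uses rle() on v - seq_along(v), which stays constant within a run of consecutive integers when v is sorted and free of duplicates. This illustrates the idea only and is not the seqle implementation; the variable names are hypothetical.

```r
v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

# Within a run of consecutive integers, v - seq_along(v) stays constant,
# so rle() on that difference yields one entry per run.
r <- rle(v - seq_along(v))
run_len <- r$lengths
# Position of the first element of each run, then its value
run_start <- v[cumsum(c(1L, head(run_len, -1)))]

run_len
# [1] 4 2 3
run_start
# [1]  3 15 25

# A run of length n starting at s offers the starts s, ..., s + n - l
keep  <- run_len >= l
first <- unlist(Map(function(s, n) s + seq_len(n - l + 1) - 1,
                    run_start[keep], run_len[keep]))
first
# [1]  3  4 25
first[sample.int(length(first), 1)]   # one of 3, 4 or 25
```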

Carl Witthoft