Checking for sequences in an R vector

Question

I'm looking for a function or operation such that if I have

A <- c(1, 2, 3, 4, 5)

and B <- c(1, 2, 3)

and C <- c(2, 1)

I'd get a TRUE when checking whether A contained B, and FALSE when checking whether A contained C

basically, the equivalent of the %in% operator but that actually cares about the order of elements

In a perfect world, I'd be able to do this without some kind of apply statement, but I may end up having to

Ideally something generalizable, but i'll take what I can get at this point. I'm working with timestamp data where I have to deal with sequences that wrap, and something that starts on the hour and lasts for an hour will have minutes 0 to 30 in it (the sequence), but something starting 20 minutes past the hour and lasting for an hour will have all those individual minute values in it, but not in sequence. — dtsavage, May 05 '20 at 19:00
An ugly idea: `grepl(paste(B,collapse = '@'),paste(A,collapse = '@'),fixed = TRUE)` — John Coleman, May 05 '20 at 19:06
see https://stackoverflow.com/questions/48660606/get-indexes-of-a-vector-of-numbers-in-another-vector: i.e. something like `A <- c(1, 2, 3, 4, 5); B <- c(1, 2, 3); C <- c(2, 1); dd2 <- function(v, x) { idx <- which(v == x[1L]); xl <- length(x) - 1L; length(idx[sapply(idx, function(i) all(v[i:(i+xl)] == x))]) > 0 } dd2(A, B) ` — chinsoon12, May 05 '20 at 23:38

Dominic van Essen · Answer 1 · 2020-05-06T12:16:36.247

3

Well, if one's allowd to use a kind-of apply loop, then this could work:

"%seq_in%" = function(b,a) any(sapply(1:(length(a)-length(b)+1),function(i) all(a[i:(i+length(b)-1)]==b)))

(edited thanks to bug-finding by John Coleman!)

EDIT 2: I couldn't resist trying to solve the 'non-contiguous' case, too:

# find_subseq() returns positions within vec of ordered elements of x, or stops with NA upon failing
find_subseq = function(x,vec) {
    p=match(x[1],vec)
    if(is.na(p)||length(x)==1){ p } 
    else { c(p,p+find_subseq(x[-1],vec[-seq_len(p)])) }
}
"%seq_somewhere_in%" = function(b,a) all(!is.na(find_subseq(b,a)))

Examples:

1:3 %seq_in% 1:10
[1] TRUE
c(3,1,2) %seq_in% 1:10
[1] FALSE
c(1,2,3) %seq_in% c(3,2,1,2,3)
[1] TRUE
2:1 %seq_in% c(1,2,1)
[1] TRUE
1:3 %seq_somewhere_in% c(1,10,10,2,10,10,10,3,10)
[1] TRUE

edited May 06 '20 at 12:16

answered May 05 '20 at 19:14

Dominic van Essen

822
5
10

But `c(1,2,3) %seq_in% c(3,2,1,2,3)` evaluates to `FALSE`. This solution doesn't seem to work with all vectors – John Coleman May 05 '20 at 19:31
@John Coleman - Oops, that is a horrible embarassment. I forgot a ```+1``` when calculating the ```1:(length(a)-length(b)+1)```. Hopefully it's fixed now. Sorry! – Dominic van Essen May 05 '20 at 20:15
That fixed that bug, but a strange thing is that `2:1 %seq_in% c(1,2,1)` is `FALSE` but `c(2,1) %seq_in% c(1,2,1)` is `TRUE`. Perhaps a better version might use `==` rather than `identical`? – John Coleman May 06 '20 at 11:26
@John Coleman - Yes, you're right, and your suggestion is very good (I've incorporated it and thanked you in the edit!). I was aware that ```identical``` would fail if (for instance) one vector has names and another hasn't, but I must admit that I was caught-out that ```identical(c(2,1),2:1)``` evaluates to ```FALSE```. – Dominic van Essen May 06 '20 at 11:38

ThomasIsCoding · Answer 2 · 2020-05-06T12:03:19.457

2

Maybe you can define a custom function subseq_check like below

subseq_check <- function(x,y) grepl(toString(y),toString(x),fixed = TRUE)

which gives

> subseq_check(A,B)
[1] TRUE

> subseq_check(A,C)
[1] FALSE

A Hard-core approach

subseq_find <- function(x,y) {
  inds <- which(x == head(y,1))
  if (length(inds)==0) return(FALSE)
  any(sapply(inds, function(k) all(x[k:(k+length(y)-1)]==y)))
}

such that

> subseq_find(A,B)
[1] TRUE

> subseq_find(A,C)
[1] FALSE

edited May 06 '20 at 12:03

answered May 05 '20 at 19:42

ThomasIsCoding

96,636
9
24
81

This is a less ugly version of the "ugly idea" I gave in the comments. For most choices of `x` and `y` (including perhaps all of OP's intended input) it works as intended. What bothered me about my own idea (beyond its kludge-like nature) is that approaches which use strings tend to be vulnerable to false positives in some edge cases. For example, the 2-element character vector `c("1, 2","3")` is not a subsequence of the 3-element character vector `c("1","2","3")` but `subseq_check(c("1, 2","3"),c("1","2","3"))` evaluates to `TRUE`. Still, probably good for many cases. – John Coleman May 06 '20 at 11:10
@JohnColeman Yes, agree! I think the this code applies to some cases but not for all. If OP can provide more information regarding the specific scenario for its uses, then it is much easier to approach the truth :) – ThomasIsCoding May 06 '20 at 11:29
I just noticed that I was reading your function backwards. I thought at first that it was checking if `x` was a subsequence of `y` but you did it vice-versa. My edge case works either way. It is partly a matter of taste, but I think that swapping the role of `x` and `y` in your definition so it matches the order in `grepl` would be slightly more natural. Tangentially -- it would be nice if there were something like regular expressions for vectors rather than strings existed. I wonder if something like that exists in some a package. – John Coleman May 06 '20 at 11:37
@JohnColeman You got it right, I set `y` as the subsequence for check. I also added an other approach. I am also wondering if there is some package has covered this functionality – ThomasIsCoding May 06 '20 at 12:00

Checking for sequences in an R vector

2 Answers2

Linked