Look for existence of number sequence in every row of data table in R

Question

I want to add a logical column to a data table that indicates, for each row, whether a certain sequence of numbers occurs in that row, regardless of how many times each element of the sequence is repeated.

For example, in c(1,1,1,3,3,2,2,2,2,2,1) I want to know whether c(1,3,2) occurs in that order. It does not matter how many times each element of the nominated sequence repeats. Using rle first and then "%seq_in%", as defined by a user in this post, we can do the following:

# this function searches for a specific vector in order in another vector
"%seq_in%" = function(b,a) any(sapply(1:(length(a)-length(b)+1),function(i) all(a[i:(i+length(b)-1)]==b)))

v1 <- c(1,1,1,3,3,2,2,2,2,2,1)

c(1,3,2) %seq_in% rle(v1)$values
[1] TRUE

# for clarity
c(1,2,3) %seq_in% rle(v1)$values
[1] FALSE
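
As a small illustration of why the rle step makes the run lengths irrelevant (a sketch using only base R; the name runs is just for this example): rle collapses each run of repeated values to a single entry, so the search only has to look at the order of distinct runs.

```r
# Run-length encode the example vector from the question
v1 <- c(1, 1, 1, 3, 3, 2, 2, 2, 2, 2, 1)
runs <- rle(v1)

runs$values   # one entry per run: 1 3 2 1
runs$lengths  # how long each run was: 3 2 5 1
```

Because only runs$values is searched, c(1,3,2) is found no matter how long each run is.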

So, I would like to do the same on a data table: look for a specific sequence, regardless of the run length of each element, in every row of the data table.

# dummy data
dt_dummy <- data.table(A = c(2,2,3,3,1),B = c(3,2,2,1,3), C = c(2,2,3,3,1), D = c(2,3,2,2,3), 
E = c(2,3,2,1,1), F = c(2,2,2,1,3), G = c(3,2,3,2,2), H = c(2,3,1,2,2))

dt_dummy
   A B C D E F G H
1: 2 3 2 2 2 2 3 2
2: 2 2 2 3 3 2 2 3
3: 3 2 3 2 2 2 3 1
4: 3 1 3 2 1 1 2 2
5: 1 3 1 3 1 3 2 2

# define simple function to return the values from rle
f1 <- function(v){  
 v1 <- unlist(rle(v)$values)
 return(v1)
}

# apply to every row of dt
dt_dummy[, GCG_Rot := c(3,2,3) %seq_in% f1(dt_dummy), by = seq_len(nrow(dt_dummy))]

I can't seem to get the function to work so that the generated column is TRUE or FALSE.

Rows 1, 2, & 3 should adhere to the nominated sequence and return TRUE.

If there's a way of dropping %seq_in%, I'm all for it!

score 2 · Accepted Answer · answered Jul 23 '21 at 15:09

You can try unlist over .SD, e.g.,

> dt_dummy[, GCG_RoT := c(3, 2, 3) %seq_in% f1(unlist(.SD)), seq(nrow(dt_dummy))][]
   A B C D E F G H GCG_RoT
1: 2 3 2 2 2 2 3 2    TRUE
2: 2 2 2 3 3 2 2 3    TRUE
3: 3 2 3 2 2 2 3 1    TRUE
4: 3 1 3 2 1 1 2 2   FALSE
5: 1 3 1 3 1 3 2 2   FALSE

Furthermore, you can define a function f as below, which needs neither %seq_in% nor f1:

> f <- function(a, b) grepl(toString(a), toString(rle(b)$values))

> dt_dummy[, GCG_RoT := f(c(3, 2, 3), unlist(.SD)), seq(nrow(dt_dummy))][]
   A B C D E F G H GCG_RoT
1: 2 3 2 2 2 2 3 2    TRUE
2: 2 2 2 3 3 2 2 3    TRUE
3: 3 2 3 2 2 2 3 1    TRUE
4: 3 1 3 2 1 1 2 2   FALSE
5: 1 3 1 3 1 3 2 2   FALSE
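
One caveat with matching on strings, offered as a hedged aside: with multi-digit values the pattern can match across element boundaries, for example "3, 2, 3" is a substring of "13, 2, 3". The sketch below assumes that wrapping both strings in delimiters is acceptable; f_delim and wrap are hypothetical names, not part of the answer above.

```r
# The string-based matcher from the answer above
f <- function(a, b) grepl(toString(a), toString(rle(b)$values))

# Hypothetical variant: delimit every element so matches start and end on element boundaries
f_delim <- function(a, b) {
  wrap <- function(x) paste0(",", paste(x, collapse = ","), ",")
  grepl(wrap(a), wrap(rle(b)$values), fixed = TRUE)
}

f(c(3, 2, 3), c(13, 2, 3))                      # TRUE, a false positive
f_delim(c(3, 2, 3), c(13, 2, 3))                # FALSE
f_delim(c(3, 2, 3), c(2, 3, 2, 2, 2, 2, 3, 2))  # TRUE
```

For the single-digit data in the question both versions give the same result.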

superb, many thanks. I had tried a grepl for the sequence but failed, i had forgotten to unlist. Thanks again — Sam, Jul 23 '21 at 15:16

score 1 · Answer 2 · Peace Wang

You can apply a function to each row with apply:

dt_dummy[, GCG_RoT := apply(.SD, 1, function(x) c(3,2,3) %seq_in% rle(x)$values)]
#    A B C D E F G H GCG_RoT
# 1: 2 3 2 2 2 2 3 2    TRUE
# 2: 2 2 2 3 3 2 2 3    TRUE
# 3: 3 2 3 2 2 2 3 1    TRUE
# 4: 3 1 3 2 1 1 2 2   FALSE
# 5: 1 3 1 3 1 3 2 2   FALSE
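
A possible refinement, sketched under assumptions: if the statement is rerun after the result column already exists, .SD would also contain that logical column, so it may be safer to restrict the input with .SDcols. The names cols and has_seq are hypothetical; has_seq is a guarded stand-in for %seq_in% that returns FALSE when the row has fewer runs than the pattern.

```r
library(data.table)

# Guarded sequence search (hypothetical helper, same idea as %seq_in%)
has_seq <- function(b, a) {
  n <- length(a) - length(b) + 1L
  if (n < 1L) return(FALSE)  # pattern is longer than the vector
  any(vapply(seq_len(n),
             function(i) all(a[i:(i + length(b) - 1L)] == b),
             logical(1)))
}

dt <- data.table(A = c(2,2,3,3,1), B = c(3,2,2,1,3), C = c(2,2,3,3,1), D = c(2,3,2,2,3),
                 E = c(2,3,2,1,1), F = c(2,2,2,1,3), G = c(3,2,3,2,2), H = c(2,3,1,2,2))

cols <- LETTERS[1:8]  # only the data columns A..H are searched
dt[, GCG_RoT := apply(.SD, 1, function(x) has_seq(c(3, 2, 3), rle(x)$values)),
   .SDcols = cols]
dt[]
```

With .SDcols fixed, running the assignment a second time gives the same column.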

edited Jul 23 '21 at 15:27

answered Jul 23 '21 at 15:02


score 1 · Answer 3 · answered Jul 23 '21 at 16:57

Another option is to use dapply from the collapse package:

library(data.table)
library(collapse)
dt_dummy[, GCG_RoT := dapply(.SD, MARGIN = 1, function(x) c(3, 2, 3) %seq_in% f1(x))]

Output:

dt_dummy
   A B C D E F G H GCG_RoT
1: 2 3 2 2 2 2 3 2    TRUE
2: 2 2 2 3 3 2 2 3    TRUE
3: 3 2 3 2 2 2 3 1    TRUE
4: 3 1 3 2 1 1 2 2   FALSE
5: 1 3 1 3 1 3 2 2   FALSE

score 0 · Answer 4 · answered Jul 24 '21 at 07:24

Here is another option which I think should be faster:

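# see reference 1
fseqin <- function(x, v) {
    w = seq_along(v)  # candidate start positions of x in v
    for (i in seq_along(x)) {
        # keep candidates whose i-th element matches; out-of-range ones become NA
        w = w[v[w+i-1L] == x[i]]
        # all() keeps the condition length-one, which || requires in R >= 4.3
        if (length(w)==0L || all(is.na(w))) return(FALSE)
    }
    TRUE
} #fseqin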


m <- as.matrix(dt_dummy)
dt_dummy[, found := 
    data.table(row=as.vector(row(m)), col=as.vector(col(m)), v=as.vector(m))[
        order(row, col)][
            !duplicated(rleid(row, v)), fseqin(c(3,2,3), v), row]$V1
]

output:

   A B C D E F G H found
1: 2 3 2 2 2 2 3 2  TRUE
2: 2 2 2 3 3 2 2 3  TRUE
3: 3 2 3 2 2 2 3 1  TRUE
4: 3 1 3 2 1 1 2 2 FALSE
5: 1 3 1 3 1 3 2 2 FALSE
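
The candidate-filtering idea behind fseqin can also be sketched on its own in base R. This is an illustrative variant, not the answer's code: seq_in_runs is a hypothetical name, and it uses which() to drop the NA comparisons that arise when a candidate would run past the end of the vector.

```r
# Keep a shrinking set of candidate start positions; stop as soon as none remain
seq_in_runs <- function(x, v) {
  w <- seq_along(v)  # every position starts as a candidate
  for (i in seq_along(x)) {
    # which() ignores NA results from indexing beyond length(v)
    w <- w[which(v[w + i - 1L] == x[i])]
    if (length(w) == 0L) return(FALSE)
  }
  TRUE
}

# Rows 1 and 4 of the dummy data from the question
r1 <- c(2, 3, 2, 2, 2, 2, 3, 2)
r4 <- c(3, 1, 3, 2, 1, 1, 2, 2)

seq_in_runs(c(3, 2, 3), rle(r1)$values)  # TRUE
seq_in_runs(c(3, 2, 3), rle(r4)$values)  # FALSE
```

Each pass compares one pattern element against all surviving candidates at once, so the loop runs over the pattern rather than over every window of the vector.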

Reference:

Get indexes of a vector of numbers in another vector
