Sequence detection in data.frame

Question

I have a data frame (tibble) and I'm looking for a way to detect specific sequences of values across several variables in it. The reprex has 3 variables, but there can be dozens. I'm showing 70 rows, but the real data can have several hundred thousand. The sequences to detect are stored as data frames in a named list. The reprex has 2 sequences, named A and B, but in practice there can be about 100, which is why I chose this structure to store them.

Data:

library(tidyverse)
data1 <- structure(list(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
         41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
         61, 62, 63, 64, 65, 66, 67, 68, 69, 70),
  x1 = c("z", "z", "z", "z", "z", "z", "z", "y", "y", "y",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "a",
         "z", "z", "z", "z", "z", "z", "z", "z", "z", "z",
         "z", "z", "z", "z", "y", "y", "y", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "a", "z", "z", "z"),
  x2 = c("z", "z", "z", "z", "z", "z", "z", "y", "y", "y",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "a",
         "z", "z", "z", "z", "z", "z", "z", "z", "z", "z",
         "z", "z", "z", "z", "y", "y", "y", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "a", "z", "z", "z"),
  x3 = c("c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "z", "z", "z", "z", "z", "z", "z", "z",
         "z", "z", "f", "f", "f", "f", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
         "c", "c", "c", "c", "c", "c", "c", "c", "c", "z",
         "z", "z", "z", "z", "z", "z", "z", "z", "z", "f",
         "f", "f", "f", "c", "c", "c", "c", "c", "c", "c")),
  row.names = c(NA, -70L), class = c("tbl_df", "tbl", "data.frame"))

Created on 2023-07-17 with reprex v2.0.2

Sequences to detect:

seqs <- list(A = structure(list(ID = c(1, 2, 3, 4, 5),
                        x1 = c("y", "y", "y", "c", "c"),
                        x2 = c("y", "y", "y", "c", "c"),
                        x3 = c("c", "c", "c", "c", "c")),
                   class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L)),
     B = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                        x1 = c("c", "c", "c", "c", "c", "c", "c", "a"),
                        x2 = c("c", "c", "c", "c", "c", "c", "c", "a"),
                        x3 = c("f", "f", "f", "f", "c", "c", "c", "c")),
                   class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L)))


I would like to get a result like the one below, where the det_seq column shows the row (second) in which a sequence starts. In the reprex, the sequences I'm searching for are separated by other sequences that are not relevant to me. A sequence counts as detected only if it matches on all variables at once (sequences may differ very slightly, by just one value of one variable). I only need to find the start of each sequence, because its duration is known (the number of rows in the data frame holding the sequence pattern).

      ID x1    x2    x3    det_seq
   <dbl> <chr> <chr> <chr> <chr>  
 1     1 z     z     c     NA     
 2     2 z     z     c     NA     
 3     3 z     z     c     NA     
 4     4 z     z     c     NA     
 5     5 z     z     c     NA     
 6     6 z     z     c     NA     
 7     7 z     z     c     NA     
 8     8 y     y     c     A      
 9     9 y     y     c     NA     
10    10 y     y     c     NA     
11    11 c     c     c     NA     
12    12 c     c     c     NA     
13    13 c     c     z     NA     
14    14 c     c     z     NA     
15    15 c     c     z     NA     
16    16 c     c     z     NA     
17    17 c     c     z     NA     
18    18 c     c     z     NA     
19    19 c     c     z     NA     
20    20 c     c     z     NA     
21    21 c     c     z     NA     
22    22 c     c     z     NA     
23    23 c     c     f     B      
24    24 c     c     f     NA     
25    25 c     c     f     NA     
26    26 c     c     f     NA     
27    27 c     c     c     NA     
28    28 c     c     c     NA     
29    29 c     c     c     NA     
30    30 a     a     c     NA     
31    31 z     z     c     NA     
32    32 z     z     c     NA     
33    33 z     z     c     NA     
34    34 z     z     c     NA     
35    35 z     z     c     NA     
36    36 z     z     c     NA     
37    37 z     z     c     NA     
38    38 z     z     c     NA     
39    39 z     z     c     NA     
40    40 z     z     c     NA     
41    41 z     z     c     NA     
42    42 z     z     c     NA     
43    43 z     z     c     NA     
44    44 z     z     c     NA     
45    45 y     y     c     A      
46    46 y     y     c     NA     
47    47 y     y     c     NA     
48    48 c     c     c     NA     
49    49 c     c     c     NA     
50    50 c     c     z     NA     
51    51 c     c     z     NA     
52    52 c     c     z     NA     
53    53 c     c     z     NA     
54    54 c     c     z     NA     
55    55 c     c     z     NA     
56    56 c     c     z     NA     
57    57 c     c     z     NA     
58    58 c     c     z     NA     
59    59 c     c     z     NA     
60    60 c     c     f     B      
61    61 c     c     f     NA     
62    62 c     c     f     NA     
63    63 c     c     f     NA     
64    64 c     c     c     NA     
65    65 c     c     c     NA     
66    66 c     c     c     NA     
67    67 a     a     c     NA     
68    68 z     z     c     NA     
69    69 z     z     c     NA     
70    70 z     z     c     NA

some approaches to consider here: https://stackoverflow.com/a/16537008/16730940 — Paul Stafford Allen, Jul 17 '23 at 15:49

Mark · Accepted Answer · 2023-07-17T17:27:35.740


Here's one approach:

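data1 %>% 
  mutate(det_seq = map_chr(seq_len(nrow(data1)), 
                           # compare the window starting at row .x with each pattern (columns x1 to x3)
                           ~ case_when(identical(data1[.x:(.x+4), 2:4], seqs$A[,2:4]) ~ "A",
                                       identical(data1[.x:(.x+7), 2:4], seqs$B[,2:4]) ~ "B",
                                       # return a real missing value rather than the string "NA"
                                       TRUE ~ NA_character_)))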

Update: To match against a seqs list holding any number of data frames, each of any length, use the following chunk of code instead:

data1 %>% 
  mutate(det_seq = map_chr(seq_len(nrow(data1)), 
                           # name of the first pattern whose rows match the window starting at row x
                           \(x) first(names(seqs)[map_lgl(seqs,
                             \(s) identical(data1[x:(x+nrow(s)-1), 2:4], s[,2:4]))])))
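
Since the question mentions several hundred thousand rows and about 100 patterns, a row-by-row `identical()` check may become slow. The following is only a sketch of one possible alternative in base R, not part of the original answer: it collapses the compared columns into one key per row and tests every candidate start position at once. The function name `detect_seq_starts`, its `cols` argument, and the toy data are hypothetical and chosen for illustration. It assumes the compared columns contain no missing values, because `paste()` turns `NA` into the string "NA".

```r
# Hypothetical helper (an illustration, not from the answer): label the row
# where each pattern starts, using vectorised comparisons per pattern.
detect_seq_starts <- function(data, seqs, cols) {
  n <- nrow(data)
  # One string key per row, built from the compared columns.
  row_key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  out <- rep(NA_character_, n)
  for (nm in names(seqs)) {
    pat_key <- do.call(paste, c(unname(as.list(seqs[[nm]][cols])), sep = "\r"))
    k <- length(pat_key)
    if (k == 0 || k > n) next
    starts <- seq_len(n - k + 1)          # every row where the pattern still fits
    hit <- rep(TRUE, length(starts))
    for (j in seq_len(k)) {
      # Row j of the pattern must equal the row (j - 1) places after the start.
      hit <- hit & row_key[starts + j - 1] == pat_key[j]
    }
    idx <- starts[hit]
    idx <- idx[is.na(out[idx])]           # keep the first pattern that matched
    out[idx] <- nm
  }
  out
}

# Toy data, made up for this sketch.
toy <- data.frame(
  x1 = c("z", "y", "y", "c", "c", "a", "z"),
  x2 = c("c", "c", "c", "f", "f", "c", "c")
)
toy_seqs <- list(
  A = data.frame(x1 = c("y", "y", "c"), x2 = c("c", "c", "f")),
  B = data.frame(x1 = c("c", "a"), x2 = c("f", "c"))
)
toy$det_seq <- detect_seq_starts(toy, toy_seqs, cols = c("x1", "x2"))
```

With the data from the question, the call would presumably be `detect_seq_starts(data1, seqs, cols = c("x1", "x2", "x3"))`. The inner loop runs over the rows of a pattern rather than the rows of the data, so the work per pattern is a handful of whole-column comparisons.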

edited Jul 17 '23 at 17:27

answered Jul 17 '23 at 15:54

Mark


This is what I was looking for. I would just like to automate entering the length of each sequence and its name. I may have about 100 sequences. You can get the lengths and names from my data with `seqs %>% map(nrow) %>% flatten_int()` and `names(seqs)` – wacekk Jul 17 '23 at 16:20
One solution would be to automatically generate the code for the condition (I'm using the `stringi` package), but I'd prefer something that works as plain R code so that it can be used in a loop or function, for example. `cat(stri_join("identical(data1[.x:(.x+", seqs %>% map_int(nrow)-1, "), 2:4], seqs$", names(seqs), "[,2:4]) ~ \"", names(seqs), "\",\n"))` – wacekk Jul 17 '23 at 16:35
@wacekk whoever told you map calls can't work within a function was lying to you. Loops are slow in R, but they might be necessary because of the complicated nature of your request – Mark Jul 17 '23 at 16:38
@wacekk updated! – Mark Jul 17 '23 at 17:27
if multiple sequences match, you can use something else other than `first()` – Mark Jul 17 '23 at 17:28
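
As an illustration of that last remark, here is a minimal sketch that reports every pattern starting at a row instead of only the first one. It uses base R `vapply()` in place of the `map_*()` calls so that it runs without extra packages. The function name `label_all_matches`, the comma-separated output format, and the toy data are assumptions made for this example, not something the answer prescribes.

```r
# Hypothetical helper: for each row, list the names of all patterns that
# start there, joined with commas, or NA when none does.
label_all_matches <- function(data, seqs, cols) {
  n <- nrow(data)
  vapply(seq_len(n), function(i) {
    hits <- vapply(seqs, function(s) {
      last <- i + nrow(s) - 1
      # A pattern can only match if it fits before the end of the data.
      last <= n &&
        isTRUE(all(as.matrix(data[i:last, cols, drop = FALSE]) ==
                     as.matrix(s[, cols, drop = FALSE])))
    }, logical(1))
    if (any(hits)) paste(names(seqs)[hits], collapse = ",") else NA_character_
  }, character(1))
}

# Toy data, made up for this sketch: pattern A is a prefix of pattern B.
toy2 <- data.frame(
  x1 = c("z", "y", "y", "c", "z"),
  x2 = c("c", "c", "c", "f", "c")
)
toy2_seqs <- list(
  A = data.frame(x1 = c("y", "y"), x2 = c("c", "c")),
  B = data.frame(x1 = c("y", "y", "c"), x2 = c("c", "c", "f"))
)
all_hits <- label_all_matches(toy2, toy2_seqs, cols = c("x1", "x2"))
```

Here both A and B start at row 2, so that row is labelled with both names while the others stay `NA`. A list column would be another reasonable way to hold several matches per row.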

Thank you @Mark. It's exactly what I was looking for. – wacekk Jul 17 '23 at 18:14
