I have a data frame (tibble) and I'm looking for a way to detect specific multi-variable sequences in it. The reprex has 3 variables, but there can be dozens. I'm showing 70 rows, but the real data can have several hundred thousand. The sequences to detect are stored as data frames in a named list. The reprex has 2 sequences, named A and B, but in practice there can be about 100, which is why I chose this structure to store them.
Data:
library(tidyverse)
data1 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70), x1 = c("z", "z", "z",
"z", "z", "z", "z", "y", "y", "y", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"a", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z",
"z", "z", "y", "y", "y", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "a", "z",
"z", "z"), x2 = c("z", "z", "z", "z", "z", "z", "z", "y", "y",
"y", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "a", "z", "z", "z", "z", "z",
"z", "z", "z", "z", "z", "z", "z", "z", "z", "y", "y", "y", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "a", "z", "z", "z"), x3 = c("c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "z", "z", "z",
"z", "z", "z", "z", "z", "z", "z", "f", "f", "f", "f", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "z", "z", "z", "z", "z",
"z", "z", "z", "z", "z", "f", "f", "f", "f", "c", "c", "c", "c",
"c", "c", "c")), row.names = c(NA, -70L), class = c("tbl_df",
"tbl", "data.frame"))
Created on 2023-07-17 with reprex v2.0.2
Sequences to detect:
seqs <- list(A = structure(list(ID = c(1, 2, 3, 4, 5),
x1 = c("y", "y", "y", "c", "c"),
x2 = c("y", "y", "y", "c", "c"),
x3 = c("c", "c", "c", "c", "c")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L)),
B = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
x1 = c("c", "c", "c", "c", "c", "c", "c", "a"),
x2 = c("c", "c", "c", "c", "c", "c", "c", "a"),
x3 = c("f", "f", "f", "f", "c", "c", "c", "c")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L)))
I would like to get a result like the one below, where the det_seq column tells me in which second (row) each sequence starts. In the reprex, the sequences I'm searching for are separated by other sequences that are not relevant to me. It is important that a sequence counts as detected only when it matches on all variables at once, because two sequences may differ very slightly, by as little as one value of one variable. I only need to find where a sequence begins, because its duration is known (it is the number of rows in the data frame that holds the sequence pattern).
ID x1 x2 x3 det_seq
<dbl> <chr> <chr> <chr> <chr>
1 1 z z c NA
2 2 z z c NA
3 3 z z c NA
4 4 z z c NA
5 5 z z c NA
6 6 z z c NA
7 7 z z c NA
8 8 y y c A
9 9 y y c NA
10 10 y y c NA
11 11 c c c NA
12 12 c c c NA
13 13 c c z NA
14 14 c c z NA
15 15 c c z NA
16 16 c c z NA
17 17 c c z NA
18 18 c c z NA
19 19 c c z NA
20 20 c c z NA
21 21 c c z NA
22 22 c c z NA
23 23 c c f B
24 24 c c f NA
25 25 c c f NA
26 26 c c f NA
27 27 c c c NA
28 28 c c c NA
29 29 c c c NA
30 30 a a c NA
31 31 z z c NA
32 32 z z c NA
33 33 z z c NA
34 34 z z c NA
35 35 z z c NA
36 36 z z c NA
37 37 z z c NA
38 38 z z c NA
39 39 z z c NA
40 40 z z c NA
41 41 z z c NA
42 42 z z c NA
43 43 z z c NA
44 44 z z c NA
45 45 y y c A
46 46 y y c NA
47 47 y y c NA
48 48 c c c NA
49 49 c c c NA
50 50 c c z NA
51 51 c c z NA
52 52 c c z NA
53 53 c c z NA
54 54 c c z NA
55 55 c c z NA
56 56 c c z NA
57 57 c c z NA
58 58 c c z NA
59 59 c c z NA
60 60 c c f B
61 61 c c f NA
62 62 c c f NA
63 63 c c f NA
64 64 c c c NA
65 65 c c c NA
66 66 c c c NA
67 67 a a c NA
68 68 z z c NA
69 69 z z c NA
70 70 z z c NA
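To make the matching rule above concrete, here is a minimal base R sketch of what "a sequence starts at row i" means: rows i to i + k - 1 of the data must equal all k rows of the pattern in every variable. The function name `detect_seq_starts()`, its `vars` argument, and the toy data are hypothetical and made up for illustration. It is a naive sliding-window comparison, not a tuned solution, and it has not been benchmarked on data of the size described above.

```r
# Naive sketch: mark the row where each pattern starts.
# `detect_seq_starts()` and `vars` are illustrative names, not from any package.
detect_seq_starts <- function(data, patterns,
                              vars = setdiff(names(data), "ID")) {
  n <- nrow(data)
  det <- rep(NA_character_, n)
  for (nm in names(patterns)) {
    pat <- patterns[[nm]]
    k <- nrow(pat)
    if (k == 0 || k > n) next
    # hit[i] stays TRUE only if rows i..(i + k - 1) of `data`
    # equal the pattern in every variable listed in `vars`
    hit <- rep(TRUE, n - k + 1)
    for (v in vars) {
      for (j in seq_len(k)) {
        hit <- hit & (data[[v]][j:(n - k + j)] == pat[[v]][j])
      }
    }
    # If two patterns start on the same row, the later one overwrites
    det[which(hit)] <- nm
  }
  data$det_seq <- det
  data
}

# Toy data (made up, smaller than the reprex): the second run of
# y, y, c differs from pattern A by one value of x2, so it must not match.
toy <- data.frame(
  ID = 1:8,
  x1 = c("z", "y", "y", "c", "z", "y", "y", "c"),
  x2 = c("z", "y", "y", "c", "z", "y", "y", "z"),
  stringsAsFactors = FALSE
)
toy_seqs <- list(
  A = data.frame(ID = 1:3,
                 x1 = c("y", "y", "c"),
                 x2 = c("y", "y", "c"),
                 stringsAsFactors = FALSE)
)

res <- detect_seq_starts(toy, toy_seqs)
res$det_seq
```

In the toy example only row 2 is marked with A; the run starting at row 6 is rejected because x2 differs in its last row. Called as `detect_seq_starts(data1, seqs)` on the reprex, the same logic should mark rows 8 and 45 with A and rows 23 and 60 with B, as in the expected output.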