2

I have a long dataframe df1 for which I want to select only those rows that satisfy next: df1$Delay rows are CONSECUTIVE and they completely match values of vector A IN THE SAME ORDER. As an example:

df1 <- data.frame(DateTime=c("2016-08-28 12:02:34.589","2016-08-28 12:03:23.589","2016-08-28 12:04:28.589","2016-08-28 12:07:56.589","2016-08-28 12:10:14.589","2016-08-28 12:12:34.589","2016-08-28 12:13:44.589","2016-08-28 12:15:43.589","2016-08-28 12:19:28.589","2016-08-28 12:22:34.589"),
                     Depth=c(12,34,56,12,3,45,52,23,32,18),
                     Delay=c(56,34,90,36,78,90,34,56,101,56))
df1

                  DateTime Depth Delay
1  2016-08-28 12:02:34.589    12    56
2  2016-08-28 12:03:23.589    34    34
3  2016-08-28 12:04:28.589    56    90
4  2016-08-28 12:07:56.589    12    36
5  2016-08-28 12:10:14.589     3    78
6  2016-08-28 12:12:34.589    45    90
7  2016-08-28 12:13:44.589    52    34
8  2016-08-28 12:15:43.589    23    56
9  2016-08-28 12:19:28.589    32   101
10 2016-08-28 12:22:34.589    18    56

A <- c(90,34,56)

In this case, I would expect to get his:

Result

                 DateTime Depth Delay
1 2016-08-28 12:12:34.589    45    90
2 2016-08-28 12:13:44.589    52    34
3 2016-08-28 12:15:43.589    23    56

Dekike
  • 1,264
  • 6
  • 17
  • 3
    Does this answer your question? [Matching a sequence in a larger vector](https://stackoverflow.com/questions/16244006/matching-a-sequence-in-a-larger-vector) – Darren Tsai Apr 13 '20 at 11:50

4 Answers4

1

Here is a base R solution

dfout <- Filter(length,
                sapply(which(df1$Delay==head(A,1)),
                       function(k) {if (all(df1$Delay[k-1+seq_along(A)]==A)) df1[k-1+seq_along(A),] else NULL}))

yielding

> dfout
[[1]]
                 DateTime Depth Delay
6 2016-08-28 12:12:34.589    45    90
7 2016-08-28 12:13:44.589    52    34
8 2016-08-28 12:15:43.589    23    56
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

A loop solution

library(data.table)

df1 <- data.frame(DateTime=c("2016-08-28 12:02:34.589","2016-08-28 12:03:23.589","2016-08-28 12:04:28.589","2016-08-28 12:07:56.589","2016-08-28 12:10:14.589","2016-08-28 12:12:34.589","2016-08-28 12:13:44.589","2016-08-28 12:15:43.589","2016-08-28 12:19:28.589","2016-08-28 12:22:34.589"),
                  Depth=c(12,34,56,12,3,45,52,23,32,18),
                  Delay=c(56,34,90,36,78,90,34,56,101,56))
df1
#>                   DateTime Depth Delay
#> 1  2016-08-28 12:02:34.589    12    56
#> 2  2016-08-28 12:03:23.589    34    34
#> 3  2016-08-28 12:04:28.589    56    90
#> 4  2016-08-28 12:07:56.589    12    36
#> 5  2016-08-28 12:10:14.589     3    78
#> 6  2016-08-28 12:12:34.589    45    90
#> 7  2016-08-28 12:13:44.589    52    34
#> 8  2016-08-28 12:15:43.589    23    56
#> 9  2016-08-28 12:19:28.589    32   101
#> 10 2016-08-28 12:22:34.589    18    56

setDT(df1)

A <- c(90,34,56)

df1[,flag:={v <- vector()
for (i in 1:.N) {
  if (identical(Delay[(i):(i+2)],A)) {
    v[(i):(i+2)] <- TRUE
  } else if (is.na(v[i])){
    v[i] <- FALSE
  }
}
v}]

df1[flag==TRUE]
#>                   DateTime Depth Delay flag
#> 1: 2016-08-28 12:12:34.589    45    90 TRUE
#> 2: 2016-08-28 12:13:44.589    52    34 TRUE
#> 3: 2016-08-28 12:15:43.589    23    56 TRUE

Created on 2020-04-13 by the reprex package (v0.3.0)

Frank Zhang
  • 1,670
  • 7
  • 14
1

Taking help from @G. Grothendieck answer in this post we can use it as :

inds <- which(zoo::rollapplyr(df1$Delay, length(A), function(x) 
                   all(A == x), fill = NA))

df1[c(mapply(`:`, inds - 2, inds)), ]

#                 DateTime Depth Delay
#1 2016-08-28 12:12:34.589    45    90
#2 2016-08-28 12:13:44.589    52    34
#3 2016-08-28 12:15:43.589    23    56

If we have only one such occurrence in the dataset we can skip the mapply part.

df1[(inds - length(A) + 1) : inds, ]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks @Ronak, one doubt. If I have an `A` vector of length bigger than 3, how should I modified the code `df1[c(mapply(`:`, inds - 2, inds)), ]`? I used it with my real data and after applying this mentioned code I only got 3 rows, when actually, `A` had a length of 17. Can you tell me how to generalize the code to use it in different situations where `A` has a different length? – Dekike Apr 13 '20 at 14:46
  • Would there be always only one occurrence of `A` in `df1` or there could be multiple such occurrence and you want to extract them all? A general approach would be : `df1[c(mapply(\`:\`, inds - length(A) - 1, inds)), ]` – Ronak Shah Apr 13 '20 at 14:48
  • In theory, only one occurrence of `A` within `df1`. – Dekike Apr 13 '20 at 14:52
  • 1
    Then we don't need `mapply`. Updated the answer. – Ronak Shah Apr 13 '20 at 14:55
  • Hi @Ronak, one last doubt. If I run the code `df1[c(mapply(`:`, inds - 2, inds)), ]` I get only the first three rows of the total rows that match between the dataframe and the vector. However, if I apply the code `df1[(inds - length(A) + 1) : inds, ]` I get all rows that match (so the second code works well). I just wondered why with the first code, that is supposed to be used when there might be more than one coincidence of vector `A` in `df1`, I only get the first three rows of that coincidence. – Dekike Apr 14 '20 at 08:40
  • 1
    Yes, because I hardcoded 2 there (which I should not have done). So if `inds` is 10 it will always return only 3 rows i.e `inds - 2` (8) : `inds` (10) so 8, 9 and 10. `(inds - length(A) + 1) : inds` is a general solution that must be used. – Ronak Shah Apr 14 '20 at 08:50
1

Here is an option using joins using data.table:

library(data.table)

#convert into a data.table for joning
A <- as.list(c(90, 34, 56))
v <- seq_along(A)

#lead the Delay values by steps of 0, 1, 2 and join on A
rows <- setDT(df1)[, shift(Delay, -(v - 1L))][
    A, on=paste0("V", v), which=TRUE]

#extract the first row and subsequent rows
df1[sapply(rows, function(k) k + seq_along(A) - 1L)]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35