2

I have the following dataframe called df (dput below):

   group                date indicator value
1      A 2022-11-01 01:00:00     FALSE     2
2      A 2022-11-01 03:00:00     FALSE     1
3      A 2022-11-01 04:00:00     FALSE     2
4      A 2022-11-01 05:00:00     FALSE     1
5      A 2022-11-01 06:00:00      TRUE     1
6      A 2022-11-01 07:00:00     FALSE     1
7      A 2022-11-01 10:00:00     FALSE     2
8      A 2022-11-01 12:00:00     FALSE     1
9      B 2022-11-01 01:00:00     FALSE     1
10     B 2022-11-01 02:00:00     FALSE     2
11     B 2022-11-01 03:00:00     FALSE     1
12     B 2022-11-01 06:00:00      TRUE     1
13     B 2022-11-01 07:00:00     FALSE     1
14     B 2022-11-01 08:00:00     FALSE     1
15     B 2022-11-01 11:00:00     FALSE     2
16     B 2022-11-01 13:00:00     FALSE     2

I would like to find the first rows that have a value change after with respect to the rows with indicator == TRUE per group. This means that it should find row 7 for group A and row 15 for group B because they are both the first rows that have a value change after and with respect to the conditioned rows. Here is the desired output called df_desired:

  group                date indicator value
1     A 2022-11-01 06:00:00      TRUE     1
2     A 2022-11-01 10:00:00     FALSE     2
3     B 2022-11-01 06:00:00      TRUE     1
4     B 2022-11-01 11:00:00     FALSE     2

So I was wondering if anyone knows how to find the desired rows with conditioned rows like in df_desired?


Here dput of df and df_desired:

df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B"), date = structure(c(1667260800, 
1667268000, 1667271600, 1667275200, 1667278800, 1667282400, 1667293200, 
1667300400, 1667260800, 1667264400, 1667268000, 1667278800, 1667282400, 
1667286000, 1667296800, 1667304000), class = c("POSIXct", "POSIXt"
), tzone = ""), indicator = c(FALSE, FALSE, FALSE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE), value = c(2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 
1, 1, 2, 2)), row.names = c(NA, -16L), class = "data.frame")

df_desired <- structure(list(group = c("A", "A", "B", "B"), date = c("2022-11-01 06:00:00", 
"2022-11-01 10:00:00", "2022-11-01 06:00:00", "2022-11-01 11:00:00"
), indicator = c(TRUE, FALSE, TRUE, FALSE), value = c(1, 2, 1, 
2)), class = "data.frame", row.names = c(NA, -4L))
Quinten
  • 35,235
  • 5
  • 20
  • 53
  • I think [this question](https://stackoverflow.com/questions/73589598/group-data-table-by-consecutive-runs-of-two-id-variables-without-using-split) and [this question](https://stackoverflow.com/questions/73523083/group-data-frame-by-the-consecutive-runs-of-an-id-variable-and-find-earliest-and/73523315#73523315) might be of use. – diomedesdata Nov 23 '22 at 08:39

2 Answers2

3

Here is one way to do it,

library(dplyr)

df %>% 
 group_by(group) %>% 
 mutate(val_diff = value - lag(value), 
        new = row_number()[indicator], 
        new1 = (val_diff == 1) & (row_number() > new)) %>% 
 filter(indicator|new1) %>% 
 select(-c(val_diff, new, new1))

# A tibble: 4 × 4
# Groups:   group [2]
  group date                indicator value
  <chr> <dttm>              <lgl>     <dbl>
1 A     2022-11-01 08:00:00 TRUE          1
2 A     2022-11-01 12:00:00 FALSE         2
3 B     2022-11-01 08:00:00 TRUE          1
4 B     2022-11-01 13:00:00 FALSE         2

DATA USED from OP dput

  group                date indicator value
1      A 2022-11-01 03:00:00     FALSE     2
2      A 2022-11-01 05:00:00     FALSE     1
3      A 2022-11-01 06:00:00     FALSE     2
4      A 2022-11-01 07:00:00     FALSE     1
5      A 2022-11-01 08:00:00      TRUE     1
6      A 2022-11-01 09:00:00     FALSE     1
7      A 2022-11-01 12:00:00     FALSE     2
8      A 2022-11-01 14:00:00     FALSE     1
9      B 2022-11-01 03:00:00     FALSE     1
10     B 2022-11-01 04:00:00     FALSE     2
11     B 2022-11-01 05:00:00     FALSE     1
12     B 2022-11-01 08:00:00      TRUE     1
13     B 2022-11-01 09:00:00     FALSE     1
14     B 2022-11-01 10:00:00     FALSE     1
15     B 2022-11-01 13:00:00     FALSE     2
16     B 2022-11-01 15:00:00     FALSE     2
Sotos
  • 51,121
  • 6
  • 32
  • 66
2

Using by.

by(df, df$group, \(x) {
  u <- x[cumsum(x$indicator) > 0, ]
  u[c(1L, which.max(u$value != u$value[1L])), ]
}) |> do.call(what=rbind)
#      group                date indicator value
# A.5      A 2022-11-01 06:00:00      TRUE     1
# A.7      A 2022-11-01 10:00:00     FALSE     2
# B.12     B 2022-11-01 06:00:00      TRUE     1
# B.15     B 2022-11-01 11:00:00     FALSE     2
jay.sf
  • 60,139
  • 8
  • 53
  • 110