Identify start and stop sequences and length of run in R

Question

I am trying to loop over a dataframe and find sequences of events between a start and stop object (an event that occurs at both the beginning and end).

Here is some sample data:

library(dplyr)  # provides %>%

time = c('8:20', '8:19', '8:15', '8:14', '8:14', '8:10', '8:04', '8:03', '8:00', '7:59', '7:55', '7:44', '7:43', '7:42')
action = c('A', 'B', 'C', 'B', 'F', 'T', 'Z', 'U', 'A', 'G', 'B', 'C', 'L', 'Z')
group = c('group1', 'group1', 'group1', 'group2', 'group1', 'group1', 'group2', 'group2', 'group2', 'group2', 'group2', 'group2', 'group1', 'group1')

test.df = cbind(time, action, group) %>% data.frame()

The full data set is longer and wider, but this should suffice.

The rule is that when one group (either group1 or group2) registers action 'A', and only 'A', a sequence starts. Any number of events can follow until the opposite group (group2 if group1 initiated 'A', or group1 in the reverse case) logs action 'Z'. Action 'Z' by the opposite group marks the end of the sequence.

This process repeats hundreds of times over the dataframe.

Each time one of the groups starts action 'A', I want every subsequent event to be linked with an ID value that increments each time that group starts a new sequence in the dataframe, until action 'Z' is taken by the opposite group.

That is, in the above sample, there would be a new column identifying that the sequence belongs to 'group1' and that this is ID 1; their next sequence later in the data set would be ID 2 for group1, and so on.

time   action   group  group.sequence  id
8:20   A        group1 group1          1
8:19   B        group1 group1          1
8:15   C        group1 group1          1
8:14   B        group2 group1          1
8:14   F        group1 group1          1
[...]

That way, the total time, the number of actions in between, and the types of actions in between can be found. Any actions that occur outside a group's 'A' to 'Z' sequence (for example, row 8) can be ignored for now.

I would prefer something I can use in my dplyr pipe, but I am open to any solution that works.

score 1 · Accepted Answer · answered May 31 '19 at 02:53

Here is my attempt using tidyverse. Run the code with a larger dataframe and let me know if your expected answer differs from mine.

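library(tidyverse)

test.df %>%
  # Work with character columns rather than factors
  mutate_if(is.factor, as.character) %>%
  # Drop the event that falls outside any 'A' to 'Z' run (row 8 in the sample)
  filter(action != "U") %>%
  # Flag start ('A') and stop ('Z') rows, then keep the group name on start rows only
  mutate(temp = ifelse(paste(group, action) %in% 
                         c("group1 A", "group2 A", "group1 Z", "group2 Z"), 
                       paste(group, action), NA),
         group.sequence = ifelse(temp %in% c("group1 Z", "group2 Z"), NA, temp),
         group.sequence = ifelse(!is.na(group.sequence), group, NA)) %>%
  # Number the start rows within each group; all other rows stay NA
  group_by(group.sequence) %>%
  mutate(id = 1:n(),
         id = ifelse(is.na(group.sequence), NA, id)) %>%
  ungroup() %>%
  # Carry the most recent start row's group and id down to the rows that follow
  fill(c(group.sequence, id)) %>%
  select(-temp)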
#> # A tibble: 13 x 5
#>    time  action group  group.sequence    id
#>    <chr> <chr>  <chr>  <chr>          <int>
#>  1 8:20  A      group1 group1             1
#>  2 8:19  B      group1 group1             1
#>  3 8:15  C      group1 group1             1
#>  4 8:14  B      group2 group1             1
#>  5 8:14  F      group1 group1             1
#>  6 8:10  T      group1 group1             1
#>  7 8:04  Z      group2 group1             1
#>  8 8:00  A      group2 group2             1
#>  9 7:59  G      group2 group2             1
#> 10 7:55  B      group2 group2             1
#> 11 7:44  C      group2 group2             1
#> 12 7:43  L      group1 group2             1
#> 13 7:42  Z      group1 group2             1

Almost 100% there, I said we can ignore row 8, but didn't mean we can remove it literally. Right now in the expanded dataframe, after I run the script, there are some instances after a 'Z' action where other events occur before the next 'A' action for either team that are grouped into the same id #. Do you have a recommendation on how to account for that? — wetcoaster, Jun 01 '19 at 00:19
For reference, the quick hack I used was to add a step in between your code and before the fill function where I changed the value of `id` to 0 after event Z, as it always signifies the break in the actions. Then, as I use the fill function it accounts for the breaks and only assigns sequence id's to the actions between the `A` and `Z`. — wetcoaster, Jun 03 '19 at 23:51
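
A possible way to cover both points from the comments above (keeping row 8 in the data, and leaving events between a 'Z' and the next 'A' unlabeled) is to walk the rows once and track which group owns the open sequence. The following is only a sketch and not the accepted answer's method: `label_sequences()` and its `start`/`stop` arguments are hypothetical names, and it assumes that an 'A' logged while a sequence is already open does not start a new one, which the question does not specify.

```r
library(dplyr)

# Sample data from the question
test.df <- data.frame(
  time   = c('8:20', '8:19', '8:15', '8:14', '8:14', '8:10', '8:04',
             '8:03', '8:00', '7:59', '7:55', '7:44', '7:43', '7:42'),
  action = c('A', 'B', 'C', 'B', 'F', 'T', 'Z', 'U', 'A', 'G', 'B', 'C', 'L', 'Z'),
  group  = c('group1', 'group1', 'group1', 'group2', 'group1', 'group1', 'group2',
             'group2', 'group2', 'group2', 'group2', 'group2', 'group1', 'group1'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: adds group.sequence and id columns to df.
# Assumption: a start action seen while a sequence is open is treated
# as an ordinary event inside that sequence.
label_sequences <- function(df, start = "A", stop = "Z") {
  n <- nrow(df)
  group.sequence <- rep(NA_character_, n)
  id <- rep(NA_integer_, n)

  groups <- unique(as.character(df$group))
  counts <- setNames(integer(length(groups)), groups)  # sequences started per group
  owner <- NA_character_                               # group owning the open sequence

  for (i in seq_len(n)) {
    g <- as.character(df$group[i])
    a <- as.character(df$action[i])

    # Open a new sequence only when none is currently open
    if (is.na(owner) && a == start) {
      owner <- g
      counts[g] <- counts[g] + 1L
    }

    if (!is.na(owner)) {
      group.sequence[i] <- owner
      id[i] <- counts[[owner]]
      # Only the stop action from the opposite group closes the sequence
      if (a == stop && g != owner) owner <- NA_character_
    }
  }

  df$group.sequence <- group.sequence
  df$id <- id
  df
}

# The helper takes and returns a data frame, so it can sit in a pipe
labelled <- test.df %>% label_sequences()

# Per-sequence summary: number of actions and the actions themselves
runs <- labelled %>%
  filter(!is.na(id)) %>%
  group_by(group.sequence, id) %>%
  summarise(n_actions = n(),
            actions = paste(action, collapse = " "),
            .groups = "drop")
```

On the sample data this labels rows 1 to 7 as group1 with id 1, leaves row 8 (action 'U') as NA in both new columns, and labels rows 9 to 14 as group2 with id 1. A plain loop is used here because whether a 'Z' closes the sequence depends on which group opened it, which is awkward to express with `fill()` alone.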
