1

I've been looking for answers and messing around with my code for a couple hours. I have a dataset that looks like the following for a specific ID:

# A tibble: 14 × 3
        ID state orderDate          
     <dbl> <chr> <dttm>             
 1 4227631 1     2022-03-14 19:00:00
 2 4227631 1     2022-03-14 20:00:00
 3 4227631 1     2022-03-15 11:00:00
 4 4227631 0     2022-03-15 11:00:00
 5 4227631 1     2022-03-15 20:00:00
 6 4227631 1     2022-03-16 04:00:00
 7 4227631 0     2022-03-16 04:00:00
 8 4227631 1     2022-03-16 05:00:00
 9 4227631 0     2022-03-16 13:00:00
10 4227631 1     2022-03-16 15:00:00

This occurs for hundreds of IDs. For this example, I am using dplyr to group_by ID. I only care when status changes between values, not if it stays the same.

I want to calculate the cumulative time each ID remains in status 1. The instances where status 1 is repeated multiple times before it changes should be ignored. I have been planning to use lubridate and dplyr to perform the analysis.

Tibble I am using for this example:

structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
4227631, 4227631, 4227631, 4227631, 4227631), state = c("1", 
"1", "1", "0", "1", "1", "0", "1", "0", "1"), orderDate = structure(c(1647284400, 
1647288000, 1647342000, 1647342000, 1647374400, 1647403200, 1647403200, 
1647406800, 1647435600, 1647442800), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

I've tried various solutions such as Cumulative time with reset however I'm having trouble with lag and incorporating it into this specific analysis.

The expected output would maybe look something like this:

And then I would plan to sum all statusOne together to figure out cumulative time spent in this state.

Invite all more elegant solutions or if someone has a link to a prior question.


EDIT Using solution below I figured it out! The solution didn't look at the situations where state 0 immediately followed state 1 and we wanted to look at the total time elapsed between these states.

df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(orderName == lag(orderName, default = "1"), 0, 1))) %>%
  mutate(hours1 = ifelse(max == lag(max) &
                           orderName=="1", difftime(orderDate, lag(orderDate), units = "h"), NA))  %>% 
  mutate(hours2 = ifelse(orderName=="0" & lag(orderName)=="1", 
                         difftime(orderDate, lag(orderDate), units = "h"), NA)) %>% 
  mutate(hours1 = replace_na(hours1, 0), 
         hours2 = replace_na(hours2, 0)) %>% 
  mutate(hours = hours1+hours2) %>% 
  select(-hours1, -hours2) %>% 
  summarise(total_hours = sum(hours, na.rm = TRUE)) %>% 
  filter(total_hours!=0)
Brad
  • 13
  • 3

1 Answers1

0

This is far from elegant, but at least it appears to provide the correct answer:

library(tidyverse)

df <- structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
                            4227631, 4227631, 4227631, 4227631, 4227631),
                     state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
                     orderDate = structure(c(1647284400, 1647288000, 1647342000, 
                                             1647342000, 1647374400, 1647403200,
                                             1647403200, 1647406800, 1647435600, 
                                             1647442800), 
                                           tzone = "UTC",
                                           class = c("POSIXct", "POSIXt"))),
                row.names = c(NA, -10L),
                class = c("tbl_df", "tbl", "data.frame"))

df2 <- df %>%
  group_by(ID) %>%
  mutate(tmp = ifelse(state == lag(state, default = "1"), 0, 1),
         max = cumsum(tmp)) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  select(-tmp)

df3 <- df2 %>%
  group_by(max) %>%
  summarise(max, statusOne = sum(hours, na.rm = TRUE))

df4 <- left_join(df2, df3, by = "max") %>%
  distinct() %>%
  select(-c(max, hours)) %>%
  mutate(statusOne = ifelse(statusOne != 0 & lag(statusOne, default = 1) == statusOne, 0, statusOne))

df4
#> # A tibble: 10 × 4
#> # Groups:   ID [1]
#>         ID state orderDate           statusOne
#>      <dbl> <chr> <dttm>                  <dbl>
#>  1 4227631 1     2022-03-14 19:00:00        16
#>  2 4227631 1     2022-03-14 20:00:00         0
#>  3 4227631 1     2022-03-15 11:00:00         0
#>  4 4227631 0     2022-03-15 11:00:00         0
#>  5 4227631 1     2022-03-15 20:00:00         8
#>  6 4227631 1     2022-03-16 04:00:00         0
#>  7 4227631 0     2022-03-16 04:00:00         0
#>  8 4227631 1     2022-03-16 05:00:00         0
#>  9 4227631 0     2022-03-16 13:00:00         0
#> 10 4227631 1     2022-03-16 15:00:00         0

Created on 2022-04-04 by the reprex package (v2.0.1)


Edit

It's a lot more straightforward to get the total_hours state=1 for each ID:

df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))
#> # A tibble: 1 × 2
#>        ID total_hours
#>     <dbl>       <dbl>
#> 1 4227631          24

Created on 2022-04-04 by the reprex package (v2.0.1)

jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • Thanks for the assistance! There was one feature missing when the state 0 immediately follows state 1 and we weren't calculating the duration. I put the code above under my answer in the edit section. – Brad Apr 04 '22 at 16:26