Create start and endtime columns based on multiple conditions in R (dplyr, lubridate)

Question

I have a dataset, df

 Read      Box       ID      Time
 T         out               10/1/2019 9:00:01 AM
 T         out               10/1/2019 9:00:02 AM
 T         out               10/1/2019 9:00:03 AM
 T         out               10/1/2019 9:02:59 AM
 T         out               10/1/2019 9:03:00 AM
 F                           10/1/2019 9:05:00 AM
 T         out               10/1/2019 9:06:00 AM
 T         out               10/1/2019 9:06:02 AM
 T         in                10/1/2019 9:07:00 AM
 T         in                10/1/2019 9:07:02 AM
 T         out               10/1/2019 9:07:04 AM
 T         out               10/1/2019 9:07:05 AM
 T         out               10/1/2019 9:07:06 AM
           hello             10/1/2019 9:07:08 AM

Based on certain conditions within this dataset, I would like to create a startime column and an endtime column. I would like to create a 'starttime' when the following occurs: Read == "T", Box == "out" and ID == "" When the first instance of this condition occurs, a starttime will be generated. For example for this dataset, the starttime will be 10/1/2019 9:00:01 AM since this is where we see the desired conditions occurs first (Read = T, Box = out and ID = "" ) However, the moment when anyone of these conditions is not true, and endtime will be created. So the first endtime would occur right before row 6, where the time is 10/1/2019 9:03:00 AM. My ultimate goal is to then create a duration column for this.

This is my desired output:

  starttime                    endtime                     duration

  10/01/2019 9:00:01 AM        10/01/2019 9:03:00 AM       179 secs
  10/1/2019 9:06:00 AM         10/1/2019 9:06:02 AM        2 secs
  10/1/2019 9:07:04 AM         10/1/2019 9:07:06 AM        2 secs

dput:

  structure(list(Read = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 3L, 
  3L, 3L, 3L, 4L, 4L, 3L, 1L), .Label = c("", "F", "T", "T "), class = "factor"), 
  Box = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 
  3L, 3L, 3L, 1L), .Label = c("", "in", "out"), class = "factor"), 
  ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
  1L, 1L, 1L, 2L), .Label = c("", "hello"), class = "factor"), 
  Time = structure(1:14, .Label = c("10/1/2019 9:00:01 AM", 
 "10/1/2019 9:00:02 AM", "10/1/2019 9:00:03 AM", "10/1/2019 9:02:59 AM", 
 "10/1/2019 9:03:00 AM", "10/1/2019 9:05:00 AM", "10/1/2019 9:06:00 AM", 
 "10/1/2019 9:06:02 AM", "10/1/2019 9:07:00 AM", "10/1/2019 9:07:02 AM", 
 "10/1/2019 9:07:04 AM", "10/1/2019 9:07:05 AM", "10/1/2019 9:07:06 AM", 
 "10/1/2019 9:07:08 AM"), class = "factor")), class = "data.frame", row.names = c(NA, 
 -14L))

I think overall, I would have to create a loop. I believe I have the thought process correct, just unsure of how to formulate the code. This is what I am trying:

 df2 <- mutate(df,
      Date = lubridate::mdy_hms(Date))




   for ( i in 2:nrow(df2))
    {
  if(df2$Read[[i]] == 'T')

     }

I think this may be a start (just placing my conditions within the loop, I am not sure how to complete this)

Any suggestion is appreciated.

Ronak Shah · Accepted Answer · 2020-02-05T04:19:44.993

You can do this without loop. Using dplyr since it is easy to do multiple things using pipes.

We first convert Time column to POSIXct class, create a cond column which gives logical values based on the conditions we want to check, create a column to create groups using cumulative sum of cond column. Keep only the rows which satisfies the condition and get first and last value of Time along with the difference in between them for each group.

library(dplyr)

df %>%
  mutate(Time = lubridate::mdy_hms(Time), 
         cond = Read == "T" & Box == "out" & ID == "", 
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time), 
            endtime = last(Time), 
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)

# A tibble: 3 x 3
#  starttime           endtime             duration
#  <dttm>              <dttm>              <drtn>  
#1 2019-10-01 09:00:01 2019-10-01 09:03:00 179 secs
#2 2019-10-01 09:06:00 2019-10-01 09:06:02   2 secs
#3 2019-10-01 09:07:04 2019-10-01 09:07:06   2 secs

data

I have cleaned up your data a bit and used this as df.

df <- structure(list(Read = c("T", "T", "T", "T", "T", "F", "T", "T", 
"T", "T", "T", "T", "T", ""), Box = c("out", "out", "out", "out", 
"out", "", "out", "out", "in", "in", "out", "out", "out", "hello"
), ID = c("", "", "", "", "", "", "", "", "", "", "", "", "", 
""), Time = c("10/1/2019 9:00:01 AM", "10/1/2019 9:00:02 AM", 
"10/1/2019 9:00:03 AM", "10/1/2019 9:02:59 AM", "10/1/2019 9:03:00 AM", 
"10/1/2019 9:05:00 AM", "10/1/2019 9:06:00 AM", "10/1/2019 9:06:02 AM", 
"10/1/2019 9:07:00 AM", "10/1/2019 9:07:02 AM", "10/1/2019 9:07:04 AM", 
"10/1/2019 9:07:05 AM", "10/1/2019 9:07:06 AM", "10/1/2019 9:07:08 AM"
)), row.names = c(NA, -14L), class = "data.frame")

This works fine! So for my understanding, whenever you wish to put a condition within the code using dplyr, you can utilize the 'cond' function? — Lynn, Feb 05 '20 at 04:20
`cond` is just a placeholder column here which tells us if the condition we are checking for is satisfied or not for that row. If you need to add more conditions you can add there appending them with `&`. — Ronak Shah, Feb 05 '20 at 04:21
Hello, @Ronak, can I add this to the code to filter a certain value? cond = Subject == "^RE|FWD|FW", ignore.case = TRUE as well as a different scenario of cond = Subject !== "^RE|FWD|FW", ignore.case = TRUE — Lynn, Feb 05 '20 at 07:37
If you are looking for a pattern, you cannot use `==` or `!=`, use `grepl`. — Ronak Shah, Feb 05 '20 at 07:49

Create start and endtime columns based on multiple conditions in R (dplyr, lubridate)

1 Answers1

Linked