Create timestamp columns based on multiple filtering conditions (R, dplyr)

Question

I have a dataset, df

 Read      Box       ID      Time                             Subject 
 T         out               10/1/2019 9:00:01 AM
 T         out               10/1/2019 9:00:02 AM             Re:
 T         out               10/1/2019 9:00:03 AM             Re:
 T         out               10/1/2019 9:02:59 AM             Re:
 T         out               10/1/2019 9:03:00 AM
 F                           10/1/2019 9:05:00 AM
 T         out               10/1/2019 9:06:00 AM             Fwd:
 T         out               10/1/2019 9:06:02 AM             Fwd:
 T         in                10/1/2019 9:07:00 AM
 T         in                10/1/2019 9:07:02 AM
 T         out               10/1/2019 9:07:04 AM
 T         out               10/1/2019 9:07:05 AM             Fw:
 T         out               10/1/2019 9:07:06 AM             Fw:
           hello             10/1/2019 9:07:08 AM

Based on certain conditions within this dataset, I would like to create a startime column and an endtime column.

I would like to create a 'starttime' when the following occurs: If the Subject columns first word begins with RE:, re, FWD, or FW (in a consecutive manner), Read == "T", Box == "out" and ID == ""

When the first instance of this condition occurs, a starttime will be generated. For example for this dataset, the starttime will be 10/1/2019 9:00:02 AM since this is where we see the desired conditions occurs first (Subject is either FW:, RE: or FWD, Read = T, Box = out and ID = "" ) However, the moment when anyone of these conditions is not true, an endtime will be created. So the first endtime would occur right before row 4, where the time is 10/1/2019 9:02:59 AM. My ultimate goal is to then create a duration column for this.

This is my desired output when including RE, Fwd, and Fw

  starttime                    endtime                     duration

  10/1/2019 9:00:02 AM        10/1/2019 9:02:59 AM         177 secs
  10/1/2019 9:06:00 AM        10/1/2019 9:06:02 AM         2 secs
  10/1/2019 9:07:05 AM        10/1/2019 9:07:06 AM         1 secs

Furthermore, how would I specify in a separate code to create the start and endtimes for these conditions: Read = T, Box = out, ID = "" and when the the first word in the subject column does not contain Re, Fwd, or Fw?

 Read      Box       ID      Time                             Subject 
 T         out               10/1/2019 9:00:01 AM
 T         out               10/1/2019 9:00:02 AM             Re:
 T         out               10/1/2019 9:00:03 AM             Re:
 T         out               10/1/2019 9:02:59 AM             Re:
 T         out               10/1/2019 9:03:00 AM
 F                           10/1/2019 9:05:00 AM
 T         out               10/1/2019 9:06:00 AM             Fwd:
 T         out               10/1/2019 9:06:02 AM             Fwd:
 T         in                10/1/2019 9:07:00 AM
 T         in                10/1/2019 9:07:02 AM
 T         out               10/1/2019 9:07:04 AM
 T         out               10/1/2019 9:07:05 AM             Fw:
 T         out               10/1/2019 9:07:06 AM             Fw:
           hello             10/1/2019 9:07:08 AM

This is my desired output when excluding RE, Fwd, and Fw

  starttime                    endtime                     duration

  10/1/2019 9:00:01 AM        10/1/2019 9:00:01 AM         0 secs
  10/1/2019 9:03:00 AM        10/1/2019 9:03:00 AM         0 secs
  10/1/2019 9:07:04 AM        10/1/2019 9:07:04 AM         0 secs

dput:

 structure(list(Read = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 3L, 
3L, 3L, 3L, 4L, 4L, 3L, 1L), .Label = c("", "F", "T", "T "), class = "factor"), 
Box = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 
3L, 3L, 3L, 1L), .Label = c("", "in", "out"), class = "factor"), 
ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L), .Label = c("", "hello"), class = "factor"), 
Time = structure(1:14, .Label = c("10/1/2019 9:00:01 AM", 
"10/1/2019 9:00:02 AM", "10/1/2019 9:00:03 AM", "10/1/2019 9:02:59 AM", 
"10/1/2019 9:03:00 AM", "10/1/2019 9:05:00 AM", "10/1/2019 9:06:00 AM", 
"10/1/2019 9:06:02 AM", "10/1/2019 9:07:00 AM", "10/1/2019 9:07:02 AM", 
"10/1/2019 9:07:04 AM", "10/1/2019 9:07:05 AM", "10/1/2019 9:07:06 AM", 
"10/1/2019 9:07:08 AM"), class = "factor"), Subject = structure(c(1L, 
4L, 4L, 4L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("", 
"Fw:", "Fwd:", "Re:"), class = "factor")), class = "data.frame", row.names = c(NA, 
-14L))

The code that was suggested works, I just would like to also incorporate the Subject column conditions as well:
Where Subject == FW, FWD, RE (ignore the upper/lowercase) and Where Subject is not equal to FW, FWD, Re (ignore upper/lowercase)

library(dplyr)

df %>%
mutate(Time = lubridate::mdy_hms(Time), 
cond = Read == "T" & Box == "out" & ID == "" & Subject == "FW" & Subject  == "FWD" & Subject == "RE" (ignore.case = TRUE)
grp = cumsum(!cond)) %>%
filter(cond) %>%
group_by(grp) %>%
summarise(starttime = first(Time), 
endtime = last(Time), 
duration = difftime(endtime, starttime, units = "secs")) %>%
select(-grp)

library(dplyr)

df %>%
mutate(Time = lubridate::mdy_hms(Time), 
cond = Read == "T" & Box == "out" & ID == "" & Subject! == "FW" & Subject! == "FWD" & Subject! == "RE" (ignore.case = TRUE)
grp = cumsum(!cond)) %>%
filter(cond) %>%
group_by(grp) %>%
summarise(starttime = first(Time), 
endtime = last(Time), 
duration = difftime(endtime, starttime, units = "secs")) %>%
select(-grp)

score 1 · Accepted Answer · answered Feb 11 '20 at 09:41

1

There is a whole part of your problem that was already answered in your other question (Create start and endtime columns based on multiple conditions in R (dplyr, lubridate)). I know this can be difficult but next time please try to reduce your problem to a smaller one by focusing on what you do not already know.

If you want to detect a substring, the best way is to use str_detect from the stringr package (part of the tidyverse):

library(tidyverse)
library(lubridate)
df %>%
  mutate(Time = mdy_hms(Time), 
         # cond = Read == "T" & Box == "out" & ID == "", #from the answer https://stackoverflow.com/a/60068929/3888000
         cond = Read == "T" & Box == "out" & ID == "" & str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #including those subjects
         # cond = Read == "T" & Box == "out" & ID == "" & !str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #excluding those subjects
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time), 
            endtime = last(Time), 
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)

This uses regular expressions (regex), which is a very good thing to learn. This one is very easy to read as it only has the OR (|) operator, but possibilities are limitless.

answered Feb 11 '20 at 09:41

Dan Chaltiel

7,811
5
47
92

Thank you I will try this! Is there also a way to incorporate a thresh within this code? say thresh <- 5 mutate(Time= lubridate::mdy_hms(Time), gap = c(0, diff(Time))) %>% group_by(grp = cumsum(gap > gap_threshold)) %>% summarise(begin = min(Time), end = max(Time), duration = difftime(end, begin, units = "secs")) – Lynn Feb 12 '20 at 17:30
@TanishaHudson This seems like another algorithm, which may therefore deserve another question. What is wrong with this code? It works in my console at least. – Dan Chaltiel Feb 12 '20 at 17:35
Trying to do this @Dan Chaltiel: thresh<-25 df %>% mutate(Time = mdy_hms(Time), cond = Read == "T" & Box == "out" & ID== "" & str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), grp = cumsum(!cond)) %>% filter(cond) %>% gap = c(0, diff(Time)) %>% group_by(grp = cumsum(gap > thresh))%>% summarise(begin = min(Time), end = max(Time), duration = difftime(end, begin, units = "secs")) %>% select(-grp) – Lynn Feb 12 '20 at 18:16
@TanishaHudson bis :-) This seems like another algorithm, which may therefore deserve another question. What is wrong with this code? It works in my console at least. (you can use "`" for code in comments too) – Dan Chaltiel Feb 13 '20 at 09:33
Nothing is wrong with your code. It word well, I was also trying to incorporate a ‘thresh’ within it. Thank you! – Lynn Feb 13 '20 at 15:52
@TanishaHudson What is wrong with your new code, not mine – Dan Chaltiel Feb 14 '20 at 13:06
Yes, your code works well. I believe the syntax is incorrect within my code. I just would like to incorporate a thresh number within my code, so that it 'breaks' the time when the gap between two datetime stamps exceed 35 seconds. Ill keep trying – Lynn Feb 14 '20 at 19:14

Create timestamp columns based on multiple filtering conditions (R, dplyr)

1 Answers1