How to group time instances, find its duration and sum them based on specific conditions (in R or Python)

Question

Objective

I have a dataset that tracks when someone was editing a message, left, and came back to resume editing that same message. I am trying to group this one message together and find its total duration.

Problem

I need to first group the message together if the following conditions apply: if the Folder == 'out' OR drafts, the Message == "", Edit == "T", and if the contents in the Subject and Re columns are consecutively the same.

Ideally, this would give me group A along with its duration. For example, this first 'block' of data would be labeled 'Group A' , would have a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM

Furthermore, I would like to 'match' group A with another 'block' of data if the last row of the Subject, Re and Length column matches its first row. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 



hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        




hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T

Desired Output

 Start                  End                        Duration          Group
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

dput:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"), 
 Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
 1L, 1L, 2L, 2L), .Label = c("", "sarah@mail.com,gee@mail.com"
 ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 
 NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L, 
 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
 "draft", "out"), class = "factor"), Message = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("", 
 "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM", 
 "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM", 
 "1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
 ), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
 NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA, 
-13L))

I am thinking I may have to do an if/else statement or some sort of filter. Any help/suggestion or advice would be greatly appreciated.

  library(tidyverse)
  library(lubridate)



  df1<-df2 %>%
  mutate(Date = lubridate::mdy_hms(Date), 
  cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" &     Message == "" & Subject ==  ?   & Re ==   ?     & Length == ?   , 
     grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%

I think you're on the right track with your code at the bottom. There's a couple issues there though, first `Edit == "T"` won't work since `Edit` is logical and you're comparing it to the character `"T"`. Second, the way you entered data Message is `NA` if blank, not empty character. So you either need to convert `NA` to `""` or use `is.na(Message)`. I don't really understand how you mean to define the Groups, and since your data example has all the same Subject RE and Length I don't get the logic you're trying to build — astrofunkswag, Feb 27 '20 at 19:31
ok thank you @astrofunkswag, basically the message is being timed based on how long you stay editing it. So the very first 'block' is just a message that someone was editing from 1:00:01 AM to 1:00:30 AM. I can see that this is the same message because the subject, recipients and length is all the same. Furthermore, while editing this message, the user steps away from the PC and then resumes, this is where the message start time picks up again at 1/2/2020 1:02:00 AM — Lynn, Feb 27 '20 at 19:37
The `Subject` is blank for the first two rows of the dataframe, should that be a different group or should the subject be `hey`? — astrofunkswag, Feb 27 '20 at 19:41
Hi @astrofunkswag yes if there is a blank, it wont get counted, so the first time starts at 1/2/2020 1:00:10 AM — Lynn, Feb 27 '20 at 19:43
I just shortened the word True to T for space sake here. In my original data, Edit column is "True" — Lynn, Feb 27 '20 at 19:48

score 1 · Accepted Answer · answered Feb 27 '20 at 21:00

1

The solution is going to be something like this. There are some differences between the data you provided and your desired solution, the column types are a little different (like Edit is provided as a logical). Also, the rows are a little different so this won't be the same as your desired output (you provide a length of 100 in the data, not in your example)

I used a function from this post to assign Group labels. It gives a number you could modify to make it a letter if you want, but you need to handle what would happen if you have over 26 groups.

df <- df %>% 
  mutate(Message = ifelse(is.na(Message), '', Message)) %>% 
  mutate(
    cond = Edit & Folder %in% c('out', 'draft') & Message == ''
  ) %>% 
  mutate(grp = cumsum(!cond)) %>% 
  filter(cond, Subject != '')

df$Date <- as.POSIXct(as.character(df$Date), format = '%m/%d/%Y %H:%M:%OS')

get_group_number = function(){
  i = 0
  function(){
    i <<- i+1
    i
  }
}
group_number = get_group_number()
df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Group = group_number())

df %>% group_by(grp) %>% 
   summarise(Start = min(Date), End = max(Date),
             Duration = End - Start, Group = unique(Group)[1])
# A tibble: 3 x 5
    grp Start               End                 Duration Group
  <int> <dttm>              <dttm>              <drtn>   <dbl>
1     0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs      1
2     2 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs      1
3     4 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs      2

I think you might also want to consider how you define groups, since it seems like you could have the same Subject, Recipient and Length by chance alone

answered Feb 27 '20 at 21:00

astrofunkswag

2,608
12
25

Thank you- I will try this. Is there a way to actually show the 'subject' in the output?? – Lynn Feb 27 '20 at 21:02
When you are saying 'Group by(subject, recipient and length), does this mean, essentially we are placing the block into 'groups' if the subject, recipient and Length are the same? – Lynn Feb 27 '20 at 21:03
You can add the Subject in the same way I added Group, since there should be a number of identical subject values and you can just take the first. Yeah the `group_by` would correspond to how you defined Group A if I understood it correctly. – astrofunkswag Feb 27 '20 at 21:09
thank you so much @astrofunkswag, I am trying to understand why the last row of your output is not in Group 1 as well and why the seconds are 20 and not 10.Furthermore, I can display the Subject field by incorporating: df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Subject = group_number())? – Lynn Feb 27 '20 at 21:17
Is there a way to filter if the Subject is not consecutively the same? @astrofunkswag - I am testing your code now. Thank you – Lynn Feb 27 '20 at 21:29
because the example data you provided in your dput was different that your example. Run the dput data and you'll see the last length is 100 so it's a different group – astrofunkswag Feb 27 '20 at 21:53
I see thank you @astrofunkswag, this is a great start, I now need to be able to actually display the subject, so I know it is the same message. I know you mention to do it the same way you did Group, but its not outputting for me. thank you! – Lynn Feb 27 '20 at 21:59
Also @astrofunkswag, is there a way to not filter the blank subject? I suppose I would just remove filter(cond, Subject != '') remove the subject! Furthermore to reveal the subject, I would add this? mutate(Subject = group_number()) ?? Sorry I’m still learning – Lynn Feb 29 '20 at 08:04
1

To view the subject in the final output it you would add to the summarize `Subject = unique(Subject)[1]`, just like for Group. Confusing because Group is one of your column names and also a function you're using. Good coding practice maybe change the column name to keep it clear – astrofunkswag Mar 02 '20 at 17:37
1

You could take the `Subject != ""` out of the filter, but you said in a previous comment that if Subject is blank it won't get counted so that is what I was trying to match – astrofunkswag Mar 02 '20 at 17:38

How to group time instances, find its duration and sum them based on specific conditions (in R or Python)

1 Answers1