0

I am running into a possible aggregate/summarise problem with this time series dataset.

Below is a dataset containing unique instances of a chat from a internet chatting system.

chatId  agentId beginning_timestamp ending_timestamp    answer
    1   22      4/07/2016 9:00      4/07/2016 9:30      1
    2   22      4/07/2016 9:26      4/07/2016 9:35      5
    3   22      4/07/2016 9:15      4/07/2016 9:19      5
    4   10      4/07/2016 11:17     4/07/2016 12:13     2
    5   10      4/07/2016 11:29     4/07/2016 11:50     1
  • The unique id field is chatId.
  • agentId is a unique agent who received this chat
  • beginning_timestamp is the time when the chat started
  • ending_timestamp is the time when the chat ended
  • answer is a continous variable used for later analysis

I would like to use dplyr (or anything else) to summarise the following dataset in this manner:

  • For every group(agentId and chatId), find out if there is another chat that occurred within the same timeframe (from beginning_timestamp plus 5 minutes to ending_timestamp - 5 minutes)
  • If there is a chat write out the fact that this unique chat occured while there was another chat ongoing.

Here is a resulting dataset with an additional column(multiple_chats = a discrete variable will do) that I would like to make:

    chatId  agentId beginning_timestamp ending_timestamp    answer   multiple_chats
        1   22      4/07/2016 9:00      4/07/2016 9:30      1          1
        2   22      4/07/2016 9:26      4/07/2016 9:35      5          0
        3   22      4/07/2016 9:15      4/07/2016 9:19      5          1
        4   10      4/07/2016 11:17     4/07/2016 12:13     2          1
        5   10      4/07/2016 11:29     4/07/2016 11:50     1          1

Any answers are appreciated.

treeof
  • 63
  • 8

1 Answers1

1

Looks like some of your overlaps may be different than what you show in your "resulting dataset", but here is a stab at it with some input from here

df <- data.frame(chatID = 1:5, agentID = c(22,22,22,10,10), 
             beginning_timestamp = c('4/07/2016 9:00', '4/07/2016 9:26', '4/07/2016 9:15', '4/07/2016 11:17', '4/07/2016 11:29'),
             ending_timestamp = c('4/07/2016 9:30', '4/07/2016 9:35', '4/07/2016 9:19', '4/07/2016 12:13', '4/07/2016 11:50'),
             answer = c(1,5,5,2,1))

l

ibrary(tidyverse)
library(lubridate)

df %>% 
  mutate(beginning_timestamp = mdy_hm(beginning_timestamp),
         ending_timestamp = mdy_hm(ending_timestamp),
         int = interval(beginning_timestamp + minutes(5), 
                        ending_timestamp - minutes(5))) -> df

df$multiple_chats = unlist(tapply(df$int, df$agentID, 
                           function(x) as.numeric(rowSums(outer(x, x, int_overlaps))>1)))
B Williams
  • 1,992
  • 12
  • 19
  • A quick update to all, I was having issues with my dataset using this idea, but realised that this works only if the dataframe is sorted using agentid. df <- df[order(df$agentId),] before running unlist(tapply(df$int, df$agentID ..... Ref: https://stackoverflow.com/questions/44200364/find-dates-within-a-period-interval-by-group – treeof Dec 17 '17 at 04:26