How to calculate difference between timestamps given some conditions?

Question

I would like to determine the difference between timestamps regading a userID. Here I just want to measure the difference between those users who have a login und logout status. There are some users haven only logout our login status. For them I just would like to mark dem as NA:

Some data:

  library(dplyr)
  start <- as.POSIXct("2012-01-15")
  interval <- 70
  end <- start + as.difftime(1, units="days")
  tseq<- seq(from=start, by=interval*70, to=end)
  employeID <-c("1_e","1_e","2_b","2_b","3_c","3_c","100_c","4_d","4_d","52_f","9_f","9_f","7_u","7_u","10_5","22_2","33_a","33_a")
  status<- c("login","logout","login","logout","login","logout","login","logout","login","logout","login","logout","login","logout","logout","login","logout","login")
  # put together
  data <- data.frame(tseq, employeID, status)

           tseq            employeID   status
  #1  2012-01-15 00:00:00       1_e  login
  #2  2012-01-15 01:21:40       1_e logout
  #3  2012-01-15 02:43:20       2_b  login
  #4  2012-01-15 04:05:00       2_b logout
  #5  2012-01-15 05:26:40       3_c  login
  #6  2012-01-15 06:48:20       3_c logout
  #7  2012-01-15 08:10:00     100_c  login
  #8  2012-01-15 09:31:40       4_d logout
  #9  2012-01-15 10:53:20       4_d  login
  #10 2012-01-15 12:15:00      52_f logout
  #11 2012-01-15 13:36:40       9_f  login
  #12 2012-01-15 14:58:20       9_f logout
  #13 2012-01-15 16:20:00       7_u  login
  #14 2012-01-15 17:41:40       7_u logout
  #15 2012-01-15 19:03:20      10_5 logout
  #16 2012-01-15 20:25:00      22_2  login
  #17 2012-01-15 21:46:40      33_a logout
  #18 2012-01-15 23:08:20      33_a  login  


  test<- data %>% 
    group_by(employeID) %>% 
    mutate(time.difference = tseq - lag(tseq))

However, that seems only to produce a time.difference constant

It is because you've created a constant in your example. Each timestamp have an interval of `interval*70` as you defined in `seq`. So `70*70 = 1.3611 hours`. There is no problem. — cderv, Nov 03 '16 at 09:00
Please don't let your question stagnate. You've got a few answers, and more comments, but you haven't responded to any of them. At a minimum, come back and let us know what worked or didn't work. — rosscova, Nov 07 '16 at 03:38

rosscova · Accepted Answer · 2016-11-03T09:22:30.407

How about this. Mainly, it looks like you're using mutate when you want summarise. Also, I've converted the status column from factor to character, and included an ifelse statement to only take the users with both "login" and "logout" entries:

test <- data %>% 
    mutate( status = as.character( status ) ) %>%
    group_by( employeID ) %>% 
    summarise( time.difference = ifelse( "login" %in% status && "logout" %in% status, 
                                         difftime( tseq[ status == "logout" ], tseq[ status == "login" ] ), 
                                         NA ) 
    )

Which gives:

> head( test )
# A tibble: 6 × 2
employeID time.difference
      <fctr>           <dbl>
1       1_e        1.361111
2      10_5              NA
3     100_c              NA
4       2_b        1.361111
5      22_2              NA
6       3_c        1.361111

As others have suggested, your data does contain constant intervals of time, so wherever there is a relevant value, it's always the same. I assume your actual data looks a little different, so you'll get more sensical output.

I modified your example to include only those `employeID`'s which have both "login" and "logout": `data %>% group_by(employeID) %>% dplyr::filter("login" %in% status & "logout" %in% status) %>% group_by(employeID) %>% dplyr::summarise(interval = difftime(tseq[status == "logout"], tseq[status == "login"]))` — seasmith, Nov 03 '16 at 15:05
The OP wanted to keep them, marked as NA: "There are some users haven only logout our login status. For them I just would like to mark dem as NA:". That's why I wrote the `ifelse` that way. — rosscova, Nov 03 '16 at 19:45

score 1 · Answer 2 · answered Nov 03 '16 at 09:04

We first filter groups which have unpaired status by checking count for each group. With dplyr::do we then calculate time difference for each group

 library(dplyr)
  start <- as.POSIXct("2012-01-15")
  interval <- 70
  end <- start + as.difftime(1, units="days")
  tseq<- seq(from=start, by=interval*70, to=end)
  employeID <-c("1_e","1_e","2_b","2_b","3_c","3_c","100_c","4_d","4_d","52_f","9_f","9_f","7_u","7_u","10_5","22_2","33_a","33_a")
  status<- c("login","logout","login","logout","login","logout","login","logout","login","logout","login","logout","login","logout","logout","login","logout","login")
  # put together
  DF <- data.frame(tseq, employeID, status)

           tseq            employeID   status
  #1  2012-01-15 00:00:00       1_e  login
  #2  2012-01-15 01:21:40       1_e logout
  #3  2012-01-15 02:43:20       2_b  login
  #4  2012-01-15 04:05:00       2_b logout
  #5  2012-01-15 05:26:40       3_c  login
  #6  2012-01-15 06:48:20       3_c logout
  #7  2012-01-15 08:10:00     100_c  login
  #8  2012-01-15 09:31:40       4_d logout
  #9  2012-01-15 10:53:20       4_d  login
  #10 2012-01-15 12:15:00      52_f logout
  #11 2012-01-15 13:36:40       9_f  login
  #12 2012-01-15 14:58:20       9_f logout
  #13 2012-01-15 16:20:00       7_u  login
  #14 2012-01-15 17:41:40       7_u logout
  #15 2012-01-15 19:03:20      10_5 logout
  #16 2012-01-15 20:25:00      22_2  login
  #17 2012-01-15 21:46:40      33_a logout
  #18 2012-01-15 23:08:20      33_a  login  


  testDF<- DF %>% 
    dplyr::group_by(employeID) %>%
    dplyr::filter(count(unique(status)) > 1 ) %>% 
    dplyr::do(.,data.frame(logINTime =.$tseq[.$status=="login"],logOUTTime =.$tseq[.$status=="logout"],
    deltaTime=difftime(.$tseq[.$status=="logout"],.$tseq[.$status=="login"],units="secs"))) %>%
    as.data.frame()


testDF
  # employeID           logINTime          logOUTTime deltaTime
# 1       1_e 2012-01-15 00:00:00 2012-01-15 01:21:40      4900
# 2       2_b 2012-01-15 02:43:20 2012-01-15 04:05:00      4900
# 3       3_c 2012-01-15 05:26:40 2012-01-15 06:48:20      4900
# 4      33_a 2012-01-15 23:08:20 2012-01-15 21:46:40     -4900
# 5       4_d 2012-01-15 10:53:20 2012-01-15 09:31:40     -4900
# 6       7_u 2012-01-15 16:20:00 2012-01-15 17:41:40      4900
# 7       9_f 2012-01-15 13:36:40 2012-01-15 14:58:20      4900

this gives the same result as OP's approach, except considerably longer. — mtoto, Nov 03 '16 at 09:07

score 0 · Answer 3 · answered Nov 03 '16 at 09:03

0

This line seems to create a constant time interval:

tseq<- seq(from=start, by=interval*70, to=end)

So when you take the difference again, wouldn't it be constant?

answered Nov 03 '16 at 09:03

lao_zhang

178
8

I agree, but may I suggest making this a comment, rather than an answer? – rosscova Nov 03 '16 at 09:10

How to calculate difference between timestamps given some conditions?

3 Answers3

Linked