2

I have a data set with a structure such as this:

structure(list(id = c(43956L, 46640L, 71548L, 71548L, 71548L, 
72029L, 72029L, 74558L, 74558L, 100596L, 100596L, 100596L, 104630L, 
104630L, 104630L, 104630L, 104630L, 104630L, 104630L, 104630L
), event = c("LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN", 
"LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGIN", 
"LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT"
), timestamp = c("2017-03-27 09:19:29", "2016-06-10 00:09:08", 
"2016-01-27 12:00:25", "2016-06-20 11:34:29", "2016-06-20 11:35:44", 
"2016-12-28 10:43:25", "2016-12-28 10:56:30", "2016-10-15 15:08:39", 
"2016-10-15 15:10:06", "2016-03-09 14:30:48", "2016-03-09 14:31:10", 
"2017-04-03 10:36:54", "2016-01-11 16:52:08", "2016-02-03 14:40:32", 
"2016-03-30 12:34:56", "2016-05-26 13:14:25", "2016-08-22 15:20:02", 
"2016-08-22 15:21:53", "2016-08-22 15:22:23", "2016-08-22 15:23:08"
)), .Names = c("id", "event", "timestamp"), row.names = c(5447L, 
5446L, 5443L, 5444L, 5445L, 5441L, 5442L, 5439L, 5440L, 5436L, 
5437L, 5438L, 5425L, 5426L, 5427L, 5428L, 5429L, 5430L, 5431L, 
5432L), class = "data.frame")

         id  event           timestamp
5447  43956  LOGIN 2017-03-27 09:19:29
5446  46640  LOGIN 2016-06-10 00:09:08
5443  71548  LOGIN 2016-01-27 12:00:25
5444  71548  LOGIN 2016-06-20 11:34:29
5445  71548 LOGOUT 2016-06-20 11:35:44
5441  72029  LOGIN 2016-12-28 10:43:25
5442  72029 LOGOUT 2016-12-28 10:56:30
5439  74558  LOGIN 2016-10-15 15:08:39
5440  74558 LOGOUT 2016-10-15 15:10:06
5436 100596  LOGIN 2016-03-09 14:30:48
5437 100596 LOGOUT 2016-03-09 14:31:10
5438 100596  LOGIN 2017-04-03 10:36:54
5425 104630  LOGIN 2016-01-11 16:52:08
5426 104630  LOGIN 2016-02-03 14:40:32
5427 104630  LOGIN 2016-03-30 12:34:56
5428 104630  LOGIN 2016-05-26 13:14:25
5429 104630  LOGIN 2016-08-22 15:20:02
5430 104630 LOGOUT 2016-08-22 15:21:53
5431 104630  LOGIN 2016-08-22 15:22:23
5432 104630 LOGOUT 2016-08-22 15:23:08

I wish to calculate the time difference between LOGIN and LOGOUT (session duration) as well as between LOGOUT and LOGIN (session interval). Unfortunately, I have LOGIN events that do not have a matching LOGOUT event.

The correct LOGOUT event always follows its' corresponding LOGIN event (as I ordered the data frame based on id and timestamp. I tried adapting this answer, but have had no luck. I also tried creating an event identifier, but since I can't find a way to get the numbering for the LOGOUT event to match the numbering for the LOGIN event, I am unsure as to how useful such an identifier will be:

df$eventNum <- as.numeric(ave(as.character(df$id), df$id, as.character(df$event), FUN = seq_along))
kneijenhuijs
  • 1,189
  • 1
  • 12
  • 21
  • Have you tried using the lag function inside mutate on the dplyr package? If you order by id and date, and then group by id it should work – comendeiro Aug 31 '17 at 10:11
  • If I do this, I create a lagged timestamp as well as event. The issue I run into is then creating a variable that defines whether the difference between current timestamp en previous timestamp is part of a session duration or interval. I can fix this with an `ifelse()` statement: `statistic=ifelse(event==lastevent, NA, ifelse(event=="LOGIN", "duration", "interval"))`. However, this crosses between `id` levels. – kneijenhuijs Aug 31 '17 at 10:36

2 Answers2

2

Assuming that any user will stay logged in indefinitely until logs out, it seems the data can be ordered in a way so that a simple "lag" function will do the trick.

Using the library dplyr and assuming that you've called your dataframe "df" and you have already converted the timestamp to a date format such as POSIXct:

df %>% arrange(id,timestamp) %>%
  group_by(id,event)%>%
  mutate(rank = dense_rank(timestamp)) %>%
  ungroup() %>%
  arrange(id, rank,timestamp) %>%
  group_by(id)%>%
  mutate(duration = ifelse(event == "LOGOUT", timestamp- lag(timestamp),NA))

Line by line.

First, we order the data by "id" and "timestamp" and we group by "id" and "event" to assign the rank of the login and logout events. The First login for the same user will have the "rank" 1 and the first log out for that user will also have the "rank" 1.

df %>% arrange(id,timestamp) %>%
  group_by(id,event)%>%
  mutate(rank = dense_rank(timestamp))

Then, we remove the groupings of the data and we sort again by id, rank and timestamp. This will yield a dataframe with the right order, with the LOGIN events followed by LOGOUT events for each user, so we can apply a lag calculation.

  ungroup() %>%
  arrange(id, rank,timestamp) %>%

Finally, we group again by "id" and we use mutate to calculate the lag of the timestamps only for the LOGOUT events.

  group_by(id)%>%
  mutate(duration = ifelse(event == "LOGOUT", timestamp- lag(timestamp),NA))

That should yield a dataframe such as:

id  event           timestamp  rank     duration
    <int>  <chr>              <dttm> <int>        <dbl>
1   43956  LOGIN 2017-03-27 09:19:29     1           NA
2   46640  LOGIN 2016-06-10 00:09:08     1           NA
3   71548  LOGIN 2016-01-27 12:00:25     1           NA
4   71548 LOGOUT 2016-06-20 11:35:44     1 208715.31667
5   71548  LOGIN 2016-06-20 11:34:29     2           NA
6   72029  LOGIN 2016-12-28 10:43:25     1           NA
7   72029 LOGOUT 2016-12-28 10:56:30     1     13.08333
8   74558  LOGIN 2016-10-15 15:08:39     1           NA
9   74558 LOGOUT 2016-10-15 15:10:06     1      1.45000
10 100596  LOGIN 2016-03-09 14:30:48     1           NA
11 100596 LOGOUT 2016-03-09 14:31:10     1     22.00000
comendeiro
  • 816
  • 7
  • 14
2

Here's the approach I'd take:

First, I'd convert the event variable to an ordered factor, because it makes sense to think of its values this way (i.e. Login < Logout, in terms of order), and because it will enable easier comparison between rows:

df$event <- factor(df$event, levels = c("LOGIN", "LOGOUT"), ordered = T)

Then, assuming that timestamp is in a viable format, as this would provide:

df$timestamp <- lubridate::parse_date_time(df$timestamp, "%Y-%m-%d %H:%M:%S")

You can conditionally mutate your data.frame by grouping by ID and then calling mutate with ifelse functions:

df %>% group_by(id) %>% mutate(
  timeElapsed = ifelse(event != lag(event), lubridate::seconds_to_period(timestamp - lag(timestamp)), NA),
  eventType = ifelse(event > lag(event), 'Duration', ifelse(event < lag(event), 'Interval', NA))
  )
#        id  event           timestamp timeElapsed eventType
#     <int>  <ord>              <dttm>       <dbl>     <chr>
#  1  43956  LOGIN 2017-03-27 09:19:29          NA      <NA>
#  2  46640  LOGIN 2016-06-10 00:09:08          NA      <NA>
#  3  71548  LOGIN 2016-01-27 12:00:25          NA      <NA>
#  4  71548  LOGIN 2016-06-20 11:34:29          NA      <NA>
#  5  71548 LOGOUT 2016-06-20 11:35:44     1.25000  Duration
#  6  72029  LOGIN 2016-12-28 10:43:25          NA      <NA>
#  7  72029 LOGOUT 2016-12-28 10:56:30    13.08333  Duration
#  8  74558  LOGIN 2016-10-15 15:08:39          NA      <NA>
#  9  74558 LOGOUT 2016-10-15 15:10:06     1.45000  Duration
# 10 100596  LOGIN 2016-03-09 14:30:48          NA      <NA>
# 11 100596 LOGOUT 2016-03-09 14:31:10    22.00000  Duration
# 12 100596  LOGIN 2017-04-03 10:36:54    44.00000  Interval
# 13 104630  LOGIN 2016-01-11 16:52:08          NA      <NA>
# 14 104630  LOGIN 2016-02-03 14:40:32          NA      <NA>
# 15 104630  LOGIN 2016-03-30 12:34:56          NA      <NA>
# 16 104630  LOGIN 2016-05-26 13:14:25          NA      <NA>
# 17 104630  LOGIN 2016-08-22 15:20:02          NA      <NA>
# 18 104630 LOGOUT 2016-08-22 15:21:53    51.00000  Duration
# 19 104630  LOGIN 2016-08-22 15:22:23    30.00000  Interval
# 20 104630 LOGOUT 2016-08-22 15:23:08    45.00000  Duration

Using lubridate::seconds_to_period will give you the time difference in "%d %H %M %S" format.

cmaher
  • 5,100
  • 1
  • 22
  • 34
  • When I run your exact code in my session, it does not properly group by `id`, leading to durations and intervals being calculated across `id` values. Any idea what could be going wrong? – kneijenhuijs Aug 31 '17 at 11:37
  • 1
    Hmm... it could be a namespace collision, as described in [this answer](https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with). Do you perhaps have another package loaded with a group_by function? In any case, can you try using `dplyr::group_by()`? – cmaher Aug 31 '17 at 11:43
  • That doesn't seem to be it. I am completely confused, as this also happened earlier when I was experimenting by myself. When I delete the `group_by` statement, I get the exact same output as when it is included. I checked the structure of my data frame for differences in variable types from yours, but they are all exactly the same. Also nothing interfering in the session, as the same issue occurs after restarting the session and running only the code necessary for the example. – kneijenhuijs Aug 31 '17 at 11:51
  • 1
    Here are a few other long shots that come to mind: 1) try reloading R, but instead of loading tidyverse, just load dplyr & lubridate, and then run your code 2) try renaming the `id` variable to something else and see if the grouping fails anyway (this is a real long shot) and 3) check your dplyr & tiyverse versions (I'm using 0.7.2 and 1.1.11) to see if you're perhaps using older package versions. – cmaher Aug 31 '17 at 12:11
  • One other idea: test this `df %>% group_by(id) %>% mutate(test = seq(n()))` on your data.frame to see if it properly returns row numbering for your grouped rows. – cmaher Aug 31 '17 at 12:19
  • Only loading dplyr & lubridate seems to work (I also ended up updating, since you reminded me I hadn't done that for a couple of weeks, but that didn't make a difference). Strange, but at least it works. Thanks a bunch! – kneijenhuijs Aug 31 '17 at 12:52