Group by time proximity between adjacent rows

Question

I have a dataset like this

      id  data                time moreData
   <int> <int>              <dttm>    <dbl>
 1     1     4 2017-05-12 18:34:20     4450
 2     2     4 2017-05-12 18:37:07     2800
 3     3     4 2017-05-12 18:37:10     1900
 4     4     4 2017-05-12 18:37:59     1950
 5     5     4 2017-05-12 18:38:40     2500

containing timestamps. You could say these data are "requests to a website" and I want to approximate "sessions".

In other words, I want to put rows 1, 2, ..., n into the same group whenever the time difference between row i and row i+1 is less than, say, 1 minute.

Therefore, the data above would be grouped into {1} and {2,3,4,5}.

Please note this is not a duplicate question of other questions asking about grouping in predetermined time intervals - I do not care how big the time difference between first and last element is, I care only about difference in adjacent rows.

How can I achieve this?

Sample data:

structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

The whole set is several megabytes and contains a lot of non-relevant data I should not share without permission. Is there something you hope to find out from it I could maybe retrieve by myself? Or do you want a bigger sample to test your answer on? — Kiraa, Feb 07 '18 at 22:13
It's generally nice to provide even sample data in a form that others can easily copy into their own consoles, as opposed to pictures of data or table data that isn't amenable to `read_table` or similar. See the faq on [reproducible examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Calum You, Feb 07 '18 at 22:15
Thanks for the additional data. My solution has to exclude a few columns to fit the new data, but the calculation matches. Interesting question. — MKR, Feb 07 '18 at 23:03

pogibas · Accepted Answer · 2018-02-07T22:55:50.350

You can use the difftime function from base R.

Code:

# Wanted time difference in minutes
wantedDiff <- 1
timeDiff <- abs(difftime(df$time[-nrow(df)], 
                         df$time[-1], 
                         units = "mins"))
df$group <- cumsum(c(0, as.numeric(timeDiff >= wantedDiff)))

Result:

   id user                time group
1   1    4 2017-05-12 19:34:20     0
2   2    4 2017-05-12 19:37:07     1
3   3    4 2017-05-12 19:37:10     1
4   4    4 2017-05-12 19:39:32     2
5   5    4 2017-05-12 19:39:33     2
6   6    4 2017-05-12 19:39:57     2
7   7    4 2017-05-12 19:39:58     2
8   8    4 2017-05-12 19:40:00     2
9   9    4 2017-05-12 19:41:49     3
10 10    4 2017-05-12 19:41:52     3
11 11    4 2017-05-12 19:42:19     3
12 12    4 2017-05-12 19:42:20     3
13 13    4 2017-05-12 19:42:23     3
14 14    4 2017-05-12 19:42:23     3
15 15    4 2017-05-12 19:43:24     4
16 16    4 2017-05-12 19:43:25     4
17 17    4 2017-05-12 19:43:27     4
18 18    4 2017-05-12 19:44:52     5
19 19    4 2017-05-12 19:44:53     5
20 20    4 2017-05-12 19:44:55     5

Explanation:

1. Calculate the absolute difference in time between the current and the previous row using difftime
   - We can specify the difference units here
   - The output (timeDiff) looks like this:

Time differences in mins
 [1] 2.78333333 0.05000000 2.36666667 0.01666667 0.40000000 0.01666667 0.03333333 1.81666667 0.05000000 0.45000000
[11] 0.01666667 0.05000000 0.00000000 1.01666667 0.01666667 0.03333333 1.41666667 0.01666667 0.03333333

2. Test whether the difference in time is greater than or equal to wantedDiff and convert this logical output to numeric
3. Apply cumsum to that numeric output (each 1 adds +1, i.e. switches to a new group)
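
To make the diff, test, and cumsum steps easier to follow, here is a minimal sketch that wraps them in a small function and runs it on the first five timestamps from the question. The function name group_by_gap and its argument max_gap_mins are hypothetical and not part of the answer above; the time zone is fixed to UTC only to keep the example reproducible.

```r
# Hypothetical helper (not from the original answer): returns a group number
# for each timestamp, starting a new group whenever the gap to the previous
# timestamp is at least `max_gap_mins` minutes.
group_by_gap <- function(times, max_gap_mins = 1) {
  # Absolute gap between each row and the previous one, in minutes
  gaps <- abs(as.numeric(difftime(times[-1], times[-length(times)],
                                  units = "mins")))
  # The first row always opens group 0; each large gap adds 1
  cumsum(c(0, gaps >= max_gap_mins))
}

# First five timestamps from the question
times <- as.POSIXct(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40"), tz = "UTC")

groups <- group_by_gap(times)
print(groups)
# [1] 0 1 1 1 1
```

Only the first gap (167 seconds) reaches one minute, so row 1 sits alone and rows 2 to 5 share a group, which matches the {1} and {2,3,4,5} grouping asked for in the question.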

Data:

df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), time = structure(c(1494606860, 
1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
"user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Sorry mate. Our approach matched but you have used base `r`. — MKR, Feb 07 '18 at 22:59

score 1 · Answer 2 · answered Feb 07 '18 at 22:29

Here is a solution using an expanded set of your example data. The key parts of this approach are using lubridate::ymd_hms to convert the strings into date-times that you can do arithmetic with, and then lag to work out whether each time is within a minute of the previous row. You can then use a for loop to create the groups, incrementing the group number every time you reach a row that is not within one minute of the previous row. It can certainly be neatened up a little, and I would love to see whether anyone can do this without resorting to a for loop and bind_cols!

library(tidyverse)
tbl <- tibble(
  id = 1:8,
  time = c("2017-05-12 18:34:20",
           "2017-05-12 18:37:07",
           "2017-05-12 18:37:10",
           "2017-05-12 18:37:59",
           "2017-05-12 18:38:40",
           "2017-05-12 18:40:40",
           "2017-05-12 18:40:49",
           "2017-05-12 18:43:40"
           )
)

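tbl2 <- tbl %>%
  mutate(time = lubridate::ymd_hms(time)) %>%
  # Gap to the previous row in seconds (units fixed explicitly); NA for the first row
  mutate(separation = as.numeric(difftime(time, lag(time), units = "secs"))) %>%
  # TRUE when the row is within one minute of the previous row
  mutate(onemin = !is.na(separation) & separation <= 60)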

group_ids <- 1
for (i in 2:nrow(tbl2)) {
  if (!tbl2$onemin[i]) {
    # Gap too large: start a new group
    group_ids[i] <- group_ids[i - 1] + 1
  } else {
    # Still within one minute: stay in the current group
    group_ids[i] <- group_ids[i - 1]
  }
}

tbl2 %>%
  bind_cols(., group = group_ids) %>%
  select(id, time, group)

# A tibble: 8 x 3
     id time                group
  <int> <dttm>              <dbl>
1     1 2017-05-12 18:34:20  1.00
2     2 2017-05-12 18:37:07  2.00
3     3 2017-05-12 18:37:10  2.00
4     4 2017-05-12 18:37:59  2.00
5     5 2017-05-12 18:38:40  2.00
6     6 2017-05-12 18:40:40  3.00
7     7 2017-05-12 18:40:49  3.00
8     8 2017-05-12 18:43:40  4.00
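
One way to avoid the for loop and bind_cols is to flag the rows that open a new group and take a running total of those flags with cumsum. The following is only a sketch of that idea on the same eight timestamps, not the answer author's code; the column names gap_secs and new_group are made up for illustration, and the times are parsed with as.POSIXct in UTC so that only dplyr is needed.

```r
library(dplyr)

tbl <- tibble(
  id = 1:8,
  time = as.POSIXct(c("2017-05-12 18:34:20", "2017-05-12 18:37:07",
                      "2017-05-12 18:37:10", "2017-05-12 18:37:59",
                      "2017-05-12 18:38:40", "2017-05-12 18:40:40",
                      "2017-05-12 18:40:49", "2017-05-12 18:43:40"),
                    tz = "UTC")
)

result <- tbl %>%
  mutate(
    # Seconds since the previous row; NA for the first row
    gap_secs = as.numeric(difftime(time, lag(time), units = "secs")),
    # The first row and every row more than 60 seconds after its
    # predecessor open a new group
    new_group = is.na(gap_secs) | gap_secs > 60,
    # Running count of group starts gives the group number
    group = cumsum(new_group)
  ) %>%
  select(id, time, group)

print(result$group)
# [1] 1 2 2 2 2 3 3 4
```

The group numbers agree with the loop-based output above; the only difference is that cumsum on a logical vector returns integers rather than doubles.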

MKR · Answer 3 · 2018-02-07T23:11:44.397

One possible solution is to use the lag function from the dplyr package and cumsum from base R.

The approach is:

1. Find the time difference between each row and the previous row, in seconds.
2. If difftime is more than 60, then that row starts a new group (newgroup).
3. Perform cumsum on newgroup to get the group number of each row.

The code is:

# data
library(dplyr)
df <- structure(list(id = 1:20, user = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 
     4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
     time = structure(c(1494606860,1494607027, 1494607030, 1494607172, 1494607173, 1494607197, 1494607198, 
          1494607200, 1494607309, 1494607312, 1494607339, 1494607340, 1494607343, 
           1494607343, 1494607404, 1494607405, 1494607407, 1494607492, 1494607493, 
          1494607495), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("id", 
          "user", "time"), row.names = c(NA, -20L), class = c("tbl_df", 
              "tbl", "data.frame"))



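df %>%
  # Seconds since the previous row (units fixed explicitly); 0 for the first row
  mutate(difftime = coalesce(as.numeric(difftime(time, lag(time), units = "secs")), 0)) %>%
  # 1 when the row starts a new group (gap of more than 60 seconds), else 0
  mutate(newgroup = ifelse(difftime > 60, 1, 0)) %>%
  # The running total of the flags is the group number
  mutate(group = factor(cumsum(newgroup))) %>%
  select(id, user, time, group)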

    #Result
      id  user time                group 
   <int> <int> <dttm>              <fctr>
 1     1     4 2017-05-12 17:34:20 0     
 2     2     4 2017-05-12 17:37:07 1     
 3     3     4 2017-05-12 17:37:10 1     
 4     4     4 2017-05-12 17:39:32 2     
 5     5     4 2017-05-12 17:39:33 2     
 6     6     4 2017-05-12 17:39:57 2     
 7     7     4 2017-05-12 17:39:58 2     
 8     8     4 2017-05-12 17:40:00 2     
 9     9     4 2017-05-12 17:41:49 3     
10    10     4 2017-05-12 17:41:52 3     
11    11     4 2017-05-12 17:42:19 3     
12    12     4 2017-05-12 17:42:20 3     
13    13     4 2017-05-12 17:42:23 3     
14    14     4 2017-05-12 17:42:23 3     
15    15     4 2017-05-12 17:43:24 4     
16    16     4 2017-05-12 17:43:25 4     
17    17     4 2017-05-12 17:43:27 4     
18    18     4 2017-05-12 17:44:52 5     
19    19     4 2017-05-12 17:44:53 5     
20    20     4 2017-05-12 17:44:55 5
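
Once each row has a group number, the groups can be collapsed into one row per approximate session. The sketch below is an illustration only, under the assumption that a per-session request count, start, end, and duration are what is wanted; it uses the first eight timestamps of the sample data, and the column names requests, start, end, and duration_secs are hypothetical.

```r
library(dplyr)

# First eight timestamps of the sample data, as seconds since the epoch
df <- tibble(
  id = 1:8,
  time = as.POSIXct(c(1494606860, 1494607027, 1494607030, 1494607172,
                      1494607173, 1494607197, 1494607198, 1494607200),
                    origin = "1970-01-01", tz = "UTC")
)

sessions <- df %>%
  mutate(
    # Seconds since the previous row; NA for the first row
    gap = as.numeric(difftime(time, lag(time), units = "secs")),
    # A gap of more than 60 seconds starts a new group; the NA in the
    # first row counts as "no new group", so numbering starts at 0
    group = cumsum(coalesce(gap > 60, FALSE))
  ) %>%
  group_by(group) %>%
  summarise(
    requests = n(),
    start = min(time),
    end = max(time),
    duration_secs = as.numeric(difftime(max(time), min(time), units = "secs"))
  )

print(sessions)
```

For these eight rows the summary has three sessions with 1, 2, and 5 requests, lasting 0, 3, and 28 seconds respectively.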
