
Edit: I'm now also looking for solutions to this question in other programming languages.

Based on the other question I asked, I have a dataset like this (for R users, the dput output is below) which represents user computer sessions:

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2     user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15    USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47

There may be several concurrent (overlapping in time) sessions for the same username from the same computer. How can I remove those rows so that only one session is left in this data? The original dataset has approx. 500,000 rows.

The expected output is (rows 2 and 15 removed):

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3     user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4     user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5     user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6     user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7     user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8     user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9     user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10    USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11    USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12    USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13    USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14    USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43

Here is the dataset:

structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")
jrara

5 Answers


Try the intervals package:

library(intervals)

f <- function(dd) with(dd, {
    r <- reduce(Intervals(cbind(start, end)))
    data.frame(username = username[1],
         machine = machine[1],
         start = structure(r[, 1], class = class(start)),
         end = structure(r[, 2], class = class(end)))
})

do.call("rbind", by(d, d[1:2], f))

With the sample data this reduces the 15 rows to the following 13 rows (by combining rows 1 and 2 and rows 14 and 15 in the original data frame):

   username          machine               start                 end
1     user1 D5599.domain.com 2011-01-03 02:44:18 2011-01-03 03:09:16
2     user1 D5599.domain.com 2011-01-03 07:07:36 2011-01-03 07:56:17
3     user1 D5599.domain.com 2011-01-05 08:03:17 2011-01-05 08:23:15
4     user1 D5599.domain.com 2011-02-14 07:33:39 2011-02-14 07:40:16
5     user1 D5599.domain.com 2011-02-23 06:54:30 2011-02-23 06:58:23
6     user1 D5599.domain.com 2011-03-21 04:10:18 2011-03-21 04:32:22
7     user1 D5645.domain.com 2011-06-09 03:12:41 2011-06-09 03:58:59
8     user1 D5682.domain.com 2011-01-03 05:03:45 2011-01-03 05:29:43
9     USER2 D5682.domain.com 2011-01-12 07:26:05 2011-01-12 07:32:53
10    USER2 D5682.domain.com 2011-01-17 08:06:19 2011-01-17 08:44:22
11    USER2 D5682.domain.com 2011-01-18 08:07:30 2011-01-18 08:42:43
12    USER2 D5682.domain.com 2011-01-25 08:20:55 2011-01-25 08:24:38
13    USER2 D5682.domain.com 2011-02-14 07:59:23 2011-02-14 08:14:47
G. Grothendieck
  • Thanks, this seems to do what I want! I'll have to check if this is efficient enough for my original data.frame, with 500 000 rows. – jrara Jan 26 '12 at 13:47
  • I would give you +100 if I could! I did not get this to work on my main data because it exceeded the memory on my computer, but I got it to work with a smaller subset of data (one month). That's OK for me. – jrara Jan 26 '12 at 18:47

One solution is to first split the intervals, so that they are sometimes equal but never partially overlapping, and then remove the duplicates. The problem is that we are left with many small abutting intervals, and merging them does not look straightforward.

library(plyr)   # for ddply
library(sqldf)
d$machine <- as.character( d$machine ) # Duplicated levels...
ddply( d, c("username", "machine"), function (u) {
  # For each username and machine, 
  # compute all the possible non-overlapping intervals
  intervals <- sort(unique( c(u$start, u$end) ))
  intervals <- data.frame( 
    start = intervals[-length(intervals)], 
    end   = intervals[-1] 
  )
  # Only retain those actually in the data
  u <- sqldf( "
    SELECT DISTINCT u.username, u.machine, 
                    intervals.start, intervals.end
    FROM  u, intervals 
    WHERE       u.start <= intervals.start 
    AND   intervals.end <=         u.end
  " )
  # We have non-overlapping, but potentially abutting intervals:
  # ideally, we should merge them, but I do not see an easy 
  # way to do so.
  u
} )

EDIT: Another, conceptually cleaner solution, which fixes the non-merged abutting intervals problem, is to count the number of open sessions for each user and machine: when the count stops being zero, the user has logged in (with one or more sessions); when it drops back to zero, the user has closed all of his/her sessions.

ddply( d, c("username", "machine"), function (u) {
  a <- rbind( 
    data.frame( time = min(u$start) - 1, sessions = 0 ),
    data.frame( time = u$start, sessions = 1 ), 
    data.frame( time = u$end,   sessions = -1 ) 
  )
  a <- a[ order(a$time), ]
  a$sessions <- cumsum(a$sessions)
  a$previous <- c( 0, a$sessions[ - nrow(a) ] )
  a <- a[ a$previous == 0 & a$sessions  > 0 | 
          a$previous  > 0 & a$sessions == 0, ]
  a$previous_time <- a$time
  a$previous_time[-1] <- a$time[ -nrow(a) ]
  a <- a[ a$previous > 0 & a$sessions == 0, ]
  a <- data.frame( 
    username = u$username[1],
    machine  = u$machine[1],
    start = a$previous_time,
    end   = a$time
  )
  a
} )
Vincent Zoonekynd
  • Thanks for your contribution. This seems to add rows, not remove them. This seems to be quite a difficult task to solve. – jrara Jan 26 '12 at 11:59
  • If a user has two sessions on the same machine, say from 08:00 to 10:00 and from 09:00 to 12:00, those overlapping intervals are split into 08:00 to 09:00, 09:00 to 10:00 and 10:00 to 12:00: there are no more overlapping intervals, there are no duplicates, but since the intervals have been cut into smaller pieces, there are more of them. Merging them when they abut (here, into a single interval 09:00 to 12:00) looks harder. – Vincent Zoonekynd Jan 26 '12 at 12:14
  • I'm not sure if I understand what you mean. My idea was to remove those rows (except one of them) for a user on the same computer which have overlapping intervals. E.g. if I have intervals 8:30 to 9:30 and 8:45 to 10:00, the 8:45 to 10:00 row should be removed because it has one overlapping interval row in the data. – jrara Jan 26 '12 at 12:18
  • I added the expected output into my question. – jrara Jan 26 '12 at 12:21
  • I have found a way to merge the intervals and edited my answer. If you really want to keep only the first connection and discard the others (somehow, I do not like to discard data), you can just join the result with the initial data on the "username", "machine" and "start" columns, to get the end of the first session instead of the last. – Vincent Zoonekynd Jan 26 '12 at 12:46
  • Thanks, yes, in this case I want to remove those rows. The idea is just to get user counts, so if the user has more than one session open, discard those sessions and count one session. My original data.frame has 500,000 rows. – jrara Jan 26 '12 at 13:07

Alternate solution using the interval class from lubridate.

library(lubridate)
int <- with(d, new_interval(start, end))

Now we need a function to test for overlaps. See Determine Whether Two Date Ranges Overlap.

int_overlaps <- function(int1, int2)
{
  (int_start(int1) <= int_end(int2)) & 
  (int_start(int2) <= int_end(int1))
}

Now call this on all pairs of intervals.

index <- combn(seq_along(int), 2)
overlaps <- int_overlaps(int[index[1, ]], int[index[2, ]])

The overlapping rows:

int[index[1, overlaps]]
int[index[2, overlaps]]

And the rows to remove are simply index[2, overlaps].
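The same pairwise test can be sketched in base R without lubridate, operating directly on the POSIXct vectors (the sample timestamps below are taken from the question's first three rows; the overlap condition is the one from the linked question):

```r
# Pairwise overlap test in base R: two intervals overlap
# when start1 <= end2 and start2 <= end1.
start <- as.POSIXct(c("2011-01-03 09:44:18", "2011-01-03 09:46:29",
                      "2011-01-03 14:07:36"))
end   <- as.POSIXct(c("2011-01-03 09:47:27", "2011-01-03 10:09:16",
                      "2011-01-03 14:56:17"))

index    <- combn(seq_along(start), 2)
overlaps <- (start[index[1, ]] <= end[index[2, ]]) &
            (start[index[2, ]] <= end[index[1, ]])

# Drop the second member of each overlapping pair (here, row 2):
keep <- setdiff(seq_along(start), unique(index[2, overlaps]))
```

Note that `combn` materializes all pairs, so this approach is quadratic in the number of rows and, as discussed in the comments below, will not scale to 500,000 rows.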

Richie Cotton
  • Thanks, it works perfectly with the example data. With my original data I get an error: > index <- combn(seq_along(int), 2) Error in matrix(r, nrow = len.r, ncol = count) : invalid 'ncol' value (too large or NA) In addition: Warning message: In combn(seq_along(int), 2) : NAs introduced by coercion I guess my dataset is too large for this? – jrara Jan 26 '12 at 17:46
  • @jrara: Yes, retrospectively, the use of `combn` is total overkill and not a good idea because of the memory usage. You can still use `lubridate` intervals, but try them with the algorithm in Hobb's answer. – Richie Cotton Jan 27 '12 at 12:02

Pseudocode solution: O(n log n), O(n) if the data is already known to be sorted properly.

Start by sorting the data by user, by machine, and by start time (so that all of the rows for a given user on a given machine are grouped together, and the rows within each group are in ascending order of start time).

  1. Initialize the "working interval" to null/nil/undef/etc.

  2. For each row in order:

    • If the working interval exists and belongs to a different user or a different machine than the current row, output and clear the working interval.
    • If the working interval exists and its end time is strictly before the start time of the current row, output and clear the working interval.
    • If the working interval exists, then it must belong to the same user and machine, and overlap or abut the current row's interval, so set the working interval's end time to the current row's end time.
    • Else, the working interval doesn't exist, so set the working interval to the current row.
  3. Finally, if the working interval exists, output it.
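The steps above can be sketched in base R as follows (`merge_sessions` is a name made up for this sketch; it assumes a data frame shaped like `d` in the question, with POSIXct `start`/`end` columns):

```r
# Sweep over rows sorted by user, machine, start; merge overlapping
# or abutting intervals into a single "working interval".
merge_sessions <- function(d) {
  d <- d[order(d$username, d$machine, d$start), ]
  out <- d[0, ]     # empty frame with the same columns
  cur <- NULL       # the working interval (a one-row data frame)
  for (i in seq_len(nrow(d))) {
    row <- d[i, ]
    # Flush the working interval on a new user/machine group,
    # or when it ends strictly before the current row starts
    if (!is.null(cur) &&
        (cur$username != row$username ||
         cur$machine  != row$machine  ||
         cur$end < row$start)) {
      out <- rbind(out, cur)
      cur <- NULL
    }
    if (is.null(cur)) {
      cur <- row
    } else {
      # max() rather than row$end: a later-starting row can end earlier
      cur$end <- max(cur$end, row$end)
    }
  }
  if (!is.null(cur)) out <- rbind(out, cur)
  out
}
```

On the question's 15-row sample this should leave 13 rows, merging rows 1–2 and rows 14–15, the same result as the intervals-package answer. The loop touches each row once, so after the initial sort the cost is linear, as the pseudocode claims.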

hobbs
  • Simple and language agnostic. Though I think the order/group by would be by user AND machine. – runrig Jan 26 '12 at 18:25
  • Whoops, I missed "machine" entirely. The same idea applies, extended to whatever extra fields there are. – hobbs Jan 26 '12 at 19:19

Don't know if this is what you're after, or whether it would work any better than what you already have. It's a PowerShell solution that uses a hash table whose keys are a combination of the username and computer name. The values are a hash of the start and end times.

If a key (session) already exists, it updates the end time; if it doesn't, it creates one and sets the start time and initial end time. As it encounters new session records for that user/computer in the log, it updates the end time of the session key.

 $ht = @{}
 Import-Csv <logfile> |
   foreach {
     $key = $_.username + $_.computername
     if ($ht.ContainsKey($key)) { $ht.$key.end = $_.end }
     else { $ht.Add($key, @{ start = $_.start; end = $_.end }) }
   }

You'd need to split the user and computer names back out of the keys when it's done.

mjolinor