Edit: I'm looking for solution for this question now also with other programming languages.
Based on the other question I asked, I have a dataset like this (for R users, dput for this below) which represents user computer sessions:
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
2 user1 D5599.domain.com 2011-01-03 09:46:29 2011-01-03 10:09:16
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
15 USER2 D5682.domain.com 2011-02-14 14:59:23 2011-02-14 15:14:47
>
There may be several concurrent (overlapping based on time) sessions for the same username from the the same computer. How can I remove those rows so that only one sessions is left for this data? Original data set has approx. 500 000 rows.
The expected output is (rows 2, 15 removed)
username machine start end
1 user1 D5599.domain.com 2011-01-03 09:44:18 2011-01-03 09:47:27
3 user1 D5599.domain.com 2011-01-03 14:07:36 2011-01-03 14:56:17
4 user1 D5599.domain.com 2011-01-05 15:03:17 2011-01-05 15:23:15
5 user1 D5599.domain.com 2011-02-14 14:33:39 2011-02-14 14:40:16
6 user1 D5599.domain.com 2011-02-23 13:54:30 2011-02-23 13:58:23
7 user1 D5599.domain.com 2011-03-21 10:10:18 2011-03-21 10:32:22
8 user1 D5645.domain.com 2011-06-09 10:12:41 2011-06-09 10:58:59
9 user1 D5682.domain.com 2011-01-03 12:03:45 2011-01-03 12:29:43
10 USER2 D5682.domain.com 2011-01-12 14:26:05 2011-01-12 14:32:53
11 USER2 D5682.domain.com 2011-01-17 15:06:19 2011-01-17 15:44:22
12 USER2 D5682.domain.com 2011-01-18 15:07:30 2011-01-18 15:42:43
13 USER2 D5682.domain.com 2011-01-25 15:20:55 2011-01-25 15:24:38
14 USER2 D5682.domain.com 2011-02-14 15:03:00 2011-02-14 15:07:43
>
Here is the dataset:
structure(list(username = c("user1", "user1", "user1",
"user1", "user1", "user1", "user1", "user1",
"user1", "USER2", "USER2", "USER2", "USER2", "USER2", "USER2"
), machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("D5599.domain.com", "D5645.domain.com",
"D5682.domain.com", "D5686.domain.com", "D5694.domain.com", "D5696.domain.com",
"D5772.domain.com", "D5772.domain.com", "D5847.domain.com", "D5855.domain.com",
"D5871.domain.com", "D5927.domain.com", "D5927.domain.com", "D5952.domain.com",
"D5993.domain.com", "D6012.domain.com", "D6048.domain.com", "D6077.domain.com",
"D5688.domain.com", "D5815.domain.com", "D6106.domain.com", "D6128.domain.com"
), class = "factor"), start = structure(c(1294040658, 1294040789,
1294056456, 1294232597, 1297686819, 1298462070, 1300695018, 1307603561,
1294049025, 1294835165, 1295269579, 1295356050, 1295961655, 1297688580,
1297688363), class = c("POSIXct", "POSIXt"), tzone = ""), end =
structure(c(1294040847,
1294042156, 1294059377, 1294233795, 1297687216, 1298462303, 1300696342,
1307606339, 1294050583, 1294835573, 1295271862, 1295358163, 1295961878,
1297688863, 1297689287), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("username",
"machine", "start", "end"), row.names = c(NA, 15L), class = "data.frame")