I'm trying to solve a problem concerning data restructuring after matching two sets of rows according to a criteria.
Take for example data.table A:
library(data.table)
A <- data.table(startTime = c('2016-01-31 20:00:53', '2016-01-31 14:00:38', '2016-01-31 20:00:40', '2016-01-31 19:00:30', '2016-01-31 21:00:22'),
endTime = c('2016-02-01 08:00:11', '2016-01-31 18:01:25', '2016-02-01 08:00:14', '2016-02-01 09:01:25', '2016-02-03 07:00:29'))
startTime endTime
1: 2016-01-31 20:00:53 2016-02-01 08:00:11
2: 2016-01-31 14:00:38 2016-01-31 18:01:25
3: 2016-01-31 20:00:40 2016-02-01 08:00:14
4: 2016-01-31 19:00:30 2016-02-01 09:01:25
5: 2016-01-31 21:00:22 2016-02-03 07:00:29
If I wanted to find the pairwise matches between A and itself, to determine which of these rows overlapped in their time range, I could use foverlaps
from data.table (an overlap join).
A[, startTime := ymd_hms(startTime)][, endTime := ymd_hms(endTime)]
setkey(A, startTime, endTime)
foverlaps(A, A, which = T, type = "any")
And the results:
xid yid
1: 1 1
2: 2 2
3: 2 3
4: 2 4
5: 2 5
6: 3 2
7: 3 3
8: 3 4
9: 3 5
This is an example of the criteria I would like to match by. In other cases, I've matched by a distance between points (using location data), timestamps (rather than time intervals), etc. This matching according to a criteria really isn't the issue...
The underlying trouble I run into is how to resort these matches into groups that contain both from A -> A (L to R) and A -> A (R to L), the bidirectional pairwise matches.
This is, according to a chain rule, where if (the spatial example) A is within 100m of B and B is within 100m of C then A, B and C should be grouped together. Similarly, the time interval or time range example shown above, if A overlaps with B (in time) and B with C, then the resulting group should contain A, B and C.
A simple group membership is done in data.table
thanks to .GRP
for time/date equality comparisons:
locs[ , timeGroup := .GRP , by = time]
However, for more complicated comparison tests, the results are often in this 2 column match structure.
From the data above then, groups should be assigned like this:
xid groupID
1: 1 1
2: 2 2
3: 3 2
4: 4 2
5: 5 2
Therefore, I'm looking for a solution to group rows, based off some matching criteria, comparing all rows within a data.table with themselves. The result should be a data.table/frame with the original input locs, appended with a new column for groupID.
Thank you.