0

I'm trying to solve a problem concerning data restructuring after matching two sets of rows according to a criteria.

Take for example data.table A:

library(data.table)
A <- data.table(startTime = c('2016-01-31 20:00:53', '2016-01-31 14:00:38', '2016-01-31 20:00:40', '2016-01-31 19:00:30', '2016-01-31 21:00:22'),
           endTime = c('2016-02-01 08:00:11', '2016-01-31 18:01:25', '2016-02-01 08:00:14', '2016-02-01 09:01:25', '2016-02-03 07:00:29'))

             startTime             endTime
1: 2016-01-31 20:00:53 2016-02-01 08:00:11
2: 2016-01-31 14:00:38 2016-01-31 18:01:25
3: 2016-01-31 20:00:40 2016-02-01 08:00:14
4: 2016-01-31 19:00:30 2016-02-01 09:01:25
5: 2016-01-31 21:00:22 2016-02-03 07:00:29

If I wanted to find the pairwise matches between A and itself, to determine which of these rows overlapped in their time range, I could use foverlaps from data.table (an overlap join).

A[, startTime := ymd_hms(startTime)][, endTime := ymd_hms(endTime)]
setkey(A, startTime, endTime)

foverlaps(A, A, which = T, type = "any")

And the results:

   xid yid
 1:   1   1
 2:   2   2
 3:   2   3
 4:   2   4
 5:   2   5
 6:   3   2
 7:   3   3
 8:   3   4
 9:   3   5

This is an example of the criteria I would like to match by. In other cases, I've matched by a distance between points (using location data), timestamps (rather than time intervals), etc. This matching according to a criteria really isn't the issue...

The underlying trouble I run into is how to resort these matches into groups that contain both from A -> A (L to R) and A -> A (R to L), the bidirectional pairwise matches.

This is, according to a chain rule, where if (the spatial example) A is within 100m of B and B is within 100m of C then A, B and C should be grouped together. Similarly, the time interval or time range example shown above, if A overlaps with B (in time) and B with C, then the resulting group should contain A, B and C.

A simple group membership is done in data.table thanks to .GRP for time/date equality comparisons:

locs[ , timeGroup := .GRP , by = time]

However, for more complicated comparison tests, the results are often in this 2 column match structure.

From the data above then, groups should be assigned like this:

     xid groupID
 1:   1   1
 2:   2   2
 3:   3   2
 4:   4   2
 5:   5   2

Therefore, I'm looking for a solution to group rows, based off some matching criteria, comparing all rows within a data.table with themselves. The result should be a data.table/frame with the original input locs, appended with a new column for groupID.

Thank you.

Frank
  • 66,179
  • 8
  • 96
  • 180
Alec
  • 63
  • 1
  • 3
  • Relevant - http://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together – thelatemail Feb 10 '17 at 00:50
  • Per thelatemail's link, you should probably be using igraph or some other graph-theory library: enter the data, find connected components and you're done. Shameless plug of my answers: http://stackoverflow.com/search?q=user%3A1191259+connected+ – Frank Feb 10 '17 at 01:55
  • Great, thanks. Should've struck me that this was much simpler when I took a graph theory approach.. Ended up using asnipe's graph_from_edge_list function... – Alec Feb 14 '17 at 00:34

0 Answers0