
I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).

library(data.table)
library(magrittr)

# data for reproducible example
  dt <- data.table( origin = c('A','B','C', 'F', 'G', 'H'), 
                    destination = c('B','C','D', 'G', 'H', 'I'),
                    freq = c(2,2,2,10,10,10),
                    route_id = c(1,1,1,2,2,2), stringsAsFactors=FALSE )
# > dt
#    origin destination freq route_id
# 1:      A           B    2        1
# 2:      B           C    2        1
# 3:      C           D    2        1
# 4:      F           G   10        2
# 5:      G           H   10        2
# 6:      H           I   10        2

For each route_id, if there is a connection A-B and a connection B-C, then I want to add a connection A-C for that same route_id, and so on.

Problems: So far, I've created a simple code that does this job but:

  1. it uses a for loop that takes a long time (my real data has hundreds of thousands of observations)
  2. it still does not cope well with direction. The direction of the connections matters here: although there is a B-C connection in the original data, there should be no C-B in the output.

My slow solution

 # loop
   # a) get a data subset corresponding to each route_id
   # b) get all combinations of origin-destination pairs 
   # c) row bind the new pairs to original data
   for (i in unique(dt$route_id)) {
     temp <- dt[route_id == i, ]
     subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
     setnames(subset_of_pairs, c("origin", "destination"))
     dt <- rbind(dt, subset_of_pairs, fill = TRUE)
   }

# assign route_id and freq to new pairs
  dt[, route_id := route_id[1L], by=origin]
  dt[, freq := freq[1L], by=route_id]

# keep only unique pairs with distinct origin and destination
  dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
  dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
  dt <- unique(dt)

Desired output

    origin destination freq route_id
 1:      A           B    2        1
 2:      A           C    2        1
 3:      A           D    2        1
 4:      B           C    2        1
 5:      B           D    2        1
 6:      C           D    2        1
 7:      F           G   10        2
 8:      F           H   10        2
 9:      F           I   10        2
10:      G           H   10        2
11:      G           I   10        2
12:      H           I   10        2
rafa.pereira
    In your hundreds of thousands of observations, how large is a typical `route_id`? That is, are there many small routes, or a smaller number of larger routes? This affects what solutions would be more efficient – David Robinson May 01 '17 at 20:34
  • the median `route_id` has 82 origin-destination pairs. Min = 2, Max = 340 – rafa.pereira May 01 '17 at 21:10

1 Answer


One way:

res = dt[, {
  stops = c(origin, last(destination))
  pairs = combn(.N + 1L, 2L)
  .(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]

    route_id o d
 1:        1 A B
 2:        1 A C
 3:        1 A D
 4:        1 B C
 5:        1 B D
 6:        1 C D
 7:        2 F G
 8:        2 F H
 9:        2 F I
10:        2 G H
11:        2 G I
12:        2 H I

This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
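That assumption is easy to verify (a sanity check, not part of the answer): within each route, every destination except the last should reappear as the next row's origin, so that the rows chain end-to-end.

```r
# TRUE per route when rows are chained end-to-end, i.e.
# c(origin, last(destination)) really is the ordered list of stops
chk = dt[, .(ordered = identical(destination[-.N], origin[-1L])), by = route_id]
chk[ordered == FALSE]  # any rows here mean the order assumption fails
```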

If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
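For example, carrying freq back onto the result and matching the question's column names (a sketch, applied to the res from above):

```r
# update join: for each route_id in res, copy freq from dt
# (i.freq refers to the freq column of the joined table dt)
res[dt, on = .(route_id), freq := i.freq]
setnames(res, c("o", "d"), c("origin", "destination"))
setcolorder(res, c("origin", "destination", "freq", "route_id"))
```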

Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
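The estimate itself is just:

```r
# worst case: ~1e6/341 routes, each contributing choose(341, 2) stop pairs
(1e6 / 341) * choose(341, 2)
# [1] 1.7e+08
```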


How it works

Generally, data.table syntax can be treated just like a loop over groups:

DT[, { 
  ...
}, by=g]

This has a few advantages over loops:

  • Nothing created in the ... body will pollute the workspace.
  • All columns can be referenced by name.
  • Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().

In the code above, pairs finds pairs of indices taken from 1 .. #stops (= .N + 1, where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix whose first row holds the first element of each pair and whose second row holds the second. The ... should evaluate to a list of columns; here list() is abbreviated as .().
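To make that concrete, here is what the intermediate objects look like for route_id == 1 (3 rows, hence 4 stops):

```r
stops = c("A", "B", "C", "D")  # c(origin, last(destination)) for route 1
pairs = combn(4L, 2L)          # all index pairs i < j from 1..4
pairs
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4
stops[pairs[1, ]]  # origins:      "A" "A" "A" "B" "B" "C"
stops[pairs[2, ]]  # destinations: "B" "C" "D" "C" "D" "D"
```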


Further improvements

I'd guess the time is mostly spent computing combn once per route. If multiple routes have the same number of stops, this can be addressed by computing the combinations beforehand:

Ns = dt[, .N, by=route_id][, unique(N)]
cb = lapply(setNames(, Ns), function(n) combn(n + 1L, 2L))  # combn needs #stops = N + 1

Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing.
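Putting that together as a self-contained sketch (the names stop_counts and pair_tbl are illustrative; note that combn needs the number of stops, which is one more than the row count .N):

```r
# precompute one combn matrix per distinct stop count
stop_counts = dt[, .N + 1L, by = route_id][, unique(V1)]
pair_tbl = lapply(setNames(, stop_counts), combn, 2L)

res = dt[, {
  stops = c(origin, last(destination))
  pairs = pair_tbl[[as.character(length(stops))]]  # lookup instead of recompute
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]
```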

Frank
  • Yes, the stops are ordered in the data and your answer is super fast and elegant. I must confess though I've never seen the use of `{}` within `data.table`. Could you briefly explain what the script is doing on the 3rd and 4th lines? Thanks for the response, @Frank! – rafa.pereira May 01 '17 at 22:03
    @rafa Cool, glad it's fast! I wasn't sure it would be. For more intuition for data.table syntax, see the vignettes. And if you want intuition for joins, maybe see my notes (since there's not a vignette on those yet): http://franknarf1.github.io/r-tutorial/_book/tables.html#tables – Frank May 01 '17 at 22:36
  • why not do the main operation like this: `res = dt[, {...}, by= .(route_id, freq)] `? I noticed that in my dataset the two operations give different results. I believe it's because some `route_id` have more than one value of `freq` (say, weekday and weekend trips) – rafa.pereira May 02 '17 at 15:02
    @rafa.pereira Yeah, if you want to group by freq as well, that works. I assumed freq was an attribute of the route_id, in which case I wouldn't leave it in the `by=` since I would think it belongs in a new table for route_id attributes. – Frank May 02 '17 at 15:16
    reporting back, I've just applied @Frank's solution to an extended version of my dataset with 4.4 million rows and it ran smoothly on my laptop (16GB of RAM and an Intel i7). It took ~1 minute and the output has 141.2 million rows – rafa.pereira May 04 '17 at 13:16
  • @rafa.pereira Cool. If you have many groups with the same size (routes with the same number of stops), I guess you can make it faster by only computing pairs once per `.N` (rather than once per route), as shown in the edit above. – Frank May 04 '17 at 15:07