
I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).

library(data.table)
library(magrittr)

# data for reproducible example
  dt <- data.table( origin = c('A','B','C', 'F', 'G', 'H'), 
                    destination = c('B','C','D', 'G', 'H', 'I'),
                    freq = c(2,2,2,10,10,10),
                    route_id = c(1,1,1,2,2,2), stringsAsFactors=FALSE )
# > dt
#    origin destination freq route_id
# 1:      A           B    2        1
# 2:      B           C    2        1
# 3:      C           D    2        1
# 4:      F           G   10        2
# 5:      G           H   10        2
# 6:      H           I   10        2

For each route_id, if there is a connection A-B and a connection B-C, then I want to add a connection A-C for that same route_id, and so on.

Problems: So far, I've created a simple code that does this job but:

  1. it uses a for loop that takes a long time (my real data has hundreds of thousands of observations)
  2. it still does not cope well with direction. The direction of the connections matters here: although there is a B-C connection in the original data, there should be no C-B in the output.

My slow solution

 # loop
   # a) get a data subset corresponding to each route_id
   # b) get all combinations of origin-destination pairs 
   # c) row bind the new pairs to original data
   for (i in unique(dt$route_id)) {
     temp <- dt[route_id == i, ]
     subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
     setnames(subset_of_pairs, c("origin", "destination"))
     dt <- rbind(dt, subset_of_pairs, fill = TRUE)
   }

# assign route_id and freq to new pairs
  dt[, route_id := route_id[1L], by=origin]
  dt[, freq := freq[1L], by=route_id]

# keep only unique pairs with distinct origin and destination
  dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
  dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
  dt <- unique(dt)

Desired output

    origin destination freq route_id
 1:      A           B    2        1
 2:      A           C    2        1
 3:      A           D    2        1
 4:      B           C    2        1
 5:      B           D    2        1
 6:      C           D    2        1
 7:      F           G   10        2
 8:      F           H   10        2
 9:      F           I   10        2
10:      G           H   10        2
11:      G           I   10        2
12:      H           I   10        2
rafa.pereira
    In your hundreds of thousands of observations, how large is a typical `route_id`? That is, are there many small routes, or a smaller number of larger routes? This affects what solutions would be more efficient – David Robinson May 01 '17 at 20:34
  • the median `route_id` has 82 origin-destination pairs. Min = 2, Max = 340 – rafa.pereira May 01 '17 at 21:10

1 Answer


One way:

res = dt[, {
  stops = c(origin, last(destination))
  pairs = combn(.N + 1L, 2L)
  .(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]

    route_id o d
 1:        1 A B
 2:        1 A C
 3:        1 A D
 4:        1 B C
 5:        1 B D
 6:        1 C D
 7:        2 F G
 8:        2 F H
 9:        2 F I
10:        2 G H
11:        2 G I
12:        2 H I

This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
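That assumption is easy to verify (a sanity check, not part of the answer): within each route, every destination except the last should reappear as the next row's origin, so that the rows chain end-to-end.

```r
# TRUE per route when rows are chained end-to-end, i.e.
# c(origin, last(destination)) really is the ordered list of stops
chk = dt[, .(ordered = identical(destination[-.N], origin[-1L])), by = route_id]
chk[ordered == FALSE]  # any rows here mean the order assumption fails
```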

If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
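For example, carrying freq back onto the result and matching the question's column names (a sketch, applied to the res from above):

```r
# update join: for each route_id in res, copy freq from dt
# (i.freq refers to the freq column of the joined table dt)
res[dt, on = .(route_id), freq := i.freq]
setnames(res, c("o", "d"), c("origin", "destination"))
setcolorder(res, c("origin", "destination", "freq", "route_id"))
```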

Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
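The estimate itself is just:

```r
# worst case: ~1e6/341 routes, each contributing choose(341, 2) stop pairs
(1e6 / 341) * choose(341, 2)
# [1] 1.7e+08
```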


How it works

Generally, data.table syntax can be treated just like a loop over groups:

DT[, { 
  ...
}, by=g]

This has a few advantages over loops:

  • Nothing created in the ... body will pollute the workspace.
  • All columns can be referenced by name.
  • Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().

In the code above, pairs finds pairs of indices taken from 1 .. #stops (= .N + 1, where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix whose first row holds the first element of each pair and whose second row holds the second. The ... should evaluate to a list of columns; here list() is abbreviated as .().
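To make that concrete, here is what the intermediate objects look like for route_id == 1 (3 rows, hence 4 stops):

```r
stops = c("A", "B", "C", "D")  # c(origin, last(destination)) for route 1
pairs = combn(4L, 2L)          # all index pairs i < j from 1..4
pairs
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4
stops[pairs[1, ]]  # origins:      "A" "A" "A" "B" "B" "C"
stops[pairs[2, ]]  # destinations: "B" "C" "D" "C" "D" "D"
```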


Further improvements

I'd guess the time is mostly spent computing combn once per route. If multiple routes have the same number of stops, this can be addressed by computing the combinations beforehand:

Ns = dt[, .N, by=route_id][, unique(N)]
cb = lapply(setNames(, Ns), function(n) combn(n + 1L, 2L))  # combn needs #stops = N + 1

Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing.
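Putting that together as a self-contained sketch (the names stop_counts and pair_tbl are illustrative; note that combn needs the number of stops, which is one more than the row count .N):

```r
# precompute one combn matrix per distinct stop count
stop_counts = dt[, .N + 1L, by = route_id][, unique(V1)]
pair_tbl = lapply(setNames(, stop_counts), combn, 2L)

res = dt[, {
  stops = c(origin, last(destination))
  pairs = pair_tbl[[as.character(length(stops))]]  # lookup instead of recompute
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]
```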

Frank
  • Yes, the stops are ordered in the data and your answer is super fast and elegant. I must confess though I've never seen the use of `{}` within `data.table`. Could you briefly explain what the script is doing on the 3rd and 4th lines? Thanks for the response, @Frank! – rafa.pereira May 01 '17 at 22:03
    @rafa Cool, glad it's fast! I wasn't sure it would be. For more intuition for data.table syntax, see the vignettes. And if you want intuition for joins, maybe see my notes (since there's not a vignette on those yet): http://franknarf1.github.io/r-tutorial/_book/tables.html#tables – Frank May 01 '17 at 22:36
  • why not do the main operation like this: `res = dt[, {...}, by= .(route_id, freq)] `? I noticed that in my dataset the two operations give different results. I believe it's because some `route_id` have more than one value of `freq` (say, weekday and weekend trips) – rafa.pereira May 02 '17 at 15:02
    @rafa.pereira Yeah, if you want to group by freq as well, that works. I assumed freq was an attribute of the route_id, in which case I wouldn't leave it in the `by=` since I would think it belongs in a new table for route_id attributes. – Frank May 02 '17 at 15:16
    reporting back, I've just applied @Frank's solution to an extended version of my dataset with 4.4 million rows and it ran smoothly on my laptop (16GB of RAM and an Intel i7). It took ~1 minute and the output has 141.2 million rows – rafa.pereira May 04 '17 at 13:16
  • @rafa.pereira Cool. If you have many groups with the same size (routes with the same number of stops), I guess you can make it faster by only computing pairs once per `.N` (rather than once per route), as shown in the edit above. – Frank May 04 '17 at 15:07