7

Question

In the igraph R package, is there an efficient implementation of subcomponent() and/or BFS that can handle multiple source vertices?

Motivation

The drake R package models a user's workflow as a DAG of interdependent objects and files. The DAG should only contain the user's targets and their upstream dependencies, so drake uses igraph::subcomponent() to eliminate superfluous vertices. This approach is inefficient because the v argument must be a single vertex, so drake ends up conducting a new BFS for every single target the user wants to build.

EDIT: 2019-01-10

drake now uses a different approach that ultimately relies on sequential calls to adjacent_vertices(). The approach is clunky, but the speed improvement is actually quite nice. Still holding out for something more elegant and sophisticated.

landau
  • 5,636
  • 1
  • 22
  • 50
  • Can you maybe use neighbourhood functions to get all ancestors : `?ego` [& here](https://stackoverflow.com/questions/35603398/finding-the-complete-adjacency-matrix-in-r/35821304#35821304) – user20650 Jan 09 '19 at 23:16
  • `make_ego_graph()` creates a new graph for every single vertex in `v` and then puts those graphs together in a list. `ego()` is similar. I will check out the performance, but I am still looking for a more direct solution. – landau Jan 10 '19 at 02:41
  • Yeah, the performance improvement is really small. – landau Jan 10 '19 at 02:45
  • I took care of the performance issue by [applying `adjacent_vertices()` over and over again](https://github.com/ropensci/drake/pull/663), but [the technique I used](https://github.com/ropensci/drake/blob/9a23dde1e6d7babf2c832494198f18a25b59787e/R/utils-igraph.R#L46-L57) is really naive and unsatisfying. I am still holding out for something more sophisticated, if implemented already. – landau Jan 10 '19 at 03:57

1 Answers1

2

I think you can do this using the distances() function which generates a matrix of distances (no of edges) between nodes. This seems to do the search once and is a good deal faster than iterating through each vertex.

Example code:

library(igraph)
library(microbenchmark)

# generate some random testing data
set.seed(1234)
g <- erdos.renyi.game(50, .01)

# Here we make a function that iterates 
# across the vector of IDs applying the function
# and returns a list where each element is the
# ids of the subcomponents
sc_apply <- function(g) {
  vs <- V(g)
  res <- sapply(vs, function(v){as.numeric( # to facilitate comparison
    subcomponent(g, v, mode = "in")
    )})
  res
}

# Try it for testing
t1 <- sc_apply(g)

# Here we get the matrix of node distances. Infinite distance
# implies a seperate component. We iterate through rows of
# matrix to extract the set of nodes that are connected 
sc_distmat <- function(g) {
  dmat <- distances(g, mode = "in")
  res <- apply(dmat, 1, function(row){which(is.finite(row))})
  res
}

# extract for testing
t2 <- sc_distmat(g)

# check that equal (we need to sort the 
# subcomponent list elements first to facilitate comparison)
all.equal(lapply(t1, sort), t2)
#> [1] TRUE

The results are identical -- although worth noting that if your graph is one giant component than apply will return a matrix to you instead of a list so you will need to do your comparisons in a slightly different way.

Ok now lets see how much/whether this is faster:

# generate graphs of different sizes (not too big because my
# laptop is borderline antique!)
set.seed(42)
small_g <- erdos.renyi.game(10, .2)
mid_g <- erdos.renyi.game(50, .1)
big_g <- erdos.renyi.game(100, .1)

# check speed improvement
microbenchmark(sc_apply(small_g), sc_distmat(small_g))
#> Unit: microseconds
#>                 expr      min        lq      mean   median        uq
#>    sc_apply(small_g) 2181.465 2243.4895 2734.7132 2313.005 2777.7135
#>  sc_distmat(small_g)  451.333  471.8565  840.4742  521.865  598.0845
#>        max neval cld
#>   9152.262   100   b
#>  27139.262   100  a
microbenchmark(sc_apply(mid_g), sc_distmat(mid_g))
#> Unit: microseconds
#>               expr       min        lq       mean    median         uq
#>    sc_apply(mid_g) 11006.113 11327.794 13590.9536 12919.492 15397.2510
#>  sc_distmat(mid_g)   757.752   795.308   949.2551   841.834   965.4545
#>        max neval cld
#>  27068.296   100   b
#>   2061.824   100  a
microbenchmark(sc_apply(big_g), sc_distmat(big_g))
#> Unit: milliseconds
#>               expr      min        lq      mean    median        uq
#>    sc_apply(big_g) 23.11678 26.696373 29.940675 29.191045 33.012796
#>  sc_distmat(big_g)  1.67531  1.727873  2.156307  1.855994  2.244872
#>        max neval cld
#>  47.081647   100   b
#>   7.576123   100  a

As you can see the distances() approach is faster, and increasingly faster as the size of the graph grows.

Created on 2019-01-10 by the reprex package (v0.2.1)

gfgm
  • 3,627
  • 14
  • 34
  • 1
    I like it, it's elegant. However, [`drake`'s clunky approach with `adjacent_vertices()`](https://github.com/ropensci/drake/blob/master/R/utils-igraph.R#L46-L57) actually appears to scale better. [Benchmarks here](https://github.com/ropensci/drake/issues/665#issue-397937277). – landau Jan 10 '19 at 17:38
  • Sorry, I forgot to mention that this is what `drake` is now doing. Please see the 2019-01-10 edit in the OP. – landau Jan 10 '19 at 17:50
  • 1
    hey @landau, cool benchmarks (and cool package incidentally!). You might be being too modest and your implementation is already excellent! However, looking at the benchmarks you posted I'm a bit confused because the `sc_drake()` function seems to return completely different output: a single vector of vertices rather than a list of vectors of vertices. E.g. `sc_drake(big_g)` returns a length-200 vector to me which can't be the right output. Do you get the same? Maybe I'm copying the function to my computer wrong. – gfgm Jan 11 '19 at 13:06
  • Thanks, `gfgm`! The output format is actually intentional. I find the usual list of vectors unhelpful for my use case. What `drake` really needs is the vector of vertices that `sc_drake()` currently produces. I realize this may not be an apples-to-apples comparison for the purposes of benchmarking against `distances()`, but it is exactly the output I am after. – landau Jan 11 '19 at 13:38
  • sorry follow up q: if I run `set.seed(42); small_g <- named_graph(10, .2); plot(small_g)` you can see right away that vertices 1-9 are a separate component. If I then call `sc_drake(small_g)` the output is just the vector 1:10. How can you know from this that these are separate components? Indeed how can you know from that output which vertices depend on which other vertices? – gfgm Jan 11 '19 at 13:43
  • You are right, we can't parse components or individual dependency relationships from the output of `sc_drake()`. However, `drake` does not need this information from `sc_drake()`. Given an initial vector of vertices `v`, all `drake` needs to know is which vertices are upstream (or downstream) of anything in `v`. – landau Jan 11 '19 at 13:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186547/discussion-between-gfgm-and-landau). – gfgm Jan 11 '19 at 13:52
  • Sorry, my company blocks chats. How about https://github.com/ropensci/drake/issues/665? – landau Jan 11 '19 at 14:02
  • For onlookers: the discussion continues at https://github.com/ropensci/drake/issues/665#issuecomment-453529369 – landau Jan 11 '19 at 16:41