subcomponent(mode = "in") for multiple source vertices

Question

Question

In the igraph R package, is there an efficient implementation of subcomponent() and/or BFS that can handle multiple source vertices?

Motivation

The drake R package models a user's workflow as a DAG of interdependent objects and files. The DAG should only contain the user's targets and their upstream dependencies, so drake uses igraph::subcomponent() to eliminate superfluous vertices. This approach is inefficient because the v argument must be a single vertex, so drake ends up conducting a new BFS for every single target the user wants to build.

EDIT: 2019-01-10

drake now uses a different approach that ultimately relies on sequential calls to adjacent_vertices(). The approach is clunky, but the speed improvement is actually quite nice. Still holding out for something more elegant and sophisticated.

Can you maybe use neighbourhood functions to get all ancestors : `?ego` [& here](https://stackoverflow.com/questions/35603398/finding-the-complete-adjacency-matrix-in-r/35821304#35821304) — user20650, Jan 09 '19 at 23:16
`make_ego_graph()` creates a new graph for every single vertex in `v` and then puts those graphs together in a list. `ego()` is similar. I will check out the performance, but I am still looking for a more direct solution. — landau, Jan 10 '19 at 02:41
I took care of the performance issue by [applying `adjacent_vertices()` over and over again](https://github.com/ropensci/drake/pull/663), but [the technique I used](https://github.com/ropensci/drake/blob/9a23dde1e6d7babf2c832494198f18a25b59787e/R/utils-igraph.R#L46-L57) is really naive and unsatisfying. I am still holding out for something more sophisticated, if implemented already. — landau, Jan 10 '19 at 03:57

score 2 · Answer 1 · answered Jan 10 '19 at 14:43

I think you can do this using the distances() function which generates a matrix of distances (no of edges) between nodes. This seems to do the search once and is a good deal faster than iterating through each vertex.

Example code:

library(igraph)
library(microbenchmark)

# generate some random testing data
set.seed(1234)
g <- erdos.renyi.game(50, .01)

# Here we make a function that iterates 
# across the vector of IDs applying the function
# and returns a list where each element is the
# ids of the subcomponents
sc_apply <- function(g) {
  vs <- V(g)
  res <- sapply(vs, function(v){as.numeric( # to facilitate comparison
    subcomponent(g, v, mode = "in")
    )})
  res
}

# Try it for testing
t1 <- sc_apply(g)

# Here we get the matrix of node distances. Infinite distance
# implies a seperate component. We iterate through rows of
# matrix to extract the set of nodes that are connected 
sc_distmat <- function(g) {
  dmat <- distances(g, mode = "in")
  res <- apply(dmat, 1, function(row){which(is.finite(row))})
  res
}

# extract for testing
t2 <- sc_distmat(g)

# check that equal (we need to sort the 
# subcomponent list elements first to facilitate comparison)
all.equal(lapply(t1, sort), t2)
#> [1] TRUE

The results are identical -- although worth noting that if your graph is one giant component than apply will return a matrix to you instead of a list so you will need to do your comparisons in a slightly different way.

Ok now lets see how much/whether this is faster:

# generate graphs of different sizes (not too big because my
# laptop is borderline antique!)
set.seed(42)
small_g <- erdos.renyi.game(10, .2)
mid_g <- erdos.renyi.game(50, .1)
big_g <- erdos.renyi.game(100, .1)

# check speed improvement
microbenchmark(sc_apply(small_g), sc_distmat(small_g))
#> Unit: microseconds
#>                 expr      min        lq      mean   median        uq
#>    sc_apply(small_g) 2181.465 2243.4895 2734.7132 2313.005 2777.7135
#>  sc_distmat(small_g)  451.333  471.8565  840.4742  521.865  598.0845
#>        max neval cld
#>   9152.262   100   b
#>  27139.262   100  a
microbenchmark(sc_apply(mid_g), sc_distmat(mid_g))
#> Unit: microseconds
#>               expr       min        lq       mean    median         uq
#>    sc_apply(mid_g) 11006.113 11327.794 13590.9536 12919.492 15397.2510
#>  sc_distmat(mid_g)   757.752   795.308   949.2551   841.834   965.4545
#>        max neval cld
#>  27068.296   100   b
#>   2061.824   100  a
microbenchmark(sc_apply(big_g), sc_distmat(big_g))
#> Unit: milliseconds
#>               expr      min        lq      mean    median        uq
#>    sc_apply(big_g) 23.11678 26.696373 29.940675 29.191045 33.012796
#>  sc_distmat(big_g)  1.67531  1.727873  2.156307  1.855994  2.244872
#>        max neval cld
#>  47.081647   100   b
#>   7.576123   100  a

As you can see the distances() approach is faster, and increasingly faster as the size of the graph grows.

^{Created on 2019-01-10 by the reprex package (v0.2.1)}

I like it, it's elegant. However, [`drake`'s clunky approach with `adjacent_vertices()`](https://github.com/ropensci/drake/blob/master/R/utils-igraph.R#L46-L57) actually appears to scale better. [Benchmarks here](https://github.com/ropensci/drake/issues/665#issue-397937277). — landau, Jan 10 '19 at 17:38
Sorry, I forgot to mention that this is what `drake` is now doing. Please see the 2019-01-10 edit in the OP. — landau, Jan 10 '19 at 17:50
hey @landau, cool benchmarks (and cool package incidentally!). You might be being too modest and your implementation is already excellent! However, looking at the benchmarks you posted I'm a bit confused because the `sc_drake()` function seems to return completely different output: a single vector of vertices rather than a list of vectors of vertices. E.g. `sc_drake(big_g)` returns a length-200 vector to me which can't be the right output. Do you get the same? Maybe I'm copying the function to my computer wrong. — gfgm, Jan 11 '19 at 13:06
Thanks, `gfgm`! The output format is actually intentional. I find the usual list of vectors unhelpful for my use case. What `drake` really needs is the vector of vertices that `sc_drake()` currently produces. I realize this may not be an apples-to-apples comparison for the purposes of benchmarking against `distances()`, but it is exactly the output I am after. — landau, Jan 11 '19 at 13:38
sorry follow up q: if I run `set.seed(42); small_g <- named_graph(10, .2); plot(small_g)` you can see right away that vertices 1-9 are a separate component. If I then call `sc_drake(small_g)` the output is just the vector 1:10. How can you know from this that these are separate components? Indeed how can you know from that output which vertices depend on which other vertices? — gfgm, Jan 11 '19 at 13:43
You are right, we can't parse components or individual dependency relationships from the output of `sc_drake()`. However, `drake` does not need this information from `sc_drake()`. Given an initial vector of vertices `v`, all `drake` needs to know is which vertices are upstream (or downstream) of anything in `v`. — landau, Jan 11 '19 at 13:48
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186547/discussion-between-gfgm-and-landau). — gfgm, Jan 11 '19 at 13:52
Sorry, my company blocks chats. How about https://github.com/ropensci/drake/issues/665? — landau, Jan 11 '19 at 14:02
For onlookers: the discussion continues at https://github.com/ropensci/drake/issues/665#issuecomment-453529369 — landau, Jan 11 '19 at 16:41

subcomponent(mode = "in") for multiple source vertices

Question

Motivation

EDIT: 2019-01-10

1 Answers1