
I'm doing a simple task: iterating over all vertices and computing a new attribute based on those of their neighbours. I searched SO, and so far I know of at least two ways to do it:

  1. Use as_adj_list to create an adjacency list and then iterate over each element;
  2. Use sapply to iterate over each vertex directly.

However, both methods take too long for the magnitude of my data (300k vertices and 8 million edges). Is there any faster way to loop over vertices? Thanks!

For a benchmark, say I have the following sample data:

library(igraph)

set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$name <- seq_len(gorder(g)) # add a name attribute for data.table merge
V(g)$attr <- rnorm(gorder(g))
V(g)$mean <- 0 # "mean" is the attribute I want to compute

The code for method 1 is:

al <- as_adj_list(g)
attr <- V(g)$attr
V(g)$mean <- sapply(al, function(x) mean(attr[x])) 
# took 28s
# most of the time is spent on creating the adj list

The code for method 2 is:

compute_mean <- function(v){
    mean(neighbors(g, v)$attr)
}
V(g)$mean <- sapply(V(g), compute_mean)  # took 33s

I believe igraph-R should not be this slow at iterating over vertices; otherwise, analysing large graphs with millions of vertices would be impossible, and that is a task I'd expect to be quite common for R users!
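For what it's worth, this particular computation can be vectorised without iterating over vertices at all: the neighbour sums are a single sparse matrix-vector product. This is a sketch of my own, not from the original post; it assumes the Matrix package is available (igraph returns its sparse adjacency matrices in that format).

```r
# Sketch: compute each vertex's neighbour mean as (A %*% attr) / degree,
# where A is the sparse adjacency matrix. Assumes the Matrix package.
library(igraph)
library(Matrix)

set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$attr <- rnorm(gorder(g))

A <- as_adjacency_matrix(g, sparse = TRUE)   # dgCMatrix
nbr_sum <- as.vector(A %*% V(g)$attr)        # sum of neighbours' attr per vertex
V(g)$mean <- nbr_sum / degree(g)             # isolated vertices give NaN, like mean() of nothing
```

This avoids both the as_adj_list construction and the per-vertex sapply call.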

Update

Following @MichaelChirico's comment, I came up with a third method: import the graph structure into a data.table and do the calculation with data.table's by-group syntax, as follows:

library(data.table)
library(magrittr)  # for %>%

gdt.v <- as_data_frame(g, what = "vertices") %>% setDT() # output the vertices
gdt.e <- as_data_frame(g, what = "edges") %>% setDT() # output the edges
gdt <- gdt.e[gdt.v, on = c(to = "name"), nomatch = 0] # merge vertices and edges data.table
mean <- gdt[, .(mean = mean(attr)), keyby = from][, mean]
V(g)$mean <- mean 
# took only 0.74s !!

The data.table way is much faster. However, its result is not exactly identical to that of the first two methods. Besides, I'm disappointed that I have to rely on another package for such a simple task, which I thought would be a strength of igraph-R. I hope I'm wrong!
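One plausible explanation for the mismatch (my own reading, not stated in the thread): as_data_frame() lists each undirected edge only once, so grouping by `from` averages over only part of each vertex's neighbourhood. A sketch that lists every edge in both directions before grouping:

```r
# Sketch: duplicate each undirected edge in the reverse direction so that
# grouping by `from` sees the full neighbourhood of every vertex.
library(igraph)
library(data.table)

set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$name <- seq_len(gorder(g))
V(g)$attr <- rnorm(gorder(g))

gdt.e <- setDT(as_data_frame(g, what = "edges"))
gdt.v <- setDT(as_data_frame(g, what = "vertices"))
both  <- rbind(gdt.e, gdt.e[, .(from = to, to = from)])  # both directions
both  <- both[gdt.v, on = c(to = "name"), nomatch = 0]   # attach neighbour's attr
res   <- both[, .(mean = mean(attr)), by = from]
V(g)$mean <- res$mean[match(V(g)$name, res$from)]        # align by vertex name
```

With this change the grouped result should match the adjacency-list methods; vertices with no neighbours end up NA rather than NaN.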

R. Zhu
    Perhaps look into using `data.table` / `findinterval` – MichaelChirico Oct 16 '15 at 17:04
  • @MichaelChirico, ah, I think I got your idea, do you mean first import the graph structure into a data.table, and then do calculation using the data.table's fast grouping feature? I tried and it was MUCH faster than the igraph way. However, that's not elegant for me: fast iterating over vertices should be a basic feature for any graph package. It is a pity that a particular SNA package user has to turn to other package to do some basic SNA calculation! – R. Zhu Oct 17 '15 at 02:30
  • can't say I disagree. the beauty of R is you could write your own package! ;-) – MichaelChirico Oct 17 '15 at 03:52

1 Answer


I'm not sure where the actual problem is... When I re-run your code:

library(microbenchmark)
library(data.table)
library(igraph)
library(magrittr)  # for %>%
set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$name <- seq_len(gorder(g)) # add a name attribute for data.table merge
V(g)$attr <- rnorm(gorder(g))
V(g)$mean <- 0 # "mean" is the attribute I want to compute
gg <- g

... and compare the two methods in expressions e1 and e2

e1 <- expression({
  al <- as_adj_list(gg)
  attr <- V(gg)$attr
  V(gg)$mean <- sapply(al, function(x) mean(attr[x]))  
})

e2 <- expression({
  gdt.v <- as_data_frame(g, what = "vertices") %>% setDT() # output the vertices
  gdt.e <- as_data_frame(g, what = "edges") %>% setDT() # output the edges
  gdt <- gdt.e[gdt.v, on = c(to = "name"), nomatch = 0] # merge vertices and edges data.table
  mean <- gdt[, .(mean = mean(attr)), keyby = from][, mean]
  V(g)$mean <- mean 
})

The timings reported by

microbenchmark(e1, e2)

## Unit: nanoseconds
##  expr min lq  mean median uq max neval cld
##    e1  47 47 51.42     48 48 338   100   a
##    e2  47 47 59.98     48 48 956   100   a

are in nanoseconds because this call only times looking up the expression objects `e1` and `e2`, not evaluating them. To benchmark the actual work, the expressions have to be evaluated:

microbenchmark(eval(e1), eval(e2), times = 10)

For the same reason, the check

all.equal(V(g)$mean, V(gg)$mean)

## [1] TRUE

returns TRUE only because neither expression was ever evaluated, so both mean attributes still hold the zeros they were initialised with. As the question notes, the two methods do not give exactly identical results once they are actually run.

Michał