1

I have a network defined by a list of edges. The network is large and sparse. For each pair of connected vertices, I would like to calculate the number of common neighbours. This post discusses how to do this for a single pair of vertices, but it strikes me as inefficient to loop over all edges to calculate this statistic for each edge in the graph. Instead, the statistic I'm after can be calculated from the product of the adjacency matrix with itself, as follows:

library(igraph)
library(data.table)
set.seed(1111)

E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = T),
                j = sample(as.character(1:5e4), 1e5, replace = T))
G <- simplify(graph_from_data_frame(E, directed = F))  # remove loops and multiples
N <- as_adjacency_matrix(G) %*% as_adjacency_matrix(G)

However, I don't know how to efficiently get the information out of the resulting matrix N, without looping over all the cells, which would look like this:

extract_entries <- function(x, M) {
 
  nl <- M@p[x] + 1  # index from 1, not 0
  nu <- M@p[x+1]
  j.col <- M@Dimnames[[1]][M@i[nl:nu] + 1]
  i.col <- M@Dimnames[[2]][x]
  nb.col <- M@x[nl:nu]
  
  data.table(i = i.col, j = j.col, nb = nb.col)
  
}
  
system.time(E.nb <- rbindlist(lapply(1:N@Dim[1], extract_entries, N), fill = T))

   # user  system elapsed 
  #  8.29    0.02    8.31 

E <- E.nb[E, on = c('i', 'j')][is.na(nb), nb := 0]

Even in the reproducible example above, looping is slow, and the true graph might have millions of vertices and tens of millions of edges. My end goal is to add a column to the data frame E with the number of common neighbours for each edge, as illustrated in the MWE.

My question is: is there a (much) more efficient way of extracting the number of common neighbours for each pair of vertices and merging this information back into the list of edges?

I have seen that the package diagramme_R includes a function that calculates the number of common neighbours, however it again appears intended to be used for a limited number of edges, and wouldn't solve the problem of adding the information on the number of common neighbours back to the original data frame.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
s.willis
  • 347
  • 2
  • 11
  • `N@x` gives you all the values... does that help? – Wimpel Oct 11 '22 at 15:20
  • If two connected nodes are both connected to a third node, it will form a triangle. Take a look at `igraph`'s `triangles` function. – jblood94 Oct 11 '22 at 16:00
  • ...or better yet, the `count_triangles` function. – jblood94 Oct 11 '22 at 16:12
  • @jblood94 the ```count_triangles``` function counts the number of triangles for a vertex, I'm after something like the number of triangles for an edge. – s.willis Oct 11 '22 at 17:17
  • @Wimpel I'm after a way to efficiently extract that information, I've edited the question to reflect this. – s.willis Oct 11 '22 at 17:19
  • Just as a suggestion, you might also want to take a look at the `similarity` function. It calculates for each pair of nodes the proportion of common neighbours. The possible "downside" is that this also calculates the common neighbours for pairs of nodes other than edges, and returns the results as a full matrix. – Vincent Traag Oct 12 '22 at 10:46

2 Answers2

1

You're pretty much there. Just a couple things: converting N to a triangular matrix lets us build E.nb without lapply. Also, the i-j ordering is messing up the E.nb[E join. Temporarily sorting each row fixes this.

I've also included a function that uses the igraph triangles function instead of squaring the adjacency matrix, which is a bit faster on this example dataset.

library(igraph)
library(data.table)
library(Matrix)
set.seed(1111)

E <- data.table(i = sample(as.character(1:5e4), 1e5, replace = TRUE),
                j = sample(as.character(1:5e4), 1e5, replace = TRUE))

f1 <- function(E) {
  blnSort <- E[[1]] > E[[2]]
  E[blnSort, 2:1 := .SD, .SDcols = 1:2]
  G <- simplify(graph_from_data_frame(E, directed = FALSE))  # remove loops and multiples
  N <- as(tril(as_adjacency_matrix(G) %*% as_adjacency_matrix(G), -1), "dtTMatrix")
  data.table(
    i = N@Dimnames[[1]][N@i + 1],
    j = N@Dimnames[[2]][N@j + 1],
    nb = as.integer(N@x)
  )[
    i > j, 2:1 := .(i, j)
  ][
    E, on = .(i, j)
  ][
    is.na(nb), nb := 0L
  ][
    blnSort, 2:1 := .(i, j)
  ]
}

f2 <- function(E) {
  blnSort <- E[[1]] > E[[2]]
  E[blnSort, 2:1 := .SD, .SDcols = 1:2]
  G <- simplify(graph_from_data_frame(E, directed = FALSE))  # remove loops and multiples
  tri <- matrix(triangles(G), ncol = 3, byrow = TRUE)
  data.table(
    i = names(V(G)[tri[,c(1, 1, 2)]]),
    j = names(V(G)[tri[,c(2, 3, 3)]])
  )[
    i > j, 2:1 := .(i, j)
  ][
    ,.(nb = .N), .(i, j)
  ][
    E, on = .(i, j)
  ][
    is.na(nb), nb := 0L
  ][
    blnSort, 2:1 := .(i, j)
  ]
}

microbenchmark::microbenchmark(f1 = f1(EE),
                               f2 = f2(EE),
                               setup = {EE <- copy(E)})
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>    f1 257.4803 281.8928 325.0267 303.4441 370.4977 478.7524   100
#>    f2 123.5213 139.5152 169.3914 151.3065 190.7800 284.2644   100
identical(f1(copy(E)), f2(copy(E)))
#> [1] TRUE
jblood94
  • 10,340
  • 1
  • 10
  • 15
0

I think you can simply the a matrix for indexing (since you are using character as vertex name, which turns to be the row/column names of the sparse matrix N, so we have to use match to find the corresponding indices)

cbind(df, nb = N[matrix(match(as.matrix(df), colnames(N)), ncol = 2)])

Example

Given an edge list df and resulting sparse matrix N

df <- data.frame(
  from = letters[c(1, 1, 2, 2, 6, 6, 6, 1, 1, 1)],
  to = letters[c(2, 4, 3, 5, 3, 2, 5, 5, 6, 3)]
)
g <- graph_from_data_frame(df, directed = FALSE)
m <- get.adjacency(g)
N <- m %*% m

by running the solution above, we will obtain

> cbind(df, nb = N[matrix(match(as.matrix(df), colnames(N)), ncol = 2)])
   from to nb
1     a  b  3
2     a  d  0
3     b  c  2
4     b  e  2
5     f  c  2
6     f  b  3
7     f  e  2
8     a  e  2
9     a  f  3
10    a  c  2

enter image description here

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81