2

I have been recently working with graph, specifically in detecting community structure in graph, and this requires me to program random walks.

I have ran my code several times but all of them show a very slow performance of the random walkers. Meanwhile, the pre-coded algorithm in the igraph packages runs very smoothly. For example, if I run 1000 walkers on a graph of 10000 nodes, it would take me a few hours just for the task while the cluster_walktrap function only takes a few seconds.

My random walk algorithm is as follow:

enter image description here

Here is my code in R:

library("igraph")
library("tictoc")

network <- read.csv("mydata", header=T) #Data consisted of two columns Source and Target
network$Source = as.factor(network$Source)#CONVERT TO FACTOR
network$Target = as.factor(network$Target)#CONVERT TO FACTOR
g <- graph_from_edgelist(as.matrix(network), directed=FALSE) 
Al <- get.adjlist(g)
n <- length(Al)
d<- matrix(, ncol=n, nrow=1)  #Degree vector
for (i in 1:n) {            
  d[i] <- length(Al[[i]])
}

tic("Simple random walk")
K <- 5000
t <- 5                #length of random walks
x <- vector("list", K) #Generate random walk
Prw <- matrix(0, nrow=n, ncol=n)#Generate Probability Matrix
for (k in 1:n) {      #For k-th node
  for (i in 1:K) {   #For i-th walker
    for (j in 2:t) { #For j-th step
      x[[i]][1] <- k   #Set starting as node k
      prev <- x[[i]][j - 1]     #Get neighbors
      x[[i]][j] <- Al[[prev]][sample(1:d[prev], 1)]  #Choose a random neighbor
    }
    Prw[k, x[[i]][t]] <- Prw[k, x[[i]][t]] + 1/K  #Plus 1/K the probability 
    #to move to where the walker stop
  }
}

Is there anyway that I can improve the speed of my algorithm?

Small sample of dataset

network <- structure(list(Source = c(828L, 884L, 856L, 889L, 719L, 861L, 
840L, 864L, 719L, 745L, 708L, 888L, 825L, 869L, 729L, 769L, 823L, 
861L, 774L, 805L, 713L, 774L, 729L, 697L, 823L), Target = c(697L, 
864L, 869L, 856L, 713L, 863L, 803L, 856L, 840L, 805L, 823L, 889L, 
889L, 774L, 888L, 869L, 840L, 729L, 769L, 800L, 819L, 856L, 876L, 
890L, 810L)), row.names = c(NA, 25L), class = "data.frame")

The link to my full data set

https://drive.google.com/file/d/1iKSTx_2Kjk701dwmueHQ_xsxsFUKngKq/view?usp=sharing

  • You should show us your implementation of these loops if you want optimization advice. – user2554330 Dec 27 '21 at 12:46
  • Thank you. Please see my code in the edited post. – Quang Dũng Đinh Dec 27 '21 at 13:09
  • 1
    You should copy and paste your codes instead of taking a screenshot. – maydin Dec 27 '21 at 13:22
  • Please work trough our tutorial on [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/a/5963610/6574038), thanks. Cheers! – jay.sf Dec 27 '21 at 13:41
  • @maydin Ah yes. My bad. Its the first time I posted on the site. – Quang Dũng Đinh Dec 27 '21 at 13:44
  • 1
    @jay.sf Thank you. I will take a look on that. – Quang Dũng Đinh Dec 27 '21 at 13:45
  • Great, this is much better! Now state `Al` since its unknown and we are not able to run your code without it, after that I think we can reopen your question. – jay.sf Dec 27 '21 at 13:52
  • @jay.sf Yes I have edited that! Thanks. – Quang Dũng Đinh Dec 27 '21 at 14:00
  • @G.Grothendieck Yes please check my edited post. Thanks! – Quang Dũng Đinh Dec 27 '21 at 14:00
  • @G.Grothendieck I have added the link to my file since it is quite long to be attached to this post. Thank you. – Quang Dũng Đinh Dec 27 '21 at 14:23
  • 2
    Normally it is expected that posts be self contained and you provide a subset of the data using `dput` right in the post itself such that it is sufficient to address the problem. or provide code that generates the input from scratch or use datasets that come with R. See the info at the top of the [tag:r] tag page. External links can become invalid in the future reducing the value of the question. – G. Grothendieck Dec 27 '21 at 14:31
  • Actually it's all explained in [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/a/5963610/6574038), especially that it's seldom needed to provide all the data which is also considered as bad practise. It can't hurt to look at this more closely. Cheers! – jay.sf Dec 27 '21 at 14:37
  • 1
    Oh my. Thanks a bunch to whoever helped added the data to my post. I appreciate that a lot. – Quang Dũng Đinh Dec 27 '21 at 15:20
  • Your sample `network` data is not working, though. – jay.sf Dec 27 '21 at 15:36
  • `cluster_walktrap` does not simulate any random walks. It operates with probabilities defined based on random walks, but it computes these probabilities exactly (and not through an explicit simulation of random walks). – Szabolcs Dec 27 '21 at 17:53
  • Note, however, that igraph does have an efficient `random_walk` function, which will very likely be much faster than any direct implementation in R. – Szabolcs Dec 27 '21 at 17:53
  • 1
    @jay.sf Sorry. The columns need to be factored too. I added the codes. – Quang Dũng Đinh Dec 28 '21 at 03:45
  • @Szabolcs Thanks. However, I don't really understand the fact that when I try with a graph of 10000 nodes, the function still operates quickly at only a few seconds (taking 5 steps for the random walk), but when I calculated the exact fifth power of the matrix, it will me a whole day. – Quang Dũng Đinh Dec 28 '21 at 03:48

0 Answers0