Good evening,

I am trying to analyse the aforementioned data (edgelist or Pajek format). My first thought was R with the igraph package, but my memory limit (6 GB) won't do the trick. Will a 128 GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?

Thanks in advance.

P.S.: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.

Giannis H.
  • When you say "analyse" can you be more specific as to what you are trying to do? – ose Mar 10 '12 at 13:13
  • Of course. I want to calculate degrees (in, out, total), which I will use to plot distributions. I also want to be able to move nodes and edges from the large graph to smaller graphs (sampling processes), where add.vertices and add.edges from igraph come in very handy. – Giannis H. Mar 10 '12 at 15:14
  • What is the format of the data? Is the edgelist alone 60 GB? (That is, is it a text file where each row contains two numbers representing the sender and receiver of a single edge?) – Christopher DuBois Mar 11 '12 at 07:41
  • Yes, it's an edgelist txt file and each row contains two IDs, representing the directed edge from 1st ID to the 2nd ID. – Giannis H. Mar 11 '12 at 12:43

1 Answer

If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that

  1. your R objects are file backed so that you aren't limited by RAM
  2. you can parallelize the degree computation using foreach

Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.

set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
                  sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
write.table(edgelist,file="edgelist-small.csv",sep=",",
            row.names=FALSE,col.names=FALSE)

I next concatenate this file 10 times to make the example a bit bigger.

system("
for i in $(seq 1 10) 
do 
  cat edgelist-small.csv >> edgelist.csv 
done")
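
The shell loop above assumes a Unix-like shell, and rerunning it keeps appending to an existing edgelist.csv. As a portable alternative, here is a minimal base-R sketch of the same concatenation. The helper name repeat_file is made up for illustration; file.create() and file.append() are base R.

```r
# Hypothetical helper (not from the answer): write `times` copies of `src`
# into `dest` using only base R, so it also works without a Unix shell.
repeat_file <- function(src, dest, times = 10) {
  file.create(dest)                 # start from an empty file (truncates on rerun)
  for (i in seq_len(times)) {
    ok <- file.append(dest, src)    # append one full copy of src to dest
    if (!ok) stop("could not append ", src, " to ", dest)
  }
  invisible(dest)
}

# Usage on a tiny throwaway file so the sketch runs on its own
src  <- tempfile(fileext = ".csv")
dest <- tempfile(fileext = ".csv")
writeLines(c("1,2", "2,3"), src)
repeat_file(src, dest, times = 3)
length(readLines(dest))  # 6 lines: 3 copies of a 2-line file
```

For the example in this answer, the call would presumably be repeat_file("edgelist-small.csv", "edgelist.csv", times = 10).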

Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix(), provided by the bigmemory package that bigtabulate builds on, creates a file-backed object in R.

library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE,
                     type = "integer", sep = ",",
                     backingfile = "edgelist.bin",
                     descriptorfile = "edgelist.desc")  # full argument name
nrow(x)  # 1e7 as expected
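
As an aside on the "no whole graph in RAM" requirement from the question: the same goal can also be approached without any extra package by streaming the file in chunks and keeping only the running tallies. The following is a rough base-R sketch, not part of the bigtabulate approach. The helper name count_ids_chunked and its parameters are assumptions made for illustration, and it will be much slower than bigtable() on a file of this size.

```r
# Hypothetical helper: tally how often each ID occurs in one column of a
# headerless, comma-separated edgelist, reading `chunk_size` lines at a time.
# Memory use grows with the number of distinct IDs, not the number of edges.
count_ids_chunked <- function(path, column = 1, chunk_size = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- integer(0)                       # named vector: names are node IDs
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break            # end of file reached
    fields <- strsplit(lines, ",", fixed = TRUE)
    ids <- vapply(fields, function(f) f[column], character(1))
    chunk <- table(ids)                      # tallies for this chunk only
    nm <- names(chunk)
    seen <- nm %in% names(counts)
    # add to IDs seen in earlier chunks, then register the new IDs
    counts[nm[seen]]  <- counts[nm[seen]] + as.integer(chunk[seen])
    counts[nm[!seen]] <- as.integer(chunk[!seen])
  }
  counts
}

# Usage on a tiny made-up edgelist
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2", "1,3", "2,3"), tmp)
count_ids_chunked(tmp, column = 1, chunk_size = 2)  # outdegree tallies
count_ids_chunked(tmp, column = 2, chunk_size = 2)  # indegree tallies
```

Column 1 gives outdegree tallies and column 2 gives indegree tallies, matching the sender/receiver layout described in the question.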

We can compute the outdegrees by using bigtable() on the first column.

outdegree <- bigtable(x,1)
head(outdegree)

Quick sanity check to make sure table is working as expected:

# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1]))  # get name of first node
all.equal(as.numeric(outdegree[1]),   # outdegree's answer
          sum(x[,1]==j))              # manual outdegree count

To get indegree, just do bigtable(x,2).
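
Since the stated goal is to plot degree distributions, the per-node counts still have to be combined into total degree and then tallied once more. Below is a small base-R sketch on made-up toy data, with table() standing in for bigtable(). The helper name degree_distribution is hypothetical. One caveat that this sketch handles explicitly: a node that never appears as a sender has no entry in the outdegree table (and likewise for receivers), so both tables are aligned on the full node set first.

```r
# Toy directed edgelist (made-up data): column 1 = sender, column 2 = receiver.
edges <- cbind(sender   = c(1, 1, 2, 3, 3, 3),
               receiver = c(2, 3, 3, 1, 2, 4))

# Use a common set of factor levels so nodes with zero in- or outdegree
# still get an entry; table() stands in for bigtable() here.
nodes  <- sort(unique(as.vector(edges)))
outdeg <- table(factor(edges[, "sender"],   levels = nodes))
indeg  <- table(factor(edges[, "receiver"], levels = nodes))
total  <- outdeg + indeg                     # total degree per node

# Hypothetical helper: turn per-node degrees into a degree distribution,
# i.e. how many nodes have each degree value.
degree_distribution <- function(deg) {
  freq <- table(as.vector(deg))
  data.frame(degree = as.integer(names(freq)),
             nodes  = as.integer(freq))
}

degree_distribution(total)
```

The resulting data frame can be passed to plot(), for example with degree on the x axis and nodes on the y axis. On log-scaled axes, nodes of degree 0 would have to be dropped first.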

Christopher DuBois
  • So if I get it right, we are moving the problem to matrix computation. I like that. Please provide an example. – Giannis H. Mar 11 '12 at 12:51
  • Seems easy and scalable for degree computation. Can it handle graph manipulation, such as adding and subtracting nodes and edges? I must read their documentation. Thanks for posting, Christopher. – Giannis H. Mar 12 '12 at 15:49