
I'm looking for a fast and scalable solution to coerce a massive data.frame from a long format to an edgelist in R.

Consider the following data.frame:

df1 <- data.frame(ID=c("A1", "A1", "A1", "B1", "B1", "B1"),
              score=c(3,4,5,3,6,5))

> df1
  ID score
1 A1     3
2 A1     4
3 A1     5
4 B1     3
5 B1     6
6 B1     5

The outcome should look like this. Note that the elements in score become nodes that are linked with ties if they are held by the same ID.

> el
  X Y
1 3 4
2 3 5
3 4 5
4 3 6
5 6 5

The original df1 has roughly 30 million observations from which an edgelist needs to be calculated frequently.

Henrik
wake_wake

1 Answer


A popular (and efficient) tool for "large-ish" data is data.table:

library(data.table)
DT <- as.data.table(df1)
# Build every pair of scores within each ID, then drop ID and remove duplicate pairs
unique(DT[, as.data.frame(t(combn(score, 2))), by = "ID"][, ID := NULL])
#    V1 V2
# 1:  3  4
# 2:  3  5
# 3:  4  5
# 4:  3  6
# 5:  6  5
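
To make the per-ID step easier to follow, here is a minimal base R sketch of what `combn` contributes for a single ID. The vector `scores` is just the three scores of `A1` from the question, pulled out by hand for illustration.

```r
# Scores held by ID "A1" in the example data
scores <- c(3, 4, 5)

# combn() returns one column per unordered pair; t() turns that into one row per edge
pairs <- t(combn(scores, 2))
pairs
#      [,1] [,2]
# [1,]    3    4
# [2,]    3    5
# [3,]    4    5
```

An ID holding n scores yields n * (n - 1) / 2 rows, so a few IDs with many scores can dominate both run time and memory.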
r2evans
  • This code works great for the example data. However, in the original data I get the following error: "negative length vectors are not allowed". Does this, perhaps, correspond to the fact that some `ID` have only one score? – wake_wake Nov 18 '18 at 05:37
  • Likely, yes. Since those cannot contribute to edges, should they be filtered out before this process? – r2evans Nov 18 '18 at 06:00
  • Yes, I think so. Indeed, because they don't produce edges. – wake_wake Nov 18 '18 at 06:06
  • `unique(DT[,if (.N>1) as.data.frame(t(combn(score,2))), by = "ID"][,ID := NULL,])` ? – r2evans Nov 18 '18 at 06:49
  • This works great, @r2evans. I'm keeping this question unanswered for a bit so others can suggest an approach too. Will close it soon. Thank you! – wake_wake Nov 19 '18 at 04:34
  • Don't discount Henrik's suggested `combnPrim` either; it looks to be drop-in compatible. – r2evans Nov 19 '18 at 14:22
  • Does `unique(d[d, on = .(ID, score < score), .(x = x.score, y = i.score), nomatch = 0L])` work? – Henrik Nov 19 '18 at 18:14
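
The two variants from the comments can be put side by side in a small sketch. It assumes `d` in the last comment stands for the data.table itself (called `DT` here), and it adds a hypothetical ID `C1` with a single score to exercise the `.N > 1` guard. The guard matters for more than the error: `combn(x, 2)` treats a single positive number `x` as `seq_len(x)`, so an unguarded singleton group can either fail or produce pairs that are not in the data.

```r
library(data.table)

# Example data from the question plus a hypothetical ID "C1" holding one score
DT <- data.table(ID    = c("A1", "A1", "A1", "B1", "B1", "B1", "C1"),
                 score = c(3, 4, 5, 3, 6, 5, 7))

# Guarded combn(): groups with fewer than two rows return NULL and are skipped
el_combn <- unique(
  DT[, if (.N > 1) as.data.frame(t(combn(score, 2))), by = ID][, ID := NULL]
)

# Non-equi self-join: pair each score with every strictly smaller score of the same ID
el_join <- unique(
  DT[DT, on = .(ID, score < score),
     .(x = x.score, y = i.score), nomatch = 0L]
)
```

On this data both results describe the same five undirected edges, but the orientation differs: the join always puts the smaller score in `x`, while `combn` keeps the order in which scores appear, so `unique()` on the `combn` result would not merge a pair stored once as (5, 6) and once as (6, 5). The strict `<` in the join also drops pairs of equal scores within an ID, which `combn` would keep. Which of the two scales better on 30 million rows is something to benchmark on the real data.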