R - Fastest way to make an edgelist from data.frame in long format

Question

I'm looking for a fast and scalable solution to coerce a massive data.frame from a long format to an edgelist in R.

Consider the following data.frame:

df1 <- data.frame(ID=c("A1", "A1", "A1", "B1", "B1", "B1"),
              score=c(3,4,5,3,6,5))

> df1
  ID score
1 A1     3
2 A1     4
3 A1     5
4 B1     3
5 B1     6
6 B1     5

The outcome should be an edgelist in which each value of score is a node, and two nodes are tied when they are held by the same ID.

The original df1 has roughly 30 million observations from which an edgelist needs to be calculated frequently.

Is this 'just' combinations of two by group? If so, see a possible duplicate: [Faster version of combn](https://stackoverflow.com/questions/26828301/faster-version-of-combn) – Henrik, Nov 18 '18 at 05:02

score 2 · Accepted Answer · answered Nov 18 '18 at 04:48

A popular (and efficient) tool for "large-ish" data is data.table:

library('data.table')
DT <- as.data.table(df1)
# all pairs of scores within each ID; then drop the ID column and duplicate pairs
unique(DT[, as.data.frame(t(combn(score, 2))), by = "ID"][, ID := NULL])
#    V1 V2
# 1:  3  4
# 2:  3  5
# 3:  4  5
# 4:  3  6
# 5:  6  5
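
To make the logic of the one-liner easier to follow, here is a minimal base R sketch of the same idea: split the scores by ID, build every pair within each group, then stack and de-duplicate. It is meant to illustrate the steps, not to compete on speed with data.table at 30 million rows. The names `pairs_by_id` and `edges` are only illustrative, and the length check is an added assumption so that an ID with a single score contributes no edges.

```r
# Base R sketch of "pairs within each ID" (no packages needed).
df1 <- data.frame(ID = c("A1", "A1", "A1", "B1", "B1", "B1"),
                  score = c(3, 4, 5, 3, 6, 5))

# Split the scores by ID, then build all 2-combinations per group.
pairs_by_id <- lapply(split(df1$score, df1$ID), function(s) {
  if (length(s) < 2) return(NULL)   # a single score cannot form an edge
  t(combn(s, 2))                    # one row per pair
})

# Stack the per-group matrices and drop repeated pairs.
edges <- unique(do.call(rbind, pairs_by_id))
colnames(edges) <- c("V1", "V2")
edges
```

On the example data this gives the same five pairs as the data.table version. Note that `unique()` treats (6, 5) and (5, 6) as different rows, so pairs are only de-duplicated when they appear in the same order.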

r2evans

This code works great for the example data. However, in the original data I get the following error: "negative length vectors are not allowed". Does this, perhaps, correspond to the fact that some `ID` have only one score? – wake_wake Nov 18 '18 at 05:37
Likely, yes. Since those cannot contribute to edges, should they be filtered out before this process? – r2evans Nov 18 '18 at 06:00
Yes, I think so. Indeed, because they don't produce edges. – wake_wake Nov 18 '18 at 06:06
`unique(DT[,if (.N>1) as.data.frame(t(combn(score,2))), by = "ID"][,ID := NULL,])` ? – r2evans Nov 18 '18 at 06:49
This works great, @r2evans. I'm keeping this question unanswered for a bit so others can suggest an approach too. Will close it soon. Thank you! – wake_wake Nov 19 '18 at 04:34
Don't discount using Henrik's suggested `combnPrim`, as well, it looks to be drop-in compatible. – r2evans Nov 19 '18 at 14:22
Does `unique(d[d, on = .(ID, score < score), .(x = x.score, y = i.score), nomatch = 0L])` work? – Henrik Nov 19 '18 at 18:14
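
The two suggestions from the comments can be compared side by side. The sketch below assumes that `d` in Henrik's comment is the same data.table as `DT`, and adds a hypothetical single-score ID (`C1`) to exercise the `.N > 1` guard. The names `edges_combn` and `edges_join` are illustrative only.

```r
library(data.table)

df1 <- data.frame(ID = c("A1", "A1", "A1", "B1", "B1", "B1", "C1"),
                  score = c(3, 4, 5, 3, 6, 5, 7))   # C1: hypothetical single-score ID
DT <- as.data.table(df1)

# Guarded combn: groups with one row return NULL and contribute no rows.
edges_combn <- unique(
  DT[, if (.N > 1) as.data.frame(t(combn(score, 2))), by = "ID"][, ID := NULL]
)

# Non-equi self join: pair each score with every smaller score in the same ID.
edges_join <- unique(
  DT[DT, on = .(ID, score < score), .(x = x.score, y = i.score), nomatch = 0L]
)

edges_combn
edges_join
```

Both return five edges here, but they are not strictly equivalent. The join always puts the smaller score first (so B1 yields 5, 6 rather than 6, 5) and drops pairs of equal scores within an ID, whereas `combn` keeps the original row order. The guard is also worth keeping for another reason: `combn(x, 2)` interprets a single positive number `x` as `seq_len(x)`, so an unguarded single-score group can return pairs that are not in the data instead of failing.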
