
I have a data frame of 205,000+ rows formatted as follows:

df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
                 supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'))

In the actual data frame there are 6700+ unique values of project.id. I would like to create an edge list that pairs suppliers who have worked on the same project.

Desired end result for project.id = SP001:

to     from
1224   5542
1224   7741
5542   7741

So far I've tried using split to create a list by project.id and then running lapply+combn to generate all possible combinations of supplier.id within each list/group:

try.list <- split(df, df$project.id)
try.output <- lapply(try.list, function(x) combn(x$supplier.id, 2))
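One snag: combn() stops with an "n < m" error on length-1 input, so projects with a single supplier (like SP018 above) have to be dropped first, and the per-project matrices still need flattening into one edge list. Roughly (a sketch, assuming the names above):

# Guard against single-supplier projects (combn() errors when n < m)
keep <- vapply(try.list, nrow, integer(1)) >= 2
# Bind the per-project pair matrices into one two-column edge list
edge.mat <- do.call(cbind, lapply(try.list[keep],
                    function(x) combn(as.character(x$supplier.id), 2)))
edges <- setNames(as.data.frame(t(edge.mat), stringsAsFactors = FALSE),
                  c("to", "from"))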

Is there a more elegant/efficient (read "computed in less than 2hrs") way to generate something like this?

Any help would be much appreciated.

  • I'd consider the `igraph` package or something similar - see for example here: http://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together where I was doing the reverse. – thelatemail Jan 08 '16 at 05:40

3 Answers


Instead of using split and lapply, you can use the dplyr package.

df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
                 supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'),
                 stringsAsFactors = FALSE)

library(dplyr)

df %>%
  group_by(project.id) %>%
  filter(n() >= 2) %>%   # drop projects with only one supplier (combn needs >= 2)
  group_by(project.id) %>%
  do(data.frame(t(combn(.$supplier.id, 2)), stringsAsFactors = FALSE))
# Source: local data frame [4 x 3]
# Groups: project.id [2]

#   project.id    X1    X2
#        (chr) (chr) (chr)
# 1      SP001  1224  5542
# 2      SP001  1224  7741
# 3      SP001  5542  7741
# 4      SP017  1224  9122
Ven Yao
  • This is actually what I was hoping to use originally - thank you for the pointer on `filter(n()>=2)`. This solution produces the desired result in 38.46 sec - much, much better than what I was able to accomplish. – user3179350 Jan 08 '16 at 16:59
  • Surely `dplyr` has an analogous `join` similar to @42-'s answer below? I'd be very surprised if running `combn` and `data.frame` and `t` for each and every group is even remotely efficient. – thelatemail Jan 08 '16 at 21:24
  • No need to do the `group_by(project.id)` again. – osbon123 Oct 13 '17 at 06:00
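Following up on thelatemail's join suggestion in the comments above, a minimal sketch of a dplyr self-join version (keeping only pairs with supplier.id.x < supplier.id.y, so each unordered pair appears once; assumes character supplier IDs, as in the df defined in this answer):

library(dplyr)

df %>%
  inner_join(df, by = "project.id") %>%    # all ordered supplier pairs per project
  filter(supplier.id.x < supplier.id.y)    # keep each unordered pair once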

You can just merge the data frame with itself, which gives you all the Cartesian pairs within each project:

temp <- merge(df, df, by = "project.id")                   # all ordered pairs within a project
res <- temp[ temp$supplier.id.x != temp$supplier.id.y , ]  # drop self-pairs

> res

   project.id supplier.id.x supplier.id.y
2       SP001          1224          5542
3       SP001          1224          7741
4       SP001          5542          1224
6       SP001          5542          7741
7       SP001          7741          1224
8       SP001          7741          5542
11      SP017          1224          9122
12      SP017          9122          1224
IRTFM
  • Your method is faster; however, it duplicates the connections. I don't know exactly what is happening under the hood... but are your method and `get.edgelist(graph.adjacency(crossprod(table(df))))` truly equivalent? – Mario GS Mar 13 '17 at 21:14
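As the comment above notes, the merge keeps both orderings of every pair. If each unordered pair is wanted only once (as in the question's desired output), a minimal tweak is to filter with < instead of != (coercing to character in case the supplier columns are factors):

# Keep each unordered pair exactly once
res <- temp[ as.character(temp$supplier.id.x) < as.character(temp$supplier.id.y), ]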

We can try with igraph:

library(igraph)
# table(df) is the project-by-supplier incidence matrix; crossprod() turns it
# into a supplier-by-supplier co-occurrence (adjacency) matrix
m1 <- get.edgelist(graph.adjacency(crossprod(table(df))))
m1[m1[,1] != m1[,2], ]   # drop self-loops
#      [,1]   [,2]  
#[1,] "1224" "5542"
#[2,] "1224" "7741"
#[3,] "1224" "9122"
#[4,] "5542" "1224"
#[5,] "5542" "7741"
#[6,] "7741" "1224"
#[7,] "7741" "5542"
#[8,] "9122" "1224"
akrun
  • @akrun, I like your answer because it is more connected to linear algebra. However, I tried to replicate this on a larger data set with 8.5K IDs and 14.6K vertices, and it's crashing my computer every time. I was wondering if there is a more efficient method? – Mario GS Mar 13 '17 at 09:24
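Regarding the crash on larger data in the comment above: a hedged sketch of a sparse-matrix variant, assuming the dense table()/crossprod() intermediate is what exhausts memory. It uses the Matrix package and the newer igraph function names:

library(Matrix)
library(igraph)

p <- factor(df$project.id)
s <- factor(df$supplier.id)
# Sparse project-by-supplier incidence matrix instead of a dense table()
m <- sparseMatrix(i = as.integer(p), j = as.integer(s), x = 1,
                  dimnames = list(levels(p), levels(s)))
adj <- crossprod(m)                  # supplier-by-supplier co-occurrence, still sparse
g <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
as_edgelist(g)                       # one row per supplier pair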