You should avoid calling rbind inside a loop. Every call copies the whole accumulated dataset, so as the result grows each copy takes longer and longer to make (the total work is quadratic). I suspect this, and not the use of inner_join, is why your code is slow. The fix is to store the output of each iteration in a list and rbind all the objects in the list at once at the end.
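As a minimal sketch of that pattern (the names n_iter and value are illustrative, not from your code): pre-allocate a list, fill one slot per iteration, and bind once at the end.

```r
# Pre-allocate a list with one slot per iteration
n_iter <- 5
results <- vector("list", n_iter)
for (i in seq_len(n_iter)) {
  # Each iteration builds a small data frame; no copying of earlier results
  results[[i]] <- data.frame(iteration = i, value = i^2)
}
# Single rbind over all pieces at the end
combined <- do.call(rbind, results)
```

This does O(n) work per iteration instead of re-copying everything accumulated so far.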
There is also a faster way to get your answer: use
length(intersect(filter(df, n == i)$s, filter(df, n == k)$s))
to count the matches and avoid the join entirely, since what you are essentially computing is the number of elements in the intersection of two sets. This operation is symmetric, so you don't need to compute it twice for each pair. I would rewrite the loop as
library(dplyr)  # for filter()

place <- unique(df$n)
# One slot per unordered pair (including i == k): n * (n + 1) / 2 of them
df_answer <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i)) {
    df_answer[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      matches = length(intersect(filter(df, n == place[i])$s,
                                 filter(df, n == place[k])$s)))
    j <- j + 1
  }
}
df_answer <- do.call(rbind, df_answer)  # Convert to a single data frame
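To show the idea end to end, here is the same loop run on a hypothetical toy df (using base-R subsetting in place of dplyr::filter so the sketch is self-contained; your real data replaces this):

```r
# Toy data: two places, with overlapping sets of s values
df <- data.frame(n = c("a", "a", "b", "b", "b"),
                 s = c("x", "y", "x", "y", "z"),
                 stringsAsFactors = FALSE)

place <- unique(df$n)
df_answer <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i)) {
    df_answer[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      # df$s[df$n == place[i]] plays the role of filter(df, n == place[i])$s
      matches = length(intersect(df$s[df$n == place[i]],
                                 df$s[df$n == place[k]])))
    j <- j + 1
  }
}
df_answer <- do.call(rbind, df_answer)
```

For this input, the pair (b, a) shares the two values x and y, and the diagonal entry for b counts all three of its distinct values.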
Also note that in your original answer you don't need to create a data frame with a single row and then remove it. You can create a zero-row data frame directly:
data.frame(place1 = character(0), place2 = character(0), matches = integer(0))
You can speed things up further by skipping the case where i == k: there every element matches itself, so matches is simply the number of distinct s values for that place, length(unique(filter(df, n == place[i])$s)) (which equals nrow(filter(df, n == place[i])) when s has no duplicates within a place).
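One way to sketch that optimization (again with base-R subsetting standing in for dplyr::filter, and a hypothetical toy df): restrict the inner loop to k < i, and fill each diagonal entry directly from the count of distinct values.

```r
df <- data.frame(n = c("a", "a", "b", "b", "b"),
                 s = c("x", "y", "x", "y", "z"),
                 stringsAsFactors = FALSE)

place <- unique(df$n)
pieces <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i - 1)) {  # off-diagonal pairs only (empty when i == 1)
    pieces[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      matches = length(intersect(df$s[df$n == place[i]],
                                 df$s[df$n == place[k]])))
    j <- j + 1
  }
  # Diagonal: no intersect() needed, every distinct value matches itself
  pieces[[j]] <- data.frame(
    place1 = place[i],
    place2 = place[i],
    matches = length(unique(df$s[df$n == place[i]])))
  j <- j + 1
}
df_answer <- do.call(rbind, pieces)
```

This avoids one intersect() call per place, which matters when the number of places is large.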