You should avoid calling rbind inside a loop. Every call copies the whole accumulated dataset, so as the result grows each copy takes longer and longer to make (the total work is quadratic). I suspect this, and not the use of inner_join, is why your code is slow. The fix is to store the output of each iteration in a list and rbind all the objects in the list at once at the end.
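As a minimal sketch of that pattern (the names n_iter and value are illustrative, not from your code): pre-allocate a list, fill one slot per iteration, and bind once at the end.

```r
# Pre-allocate a list with one slot per iteration
n_iter <- 5
results <- vector("list", n_iter)
for (i in seq_len(n_iter)) {
  # Each iteration builds a small data frame; no copying of earlier results
  results[[i]] <- data.frame(iteration = i, value = i^2)
}
# Single rbind over all pieces at the end
combined <- do.call(rbind, results)
```

This does O(n) work per iteration instead of re-copying everything accumulated so far.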
There is also a faster way to get your answer: use
length(intersect(filter(df, n == i)$s, filter(df, n == k)$s))
to count the matches and avoid the join entirely, since what you are essentially computing is the number of elements in the intersection of two sets. This operation is symmetric, so you don't need to compute it twice for each pair. I would rewrite the loop as
library(dplyr)  # for filter()

place <- unique(df$n)
# One slot per unordered pair (including i == k): n * (n + 1) / 2 of them
df_answer <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i)) {
    df_answer[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      matches = length(intersect(filter(df, n == place[i])$s,
                                 filter(df, n == place[k])$s)))
    j <- j + 1
  }
}
df_answer <- do.call(rbind, df_answer)  # Convert to a single data frame
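To show the idea end to end, here is the same loop run on a hypothetical toy df (using base-R subsetting in place of dplyr::filter so the sketch is self-contained; your real data replaces this):

```r
# Toy data: two places, with overlapping sets of s values
df <- data.frame(n = c("a", "a", "b", "b", "b"),
                 s = c("x", "y", "x", "y", "z"),
                 stringsAsFactors = FALSE)

place <- unique(df$n)
df_answer <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i)) {
    df_answer[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      # df$s[df$n == place[i]] plays the role of filter(df, n == place[i])$s
      matches = length(intersect(df$s[df$n == place[i]],
                                 df$s[df$n == place[k]])))
    j <- j + 1
  }
}
df_answer <- do.call(rbind, df_answer)
```

For this input, the pair (b, a) shares the two values x and y, and the diagonal entry for b counts all three of its distinct values.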
Also note that in your original answer you don't need to create a data frame with a single row and then remove it. You can create a zero-row data frame directly:
data.frame(place1 = character(0), place2 = character(0), matches = integer(0))
You can speed things up further by skipping the case where i == k: there every element matches itself, so matches is simply the number of distinct s values for that place, length(unique(filter(df, n == place[i])$s)) (which equals nrow(filter(df, n == place[i])) when s has no duplicates within a place).
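One way to sketch that optimization (again with base-R subsetting standing in for dplyr::filter, and a hypothetical toy df): restrict the inner loop to k < i, and fill each diagonal entry directly from the count of distinct values.

```r
df <- data.frame(n = c("a", "a", "b", "b", "b"),
                 s = c("x", "y", "x", "y", "z"),
                 stringsAsFactors = FALSE)

place <- unique(df$n)
pieces <- vector("list", length(place) * (length(place) + 1) / 2)
j <- 1
for (i in seq_along(place)) {
  for (k in seq_len(i - 1)) {  # off-diagonal pairs only (empty when i == 1)
    pieces[[j]] <- data.frame(
      place1 = place[i],
      place2 = place[k],
      matches = length(intersect(df$s[df$n == place[i]],
                                 df$s[df$n == place[k]])))
    j <- j + 1
  }
  # Diagonal: no intersect() needed, every distinct value matches itself
  pieces[[j]] <- data.frame(
    place1 = place[i],
    place2 = place[i],
    matches = length(unique(df$s[df$n == place[i]])))
  j <- j + 1
}
df_answer <- do.call(rbind, pieces)
```

This avoids one intersect() call per place, which matters when the number of places is large.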