I have data that's a bit like this:
items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features)
Which gives a data frame like:
   items features
1      A       ab
2      A       ac
3      A       ad
4      A       ab
5      B       ab
6      B       az
7      B       ay
8      C       az
9      C       az
10     C       al
11     C       ab
12     C       ad
13     D       aa
14     D       ac
I would like to create two pairwise comparisons from the above data frame. The first is a comparison of every item with every other item that counts how many features are shared between them. The second is a comparison of every item with every other item that gives a character string of the shared features (separated by spaces, for example).
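To make the goal concrete, here is a minimal illustration of both results for a single pair (A and C), using only base R's `intersect()`. The names `a_features`, `c_features`, `shared`, `shared_count` and `shared_string` are placeholders for this example only.

```r
items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features, stringsAsFactors = FALSE)

# Features belonging to each of the two items being compared
a_features <- df$features[df$items == "A"]
c_features <- df$features[df$items == "C"]

# intersect() drops duplicates, so a repeated feature is counted once
shared <- intersect(a_features, c_features)

shared_count <- length(shared)                  # first comparison: a count
shared_string <- paste(shared, collapse = " ")  # second comparison: a string
```

For this pair, `shared_count` is 2 and `shared_string` is "ab ad".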
I have been able to do this using a loop with another loop within it to compare "A" with "B" and so on, and then "B" with all others, and so on, but this is a very slow process. The real data frame has ~2000 items in it and the compute time gets out of hand pretty quickly as the dataset grows.
The code I have used is like this:
item_list <- unique(as.character(df$items))
n_items <- length(item_list)

# Empty item-by-item tables for the counts and the shared-feature strings
feature_count <- data.frame(matrix(ncol = n_items, nrow = n_items))
colnames(feature_count) <- item_list
rownames(feature_count) <- item_list
feature_details <- data.frame(matrix(ncol = n_items, nrow = n_items))
colnames(feature_details) <- item_list
rownames(feature_details) <- item_list

for (n in seq_along(item_list)) {
  # Features of the item in column n
  item_features <- as.character(df$features[df$items == item_list[n]])
  for (z in seq_along(item_list)) {
    # Features of the item in row z
    comparison_features <- as.character(df$features[df$items == item_list[z]])
    shared <- intersect(item_features, comparison_features)
    feature_count[z, n] <- length(shared)
    if (length(shared) == 0) {
      feature_details[z, n] <- NA
    } else {
      feature_details[z, n] <- paste(shared, collapse = " ")
    }
  }
}

# Blank out the self-comparisons once, after both loops have finished
diag(feature_count) <- 0
diag(feature_details) <- NA
It returns two data frames like this:
feature_count
  A B C D
A 0 1 2 1
B 1 0 2 0
C 2 2 0 0
D 1 0 0 0
feature_details
       A     B     C    D
A   <NA>    ab ab ad   ac
B     ab  <NA> az ab <NA>
C  ab ad ab az  <NA> <NA>
D     ac  <NA>  <NA> <NA>
The above seems like an inelegant and inefficient way of doing this. Could anyone offer any advice on a simpler approach to achieve the same thing that will make working with much, much larger datasets more doable?
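One direction that might help, offered as a rough sketch rather than a tested solution for the full ~2000-item data: build an item-by-feature incidence matrix with `table()` and get all the counts in one step with `tcrossprod()`, then build the strings from pre-split feature vectors instead of subsetting the data frame inside the loops. The names `df_unique`, `incidence`, `feature_sets` and `shared_string` are made up for this example, and the string step still visits every pair, so it is an assumption that it will be fast enough at scale.

```r
items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features, stringsAsFactors = FALSE)

# Keep each item/feature pair once so repeated features are not double counted
df_unique <- unique(df)

# Item-by-feature incidence matrix: 1 if the item has the feature, else 0
incidence <- unclass(table(df_unique$items, df_unique$features))

# Counts of shared features for every pair of items in one matrix product
feature_count <- tcrossprod(incidence)
diag(feature_count) <- 0

# Split the features by item once, up front
feature_sets <- split(df_unique$features, df_unique$items)
item_list <- names(feature_sets)

# Shared features for one (row, column) pair; the column item's features go
# first, which mirrors the ordering produced by the nested-loop version
shared_string <- function(row_item, col_item) {
  shared <- intersect(feature_sets[[col_item]], feature_sets[[row_item]])
  if (length(shared) == 0) NA_character_ else paste(shared, collapse = " ")
}

# outer() needs a vectorised function, hence the Vectorize() wrapper
feature_details <- outer(item_list, item_list, Vectorize(shared_string))
dimnames(feature_details) <- list(item_list, item_list)
diag(feature_details) <- NA
```

Both results here are matrices rather than data frames; `as.data.frame()` can convert them if a data frame is needed.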